Detection of Suicidal Tweets Based on Naïve Bayes Algorithm
Norlina Mohd Sabri*, Noor Alisa Mohamad
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Cawangan Terengganu, Kuala Terengganu, Terengganu, Malaysia
*Corresponding Author: [email protected]
Accepted: 15 October 2022 | Published: 1 November 2022
DOI: https://doi.org/10.55057/ijarti.2022.4.3.6
_________________________________________________________________________________________
Abstract: The pandemic has created a trend where most people have become addicted to social media to express thoughts and opinions, share reviews and share their daily lives online. This has contributed to the increasing volume of data on social media platforms. As a result, social media data have been analyzed for the benefit of businesses and organizations. Various sentiment analysis studies have been conducted to obtain public opinion, for example on products and on daily situations. Twitter has become one of the most visited social media platforms due to its convenience and the ease of posting messages. This research proposes to analyze Twitter data for the detection of suicidal intention. The aim is to help the many depressed people who have been affected by the Covid-19 pandemic by analyzing their opinions and thoughts online. Identifying suicidal behaviour in tweets can be the first step in suicide prevention. The objective of the research is to explore the Naïve Bayes algorithm’s capability to detect suicide ideation tweets. The Twitter data were scraped during Malaysia’s pandemic lockdown in May 2021 using the Tweepy library. A total of 5439 tweets were scraped based on the keywords “stress”, “anxiety”, “depression” and “suicide”. The evaluation results show that the algorithm produced good and acceptable performance in detecting the suicidal tweets, with 80.39% accuracy. The results also show that more people were depressed during the pandemic, especially during the lockdown. Future work would be to compare the performance of Naïve Bayes with other well-known classifiers and also to consider scraping and processing non-English words from other social media platforms such as Facebook and Instagram.
Keywords: detection, suicide, tweets, Naïve Bayes
_________________________________________________________________________
1. Introduction
Nowadays the world has become more digital due to the Covid-19 pandemic. Internet usage has become a necessity and thus, the sharing of opinions through social media platforms has become a trend. Twitter has become one of the most visited social media platforms due to its convenience and the ease of sharing opinions (Mahasiriakalayot et al., 2021). Many people use Twitter as the place to express their feelings or share their thoughts. Due to the availability of a large amount of data from Twitter, various studies on sentiment analysis have been conducted. Sentiment analysis has become a trending topic and has been reported as an effective way to judge people’s opinions, attitudes and emotions from Twitter data (Alsalman, 2020). Since the pandemic, society has been encouraged to use the internet, especially during the lockdowns. People from all walks of life have turned to social media sites to express their thoughts and deepest struggles. Individuals with mental health problems, especially teenagers, tend to share their thoughts of suicide on social media (Mahasiriakalayot et al., 2021). There are many cases where the victims have committed suicide after writing their final thoughts on Twitter and other online communities (Burdisso et al., 2019). This large amount of data on people’s feelings and behaviours on social media could be used for the early detection of at-risk individuals and may help to prevent tragic deaths.
The pandemic has had negative effects on mental health due to the many changes and difficulties in people’s livelihoods. Due to the growing number of mental health cases, this research proposes to analyse Twitter data for the detection of suicidal intention. The machine learning algorithm Naïve Bayes is proposed for the detection of the suicidal tweets. Machine learning techniques have been able to solve sentiment classification problems with high accuracies (Kancharapu et al., 2022; Sakib et al., 2021; Patel & Soni, 2021).
It has been reported that Naïve Bayes processes categorical input variables such as tweet data better than numerical variables (Raghuwanshi & Pawar, 2017). The objective of the research is to explore the Naïve Bayes algorithm’s capability in solving this sentiment classification problem. It is expected that the algorithm could produce good performance in the detection of the suicidal tweets. Various other algorithms can be applied to classify text, and Naïve Bayes is one of the best among them. Naïve Bayes has been studied broadly since the 1950s and currently remains a standard method for text categorization (Mehra et al., 2017). The paper is arranged into 5 main sections, which are the Introduction, Literature Review, Research Methodology, Results and Discussions and finally the Conclusion section.
2. Literature Review
2.1 Similar Works
Various machine learning algorithms have been adopted for suicide ideation detection in Twitter. Some recent works have implemented deep learning algorithms such as Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) (Hinduja et al., 2022; Figuerêdo et al., 2022; Kancharapu et al., 2022). Other algorithms that have been adopted include Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbour (KNN), Decision Tree (DT), Naïve Bayes (NB), XGBoost and AdaBoost (Chatterjee et al., 2022; Sakib et al., 2021). Table 1 summarizes the recent similar works, listing the algorithm, objective and results of each study. The algorithms that have generated the highest accuracies include Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN) and Logistic Regression (LR).
Based on the table, it can be seen that the performance of the algorithms differs in each study. This is due to the different parameter settings and the different datasets used in each experiment. Naïve Bayes, which is proposed in this research, is not the best algorithm in the recent similar works. However, the algorithm is still being adopted by researchers and has proven capable of generating good performance in other sentiment classification problems (Isnain et al., 2021; Dewi et al., 2020). Besides, the exploration of the Naïve Bayes algorithm in another classification problem is always an open problem.
Table 1: Summary of Similar Works
| No. | Algorithm | Objective | Result | Reference |
| 1 | Long Short-Term Memory (LSTM) | To monitor mental health for early detection of problems | The accuracy (84.31%) was higher than SVM (75.81%) and Random Forest (81.21%) | (Hinduja et al., 2022) |
| 2 | Convolutional Neural Networks (CNN) | To improve early detection of depression in social media | CNN achieved better results compared to the literature | (Figuerêdo et al., 2022) |
| 3 | Long Short-Term Memory (LSTM) | To identify suicidal tweets | LSTM achieved the highest accuracy (87%) compared to NB (65%), SVM (79%) and KNN (86%) | (Kancharapu et al., 2022) |
| 4 | Logistic Regression (LR) | To examine tweets and discover features of suicide ideation | Highest accuracy achieved by LR (87%) compared to Random Forest (70%), SVM (74%) and XGBoost (77%) | (Chatterjee et al., 2022) |
| 5 | Ensemble Methods (AdaBoost, CatBoost, XGBoost, Gradient Boost, Bagging, and Voting Classifier) | To detect suicidal ideation tweets | Ensemble method achieved the highest accuracy (95.7%) compared to Decision Tree (88.5%), SVM (86.3%), AdaBoost (90.3%) and NB (89.9%) | (Sakib et al., 2021) |
2.2 Naïve Bayes
The Naïve Bayes algorithm is a statistical classification algorithm based on Bayes’ theorem. The classifier works by finding the conditional probability of each individual event. There are three popular Naïve Bayes classifiers, which are the Gaussian, Multinomial and Bernoulli classifiers. In this research, the Multinomial Naïve Bayes classifier has been chosen due to its suitability for text classification and its proven ability to achieve high prediction accuracy. The multinomial model is based on the number of times a term occurs in a document. The model applies the probability rule to text classification. The standard Naïve Bayes formula is represented by Equation 1.
$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$ (Equation 1)

Based on Equation 1, P(x|y) is the posterior probability of the class, where x is the target and y is the attribute. P(y|x) is the probability of the predictor given the class, where y is the attribute and x is the target. P(x) is the prior probability of the class and P(y) is the prior probability of the predictor. The Naïve Bayes result is calculated by finding the posterior probability P(x|y). It is obtained from the product of the likelihood P(y|x) and the class prior probability P(x), divided by the predictor prior probability P(y).
Naïve Bayes is simple and performs well in many real-world problems such as document categorization, e-mail filtering and spam detection (Angeles et al., 2021). Among its other advantages, Naïve Bayes is computationally efficient and can maintain good performance when facing noise and missing data (V & Samuel, 2022). The algorithm also has some disadvantages, such as the ‘zero-frequency problem’. It occurs when a term in the test data set was never observed with a given class in the training data, so the algorithm assigns it a zero conditional probability. Because the probabilities are multiplied, this results in ‘zero’ for the whole calculation of the formula. However, this problem can be overcome by a smoothing technique that pushes the likelihood towards a value greater than zero. Even with some problems, it is worth exploring the algorithm’s capability in solving this sentiment classification problem, given its many previously reported good results.
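The smoothing idea can be written out explicitly. The expression below is the standard additive (Laplace) estimate for the multinomial model, given only as an illustration: the symbols count(w, c), α and |V| are introduced here and do not come from the paper, which does not state which smoothing variant was used.

```latex
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + \alpha}{\sum_{w' \in V} \mathrm{count}(w', c) + \alpha\,|V|}, \qquad \alpha > 0
```

Here count(w, c) is the number of times term w occurs in training tweets of class c, V is the vocabulary and α is the smoothing constant. With α = 1, a term never seen with class c receives a small non-zero probability instead of 0, so the product in Equation 1 no longer collapses to zero.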
3. Research Methodology
3.1 Data Collection
The dataset for this research was collected through the Twitter API using the Tweepy library during Malaysia’s second lockdown in May 2021. The tweets were scraped in real time and were all in the English language. Authentication from Twitter was needed before the data scraping, and four keys were required, which are the consumer key, consumer secret, access key and access secret. The data can be extracted from Twitter at any time but extraction is limited to 60 requests per minute. In this research, a total of 5439 tweets were extracted from worldwide tweets. The tweets were extracted by searching for the keywords “stress”, “anxiety”, “depression” and “suicide”. These keywords were chosen based on the previous literature, which identified them as representing suicidal thoughts in tweets (Madhu, 2018). Table 2 shows an example of the raw dataset.
Table 2: Example of Raw Dataset
| Created_at | tweets | username | location |
| 2021-05-01 14:02:11 | b'You blowing up my phone talking all this mess, 30 missed calls tryna make me stress' | b'jessiesimmonsxo' | b'Ottawa, Ontario' |
| 2021-05-01 14:02:10 | b"@smaIIshaq @XpressCAS @dejiimole @heisrema Rich depression :I have it all what's next? U begin... | b'ellamsgeorge1' | b'Lagos, Nigeria' |
| 2021-05-01 14:02:10 | b'@EdwinandHubble @nortylucy @Disneyspaniel Oh Ed, you are right. It must be all the end of semester stress that got\xe2\x80\xa6 https://t.co/6FTBZBL8aA' | b'sjnwoh' | b'Ohio' |
| 2021-05-01 14:02:10 | b'i cannot stress this enough lol https://t.co/knuLTEeC2T' | b'shilpalazarus' | "b""where i'm meant to be.""" |
| 2021-05-01 14:02:07 | b'The greatest weapon against stress is our ability to choose one thought over another.\nWilliam James - 1842-1910 - American Philosopher' | b'thejourneyrun' | "b'Kuala Lumpur, MY'" |
3.2 Data Pre-processing
Data pre-processing is an important data cleaning phase before the data could later be used and analysed. During this phase, any unwanted words in the tweets should be removed. The data pre-processing steps involve the removal of URL and unwanted characters, tokenization and stop word removal, stemming and lemmatization.
Removal of URL and unwanted characters
In this phase, URLs are discarded along with symbols and other unwanted characters. Hashtags, mentions and retweets must also be removed. Noise of this kind has no significant effect on the sentiment, but removing it improves the efficiency of the algorithm. If it is left in, it enlarges the term frequency dictionary and more space is needed to save the file.
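The cleaning step described above can be sketched with regular expressions. This is a minimal illustration and not the authors' code: the function name `clean_tweet`, the exact patterns and the lowercasing are assumptions, and the sample string combines a tweet from Table 2 with a made-up mention and hashtag.

```python
import re

def clean_tweet(text: str) -> str:
    """Strip retweet markers, URLs, mentions, hashtags and non-letter characters."""
    text = re.sub(r"^RT\s+", "", text)          # leading retweet marker
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # digits, punctuation, symbols
    # Collapse repeated whitespace and lowercase (lowercasing is an assumption).
    return re.sub(r"\s+", " ", text).strip().lower()

raw = "RT @user: i cannot stress this enough lol https://t.co/knuLTEeC2T #mood"
print(clean_tweet(raw))  # i cannot stress this enough lol
```

Removing the whole hashtag, rather than only the `#` symbol, follows the description in the text; keeping the hashtag word would be an equally simple variant.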
Tokenization and Stop Word Removal
Tokenization is the process that splits sentences into tokens. Stop words are redundant words in the tweets whose removal does not affect the meaning. Stop word removal is applied after the tokenization process. Stop words have to be eliminated from the sentences because they serve no purpose in the dataset.
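As a rough sketch of these two steps, the snippet below splits a cleaned tweet on whitespace and filters it against a stop word list. The tiny `STOP_WORDS` set and both function names are assumptions made for illustration; the paper does not name the tokenizer or the stop word list that was used, and a real list would be much longer.

```python
from typing import List

# Tiny illustrative stop word list (an assumption); a real list is much longer.
STOP_WORDS = {"i", "me", "my", "this", "the", "is", "a", "an", "and", "of", "to", "it", "all"}

def tokenize(text: str) -> List[str]:
    """Split a cleaned tweet into lowercase unigram tokens on whitespace."""
    return text.lower().split()

def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("i cannot stress this enough lol")
print(remove_stop_words(tokens))  # ['cannot', 'stress', 'enough', 'lol']
```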
Stemming and Lemmatization
The last step is to convert the words into their root form, which is called the stemming process. Stemming reduces each word to its linguistic root. After stemming, lemmatization is performed to convert the word to its meaningful base form, called the lemma. Applying lemmatization after stemming produces better output words. In this research, only the adjectives, verbs and nouns of the tweets are lemmatized. This is because pronouns, conjunctions, prepositions and interjections are in the list of stop words. The data cleaning process is complete after the lemmatization. Figure 1 displays an example of the cleaned text, tokenized tweet, and stemmed and lemmatized tweets.
Figure 1: Example of stemmed and lemmatized tweets
3.3 Data Labelling
In this research, the Python library TextBlob has been used to label the sentiment of the dataset. TextBlob provides a simple API for basic natural language processing (NLP) tasks, especially classification. Datasets labelled with TextBlob have been reported to give higher accuracy when applied in a classifier (Hasan et al., 2018). Polarity and subjectivity values are used to label the data. If the polarity of a tweet is more than 0, it is classified as Positive (1). If the polarity is less than 0, it is classified as Negative (0). In this research, the Positive label denotes the non-suicidal tweets, while the Negative label denotes the suicidal tweets. A total of 3590 tweets were labelled as Negative, while the other 1849 were tagged as Positive.
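The labelling rule above can be expressed as a small function. This is a sketch under stated assumptions: the function name `label_from_polarity` is hypothetical, and since the text does not say how a polarity of exactly 0 was handled, that case is left undecided here and returns `None`.

```python
from typing import Optional

def label_from_polarity(polarity: float) -> Optional[int]:
    """Map a polarity score in [-1, 1] to the paper's labels.

    1 = Positive (non-suicidal), 0 = Negative (suicidal).
    The text does not say how a polarity of exactly 0 was handled,
    so this sketch returns None for that case (an assumption).
    """
    if polarity > 0:
        return 1
    if polarity < 0:
        return 0
    return None

# With TextBlob installed, the score would come from something like:
#   from textblob import TextBlob
#   polarity = TextBlob(cleaned_tweet).sentiment.polarity
for score in (0.5, -0.3, 0.0):
    print(score, label_from_polarity(score))
```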
3.4 Feature Extraction
In the feature extraction phase, the term frequency refers to the number of times each word appears in the tweets of each sentiment class. Each tweet is tokenized into unigram tokens. A helper function has been used for tokenizing the newly processed tweet. Next, the tokens are counted to create a word frequency dictionary. In this research, the Bag of Words technique has been used for the feature extraction. Figure 2 shows the term frequency dictionary of this research. The term frequency will be used by the Naïve Bayes algorithm for the classification process.
Figure 2: Term Frequency Dictionary
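A per-class term frequency dictionary of the kind described in this section might be built as follows. The function name `build_term_frequencies`, the data layout and the sample tokens are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def build_term_frequencies(
    labelled_tweets: Iterable[Tuple[List[str], int]]
) -> Dict[int, Counter]:
    """Count how often each unigram token occurs within each sentiment class."""
    freqs: Dict[int, Counter] = {0: Counter(), 1: Counter()}
    for tokens, label in labelled_tweets:
        freqs[label].update(tokens)
    return freqs

# Made-up token lists for illustration; 0 = Negative, 1 = Positive.
sample = [
    (["stress", "exam", "stress"], 0),
    (["feel", "hopeless"], 0),
    (["great", "day"], 1),
]
freqs = build_term_frequencies(sample)
print(freqs[0]["stress"])  # 2
print(freqs[1]["stress"])  # 0
```

A `Counter` returns 0 for unseen terms, which is exactly the situation that makes smoothing necessary at classification time.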
3.5 System Architecture
Figure 3 represents the proposed system architecture for the research. The research begins with the data collection, which is the scraping of the Twitter data. Then, all the extracted tweets are stored in a csv file and undergo the data pre-processing steps. The steps include data cleaning, tokenization, removal of stop words, stemming and lastly lemmatization. After the data cleaning process is done, the data are labelled using TextBlob. After that, the term frequency of each word is calculated in the feature extraction process. In this phase, the Bag of Words technique is used and the data are split into training and testing datasets. In the training and testing datasets, the prior probability is computed for each class. Next, the conditional probability is computed for every term that exists in the Term Frequency Dictionary. After that, the posterior probability is calculated for each string of terms. Performance evaluation is applied on the training and testing datasets to test the classifier model’s accuracy. As for the model prototype, the user inputs raw tweets or texts through the user interface. The classifier model classifies the input data and displays whether the result is a Positive or Negative sentiment.
Figure 3: Proposed System Architecture
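The prior, conditional and posterior steps of the architecture can be sketched as a small multinomial Naïve Bayes written from scratch. This is a minimal illustration under several assumptions: the function names `train_nb` and `predict_nb`, the add-one smoothing constant `alpha`, the use of log probabilities, and the toy training data are all introduced here and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def train_nb(data: List[Tuple[List[str], int]], alpha: float = 1.0):
    """Estimate log priors and smoothed log likelihoods from (tokens, label) pairs."""
    class_docs = Counter(label for _, label in data)
    term_counts: Dict[int, Counter] = {c: Counter() for c in class_docs}
    for tokens, label in data:
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}

    # Prior probability of each class.
    log_prior = {c: math.log(n / len(data)) for c, n in class_docs.items()}
    # Conditional probability of each term given the class, with additive smoothing.
    log_like = {}
    for c, counts in term_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return log_prior, log_like, vocab

def predict_nb(tokens: List[str], model) -> int:
    """Return the class with the highest (unnormalised) log posterior."""
    log_prior, log_like, vocab = model
    scores = {}
    for c in log_prior:
        # Terms outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(log_like[c][t] for t in tokens if t in vocab)
    return max(scores, key=scores.get)

# Made-up training data; 0 = Negative, 1 = Positive.
train = [
    (["want", "die", "hopeless"], 0),
    (["tired", "life", "hopeless"], 0),
    (["happy", "life", "good"], 1),
    (["good", "day", "friend"], 1),
]
model = train_nb(train)
print(predict_nb(["hopeless", "tired"], model))  # 0
print(predict_nb(["good", "friend"], model))     # 1
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on longer tweets; the denominator P(y) of Equation 1 is omitted because it is the same for both classes and does not change which class scores highest.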
3.6 Performance Measurement
The performance measurement for the Naïve Bayes classifier is based on the confusion matrix. Figure 4 shows the confusion matrix formulas, which are built from the True Positive, False Positive, True Negative and False Negative counts. True Positive is for correctly predicted event values, while False Positive is for incorrectly predicted event values. True Negative is for correctly predicted no-event values, while False Negative is for incorrectly predicted no-event values. In the evaluation of the model, accuracy, precision, recall and F1-score are used as the performance metrics. PR, RE, CA and F1 in Figure 4 are the precision, recall, classification accuracy and F1-score respectively. Precision is the proportion of predicted positives that are truly positive, so it reflects the number of false positives, while recall is the proportion of actual positives that are correctly identified. Recall can also be used to select the best model when the number of false negatives is high. The F1-score measures the balance between precision and recall.
Figure 4: Confusion matrix formula (Bittrich et al., 2019)
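The four metrics can be computed directly from the confusion matrix counts using their standard definitions. The function name `classification_metrics` and the counts below are made up for illustration and are not the paper's results.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts, not the paper's results.
m = classification_metrics(tp=40, fp=10, tn=35, fn=15)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The guards against empty denominators return 0.0 when a class is never predicted or never present, which is a common convention rather than something stated in the paper.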
4. Results and Discussions
4.1 Evaluation Results
This section presents the performance of Naïve Bayes based on the accuracy, precision, recall and F1-score derived from the confusion matrix. For the performance evaluation, the data have been divided into training and testing sets using 3 percentage splits, which are 70:30, 80:20 and 90:10. Table 3 shows the results of the performance evaluation for each percentage split. The performance values recorded are from the testing part of each split.
Table 3: Results of Performance Evaluation Based on Percentage Splits
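| Evaluation (%) | Training and Testing Split 70:30 | Training and Testing Split 80:20 | Training and Testing Split 90:10 |
| Accuracy | 80.39 | 79.41 | 79.96 |
| Precision | 78.00 | 78.00 | 78.00 |
| Recall | 76.00 | 75.00 | 76.00 |
| F1-Score | 77.00 | 76.00 | 77.00 |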
Based on Table 3, the percentage split of 70:30 has generated the highest accuracy of 80.39%. It can be seen that increasing the training percentage does not improve the classifier accuracy. Therefore, the 70:30 percentage split has been chosen for the final Naïve Bayes model. The accuracy is the number of correctly classified data (true positives and true negatives) over the whole data (true positives, true negatives, false positives, and false negatives). The accuracy of more than 80% shows that Naïve Bayes is able to detect the suicidal tweets with an acceptable performance. The accuracy of Naïve Bayes in this research is higher than the accuracy of 65% obtained by Kancharapu et al. (2022) and the 76.67% accuracy obtained by Chadha and Kaushik (2021). This suggests that the Naïve Bayes model from this research could be a better model than those in several of the previous similar works.
As for the precision, recall and F1-score, the results do not differ significantly among the percentage splits. This is consistent with the accuracies, which also differ from each other by no more than 1%. The precision measures the proportion of data classified as positive that are actually positive, while the recall measures the proportion of actual positive data that are correctly classified. The F1-score represents the harmonic mean of the precision and recall. Based on Table 3, the F1-score results show that Naïve Bayes has also generated acceptable performance in its ability to classify positive data.
4.2 Data Analysis
This section discusses the data scraped from Twitter, based on the data labelling and the word cloud. Figure 5 shows the number of negative and positive tweets that were scraped in May 2021. The tweets were labelled using the TextBlob Python library. It can be seen that more negative statements were tweeted during the lockdown. This suggests that more people had to deal with mental health issues. The pandemic has created a lot of anxiety, depression, stress and fear in society, which need to be addressed seriously. Neglect of mental health problems could be a cause of intended suicide.
Figure 5: Negative and Positive Label of Tweets
Figure 6 shows the word cloud for the Negative tweets. The negative words most used on Twitter were depression, suicide, anxiety, stress and many others. The word cloud is a convenient way for researchers to determine the commonly used words for a particular problem from the Twitter data. The freedom of speech on social media has given society the opportunity to express feelings and negative thoughts without limits. Individuals who suffer from mental illness may be embarrassed to seek professional assistance, so Twitter may be the only way for them to express their deepest feelings without facing someone directly.
Figure 6: Word Cloud for Negative tweets
4.3 Interface of Proposed System
The user interface of the proposed system has been developed using the Tkinter package. Tkinter was chosen because it offers an easy way to create a simple GUI application from Python code. In the prototype, the user types the tweet in the text field provided for the analysis. The output displays whether the statement is Positive or Negative, together with the percentage values. Figure 7 shows sample results from the testing of the proposed system.
Figure 7: Sample Results from Proposed System Testing
5. Conclusion
The performance of the Naïve Bayes algorithm has been explored in this sentiment classification problem. The main contribution of this research is the implementation of the Naïve Bayes algorithm in the detection of suicidal ideation. The algorithm has proven to be reliable and able to detect the suicidal tweets from the Twitter data. The evaluation results have shown that the algorithm produced good and acceptable performance with 80.39% accuracy. However, the accuracy of the Naïve Bayes algorithm might be increased further if more data are processed, improved pre-processing steps are applied and a different technique for the data labelling is adopted. The significance of this research is that it could detect suicidal ideation and give a warning sign before a suicide happens. This research could help psychiatrists and authorities to detect someone who is having depression. Cooperation between the public and the authorities could help many people with depression and could prevent suicides from happening.
Future work would be to compare the performance of Naïve Bayes with other classifiers such as K-Nearest Neighbour, Support Vector Machine and Random Forest. Other future work to consider would be scraping and processing non-English words from other social media platforms such as Facebook and Instagram.
References
Alsalman, H. (2020). An Improved Approach for Sentiment Analysis of Arabic Tweets in Twitter Social Media. 2020 3rd International Conference on Computer Applications & Information Security, 4–7. https://doi.org/10.1109/ICCAIS48893.2020.9096850
Angeles, A., Quintos, M. N., Jr., M. O., & Jr., R. R. (2021). Text-Based Gender Classification of Twitter Data using Naive Bayes and SVM Algorithm. 2021 IEEE Region 10 Conference (TENCON), 522–526. https://doi.org/10.1109/TENCON54134.2021.9707402
Bittrich, S., Kaden, M., Leberecht, C., Kaiser, F., Villmann, T., & Labudde, D. (2019). Application of an interpretable classification model on Early Folding Residues during protein folding. BioData Mining, 12(1), 1–17. https://doi.org/10.1186/s13040-018-0188-2
Burdisso, S. G., Errecalde, M., & Montes-y-Gómez, M. (2019). A Text Classification Framework for Simple and Effective Early Depression Detection Over Social Media Streams. Expert Systems with Applications, 133, 182–197. https://doi.org/10.1016/j.eswa.2019.05.023
Chadha, A., & Kaushik, B. (2021). Machine Learning based Dataset for Finding Suicidal Ideation on Twitter. Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks, 823–828. https://doi.org/10.1109/ICICV50876.2021.9388638
Chatterjee, M., Samanta, P., Kumar, P., & Sarkar, D. (2022). Suicide Ideation Detection using Multiple Feature Analysis from Twitter Data. 2022 IEEE Delhi Section Conference (DELCON).
Dewi, T. B. T., Indrawan, N. A., Budi, I., Santoso, A. B., & Putra, P. K. (2020). Community Understanding of the Importance of Social Distancing Using Sentiment Analysis in Twitter. 2020 3rd International Conference on Computer and Informatics Engineering, 336–341. https://doi.org/10.1109/IC2IE50715.2020.9274589
Figuerêdo, J. S. L., Maia, A. L. L. M., & Calumby, R. T. (2022). Early depression detection in social media based on deep learning and underlying emotions. Online Social Networks and Media, 31, 100225. https://doi.org/10.1016/j.osnem.2022.100225
Hasan, A., Moin, S., Karim, A., & Shamshirband, S. (2018). Machine Learning-Based Sentiment Analysis for Twitter Accounts. Mathematical and Computational Applications, 23(1), 11. https://doi.org/10.3390/mca23010011
Hinduja, S., Afrin, M., Mistry, S., & Krishna, A. (2022). Machine learning-based proactive social-sensor service for mental health monitoring using twitter data. International Journal of Information Management Data Insights, 2, 100113. https://doi.org/10.1016/j.jjimei.2022.100113
Isnain, A. R., Marga, N. S., & Alita, D. (2021). Sentiment Analysis Of Government Policy On Corona Case Using Naive Bayes Algorithm. Indonesian Journal of Computing and Cybernetics Systems, 15(1), 55–64.
Kancharapu, R., SriNagesh, A., & BhanuSridhar, M. (2022). Prediction of Human Suicidal Tendency based on Social Media using Recurrent Neural Networks through LSTM. 2022 International Conference on Computing, Communication and Power Technology (IC3P), 123–128. https://doi.org/10.1109/ic3p52835.2022.00033
Madhu, S. (2018). An approach to analyze suicidal tendency in blogs and tweets using Sentiment Analysis. International Journal of Scientific Research & Management Studies, 6(4), 34–36. https://doi.org/10.26438/ijsrcse/v6i4.3436
Mahasiriakalayot, S., Senivongse, T., & Taephant, N. (2021). Predicting Signs of Depression from Twitter Messages. 19th International Joint Conference on Computer Science and Software Engineering (JCSSE). https://doi.org/10.1109/JCSSE54890.2022.9836287
Mehra, R., Singh, M. K. B. G., Arora, R., Bala, T., & Saxena, S. (2017). Sentimental Analysis Using Fuzzy and Naive Bayes. 2017 International Conference on Computing Methodologies and Communication (ICCMC), 945–950. https://doi.org/10.1109/ICCMC.2017.8282607
Patel, H., & Soni, N. (2021). Machine Learning Based Approach for Prediction of Suicide Related Activity. Proceedings - 2nd International Conference on Smart Electronics and Communication, ICOSEC 2021, 967–972. https://doi.org/10.1109/ICOSEC51865.2021.9591836
Raghuwanshi, A. S., & Pawar, S. K. (2017). Polarity Classification of Twitter Data using Sentiment Analysis. International Journal on Recent and Innovation Trends in Computing and Communication, 5(6), 434–439. https://doi.org/10.17762/ijritcc.v5i6.792
Sakib, T. H., Ishak, M., Jhumu, F. F., & Ali, M. A. (2021). Analysis of Suicidal Tweets from Twitter using Ensemble Machine Learning Methods. 2021 International Conference on Automation, Control and Mechatronics for Industry 4.0 (ACMI), 8–9. https://doi.org/10.1109/ACMI53878.2021.9528252
V, V. K., & Samuel, P. (2022). A Multinomial Naïve Bayes Classifier for identifying Actors and Use Cases from Software Requirement Specification documents. 2022 2nd International Conference on Intelligent Technologies, 1–5. https://doi.org/10.1109/CONIT55038.2022.9848290