Hoax Detection of COVID-19 Tweets on Twitter Using LSTM-CNN with Word2Vec
Prisla Novia Anggreyani*, Warih Maharani
Fakultas Informatika, Telkom University, Bandung, Indonesia
Email: 1,*[email protected], 2[email protected]
Corresponding Author's Email: [email protected]
Abstract: The number of Twitter users grows every year, which affects activity on social media, including the increasingly widespread circulation of hoaxes across platforms. During the pandemic, the rate of hoaxes has grown because it is now very easy for people to interact, share opinions, and exchange information. One hoax topic that frequently appears concerns the Covid-19 virus. A method for detecting hoaxes is therefore needed, especially for the topic of the Covid-19 virus in Indonesia. The method used for hoax detection is LSTM-CNN with Word2Vec. More than 1000 tweets, divided into hoax and non-hoax categories, are used in this study. Word2Vec converts the text data into vectors for classification, and LSTM-CNN classifies the data. The results show that the LSTM-CNN model with Word2Vec achieves 79.71% accuracy, surpassing the standalone LSTM and CNN models.
Keywords: Hoax; Twitter; LSTM; CNN; Word2Vec
1. INTRODUCTION
Twitter user growth is accelerating year over year; it was expected to grow 26% year-on-year in the fourth quarter of 2021, to an average of 192 million users. This increase results in a spike in tweet activity on Twitter. The growing use of social media certainly has both positive and negative impacts. One of the negative impacts is the number of hoaxes circulating on social media, since social media makes it easy for people to interact, express opinions, and share information. Based on data from the Ministry of Communication and Information, as of August 30, 2021, there have been 4,163 hoax uploads about Covid-19: 3,523 on Facebook, 554 on Twitter, 49 on YouTube, 35 on Instagram, and 2 on TikTok. Hoax findings around Covid-19 have increased since July 29, 2021, to 1,814 issues distributed across 4,142 uploads: 3,502 uploads on Facebook, 554 on Twitter, 35 on Instagram, 49 on YouTube, and 2 on TikTok [1].
A hoax is news or information containing data that is not known to be accurate or that did not actually happen [2]. Such news creates many conflicting perceptions among the wider community using social media, which can affect daily life. Against this background of increasing hoax spread in Indonesia, particularly hoax news about the Covid-19 virus, research on this subject is needed. Existing research was conducted by Rajdev et al. using a Functional Tree, a Decision Tree with Naive Bayes on the leaves, and Random Forest [3].
Based on the research stated before, Functional Trees produced the highest accuracy, 91.71%. In addition, there is research on the detection of hoax news in Indonesian [4]. Naïve Bayes is used for classifying hoax and non-hoax news, where the accuracy obtained is 78.6% [5]. Testing is done with deep learning models such as Long Short-Term Memory (LSTM), Bidirectional LSTM (BI-LSTM), Gated Recurrent Unit (GRU), Bidirectional GRU (BI-GRU), and 1-Dimensional Convolutional Neural Network (1D-CNN), as well as two classifiers, Support Vector Machine (SVM) and Naïve Bayes. The results show that deep learning outperforms supervised text classification. Another study, A Study on Fake News Detection Using Naïve Bayes, SVM, Neural Networks and LSTM by Reddy et al. [6], tested the Naïve Bayes model, Support Vector Machine, Neural Network, and LSTM; the highest accuracy, 94.27%, was achieved by LSTM. Research [7] detects hoaxes using the Naive Bayes method, but the data used is still general, and research [8] detects hoaxes with the TF-IDF method for classifying news topics about the presidential election. Among the existing research, none has discussed hoaxes on the topic of Covid-19.
These studies obtained different results in detecting hoaxes on Twitter. Many factors affect the results, such as differences in the models and in the number of datasets used, so research can still be carried out to detect Twitter hoaxes about the Covid-19 virus. We chose the model with the highest accuracy from a previous study, LSTM [6], and noted from other work that the LSTM-CNN combination produces better accuracy than CNN-LSTM, CNN, or LSTM alone [9]. The contribution of this research is to evaluate the results obtained using the LSTM-CNN and Word2Vec methods on Twitter tweets about the Covid-19 virus.
2. RESEARCH METHODOLOGY
2.1 System Design
The system is built to detect hoaxes on the topic of the Covid-19 vaccine. Detecting hoaxes requires the following steps: data retrieval, pre-processing, feature extraction, modelling, and evaluation.
Figure 1. Research Method
2.2 Collecting Data
Two processes are carried out to conduct research on the detection of hoax tweets in Bahasa Indonesia using LSTM-CNN with Word2Vec on Twitter:
1. Crawling Data
Data is crawled from Twitter based on keywords related to the COVID-19 virus. We use Twint, short for Twitter Intelligence Tool [10], an advanced web scraping tool built in Python that scrapes the web instead of collecting data through the Twitter API the way Tweepy does (a minimal crawling sketch appears at the end of this step). After the data is collected, it is labelled manually by three validators to determine whether each tweet is a hoax. The types of hoax used as the criteria for labelling the data are as follows [11].
a. Satire or parody is content that insinuates or criticizes certain parties and has the potential to deceive the public.
b. Misleading content uses genuine information to steer public opinion, even though the created content has no real relationship to that information.
c. Imposter content is content created to deceive by imitating the original.
d. Fabricated content contains information that is not valid and cannot be accounted for.
e. False connection contains information whose content, title, and images differ from one another.
f. False context contains information that does not match the facts.
g. Manipulated content is information whose content has been altered to deceive readers.
After labelling, the data is combined into a CSV file with a tweet_id, the tweet text, and the label. The dataset contains 410 hoax tweets and 623 non-hoax tweets.
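A minimal sketch of the crawling step with Twint might look as follows; the keyword, language filter, tweet limit, and output file name are illustrative assumptions, not the exact query used in this study.

    import twint

    c = twint.Config()
    c.Search = "covid-19"        # keyword related to the COVID-19 virus (assumed)
    c.Lang = "id"                # restrict to Indonesian tweets
    c.Limit = 1500               # stop after roughly this many tweets (assumed)
    c.Store_csv = True           # write results to a CSV file
    c.Output = "tweets_raw.csv"  # hypothetical output file name
    twint.run.Search(c)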
2. Labelling Data
After the data has been collected, it is labelled with two categories: hoax (1) and non-hoax (0). Three people act as data reviewers, and a tweet is categorized as a hoax if at least two of them give it a 1.
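The sketch below implements this majority-vote rule; the vote lists are illustrative.

    # A tweet is labelled hoax (1) when at least two of the three
    # reviewers vote 1; `votes` holds one 0/1 vote per reviewer.
    def majority_label(votes):
        return 1 if sum(votes) >= 2 else 0

    print(majority_label([1, 1, 0]))  # 1 (hoax)
    print(majority_label([0, 1, 0]))  # 0 (non-hoax)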
Table 1. The Data Sample

Tweet ID | Tweet | Label
01 | Negeri bengekkk. Bahkan istilah kata "COVID-19" aja DIKRIMINALISASI DIMONSTERISASI... | 1
02 | Saatnya Anda Memilih #RamuanClovid untuk menyembuhkan Covid-19. Ramuan CLO2VID | 1
03 | @kepo_jepang Genki desu. Hari ini tepat 2 tahun kasus pertama Covid-19 dan sedang musim hujan. Masih banyak yang kena Covid-19 dan sakit flu. | 0
2.3 Preprocessing
Pre-processing transforms the collected data into structured data. This step is done because tweets often contain a lot of noise, which can reduce the accuracy results. The steps taken in data pre-processing are listed below, followed by a minimal sketch:
a. Remove Punctuation: remove punctuation marks from the text.
b. Remove Stop words: remove commonly used words that carry little meaning.
c. Tokenization: divide the text into tokens.
d. Case Folding: change all sentences to lowercase.
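The Python sketch below illustrates steps a-d. The stopword list here is a small illustrative set, not the list used in the study, and case folding is applied before stopword matching so that matching is case-insensitive; the exact ordering used by the authors is an assumption.

    import string

    # toy Indonesian stopword list (an assumption, for illustration only)
    STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk"}

    def preprocess(tweet):
        # a. remove punctuation
        tweet = tweet.translate(str.maketrans("", "", string.punctuation))
        # d. case folding: change all text to lowercase
        tweet = tweet.lower()
        # c. tokenization: split the text into tokens
        tokens = tweet.split()
        # b. remove stopwords
        return [t for t in tokens if t not in STOPWORDS]

    print(preprocess("Masih banyak yang kena Covid-19 dan sakit flu."))
    # ['masih', 'banyak', 'kena', 'covid19', 'sakit', 'flu']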
After the pre-processing stage, the next steps are Word2Vec feature extraction and LSTM-CNN classification. With Word2Vec, text data that is otherwise unreadable during the LSTM-CNN deep learning process becomes easier to handle. Figure 2 below shows the vector representation of the word 'pandemi'; the most similar words are placed close to each other in this space [12]. These vectors are then used to carry out the LSTM-CNN classification process. Words that have been converted into vectors are collected in a place called a vector space [13].
Figure 2. Vector from the word 'pandemi'
2.4 Feature Extraction
To do the modelling, the pre-processed data needs to be converted into a matrix or vector. This is done using Word2Vec, after which the data is divided into train data and test data. Word2Vec builds a vocabulary from the available data and then determines how to represent each word. Word2Vec is a classic method of creating word embeddings in Natural Language Processing (NLP): it takes words from large text sets as input and learns to produce vector representations [14]. The two learning algorithms in Word2Vec are Continuous Bag-Of-Words (CBOW) and Skip-gram [15].
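The sketch below, assuming the gensim library, illustrates how Word2Vec can be trained on the tokenized tweets; the corpus shown and the hyperparameter values (vector size, window, minimum count) are illustrative assumptions rather than the study's exact settings.

    from gensim.models import Word2Vec

    # `corpus` is a list of token lists from the pre-processing step;
    # the two sentences here are only illustrative.
    corpus = [
        ["kasus", "pertama", "covid19", "musim", "hujan"],
        ["pandemi", "covid19", "masih", "berlangsung"],
    ]

    # sg=0 selects CBOW, sg=1 selects Skip-gram
    w2v = Word2Vec(sentences=corpus, vector_size=100, window=5,
                   min_count=1, sg=0)

    vector = w2v.wv["pandemi"]                # dense vector for one word
    similar = w2v.wv.most_similar("pandemi")  # neighbours in the vector space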
2.5 LSTM Modelling
Classification of whether a datum is a hoax or not is done with LSTM-CNN modelling. The LSTM-CNN model consists of an initial layer that accepts the word embedding of each token in the tweet as input. The intuition is that each output token stores information not only from the initial token but also from the previous tokens; in other words, the LSTM layer generates a new encoding of the original input. Long Short-Term Memory (LSTM) is a development of the recurrent network that avoids long-term dependency problems by using memory cells and gate units [16]. LSTM has a series of repeating modules like an RNN but with a different structure. The output of the LSTM layer is then fed into a convolution layer, which is expected to extract local features. The Convolutional Neural Network (CNN) is a deep learning model commonly used in computer vision, but it can also be used for sentence classification, as in Kim's research [5]. CNN works like a sliding window: each word is represented as a 3-dimensional vector, and a weight matrix shifts horizontally across the sentence. The output of the convolution layer is then aggregated to a smaller dimension and ultimately emitted as a positive or negative label [17].
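A minimal sketch of such an LSTM-CNN architecture in Keras is shown below. The vocabulary size, kernel size, pooling choice, and L2 regularization strengths are assumptions made for illustration; the unit count, dropout, and regularizer placement follow the scenarios described in the results section.

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    VOCAB_SIZE, EMBED_DIM = 10000, 100  # illustrative sizes

    model = tf.keras.Sequential([
        # embedding layer; in this study its weights come from Word2Vec
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        # LSTM re-encodes each token with context from the previous tokens
        layers.LSTM(16, return_sequences=True,
                    kernel_regularizer=regularizers.l2(1e-4),
                    bias_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.3),
        # convolution layer extracts local features from the LSTM output
        layers.Conv1D(16, kernel_size=3, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-4),
                      bias_regularizer=regularizers.l2(1e-4)),
        layers.GlobalMaxPooling1D(),
        # dense layer with an activity regularizer, as in the scenarios
        layers.Dense(16, activation="relu",
                     activity_regularizer=regularizers.l2(1e-4)),
        layers.Dense(1, activation="sigmoid"),  # hoax (1) vs non-hoax (0)
    ])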
2.6 Evaluation
In evaluating the model, the Confusion Matrix is used in calculating Accuracy, F1-Score, Precision, and Recall.
The Confusion Matrix consists of four values: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The metrics are calculated with the formulas shown in equations (1) to (4).
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}  (1)

\text{Precision} = \frac{TP}{TP + FP}  (2)

\text{Recall} = \frac{TP}{TP + FN}  (3)

\text{F1-Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}  (4)

where:
TP = the number of positive-class (0) data correctly predicted as the positive class (0).
TN = the number of negative-class (1) data correctly predicted as the negative class (1).
FP = the number of negative-class (1) data incorrectly predicted as the positive class (0).
FN = the number of positive-class (0) data incorrectly predicted as the negative class (1).
TP and TN count the model's correct predictions, while FP and FN count its incorrect predictions.
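For illustration, these four metrics can be computed with scikit-learn as sketched below; the labels are toy values. Note that scikit-learn treats class 1 as the positive class by default, so pos_label=0 would be needed to mirror the convention above exactly.

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 f1_score, precision_score, recall_score)

    y_true = [0, 0, 1, 1, 0, 1]   # illustrative ground-truth labels
    y_pred = [0, 1, 1, 1, 0, 0]   # illustrative model predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+FP+TN+FN)
    print(precision_score(y_true, y_pred))  # TP/(TP+FP)
    print(recall_score(y_true, y_pred))     # TP/(TP+FN)
    print(f1_score(y_true, y_pred))         # 2*(Recall*Precision)/(Recall+Precision)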
3. RESULT AND DISCUSSION
In this experiment, we conducted several scenarios to find which combination of parameters obtains the best result. The parameters we tested are the number of units, the dropout value, and the regularizer. For the number of units, we tried 16, 32, and 64; dropout values range from 0.1 to 0.3. We used a bias regularizer and a kernel regularizer for the LSTM and convolutional layers, and for the dense layer we also added an activity regularizer.
We use the same learning rate and data split ratio for all scenarios: the learning rate is 0.001 and the data split ratio is 80:20. Training for each scenario runs for 20 epochs with a callback function to train more efficiently.
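A minimal training sketch under these settings is shown below. The arrays X and y, the random seed, and the EarlyStopping configuration are assumptions; the paper only states the learning rate, split ratio, epoch count, and the use of a callback.

    import tensorflow as tf
    from sklearn.model_selection import train_test_split

    # `X` (padded token-index sequences) and `y` (0/1 labels) are assumed
    # to exist; the 80:20 split and learning rate follow the text above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])

    # EarlyStopping is one plausible choice for the callback the paper
    # mentions (an assumption, not the paper's exact setup)
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True)

    model.fit(X_train, y_train, validation_data=(X_test, y_test),
              epochs=20, callbacks=[early_stop])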
Table 2. Result of Each Scenario
Regularizer | Dropout | Units | Precision | Recall | F1-Score | Accuracy
N | 0.1 | 16 | 76.74 | 73.88 | 74.61 | 76.81
N | 0.1 | 32 | 79.44 | 76.51 | 77.32 | 79.23
N | 0.1 | 64 | 55.35 | 50.63 | 40.83 | 60.39
N | 0.2 | 16 | 74.76 | 74.76 | 74.76 | 75.85
N | 0.2 | 32 | 74.95 | 72.47 | 73.10 | 75.36
N | 0.2 | 64 | 81.71 | 75.25 | 76.37 | 79.23
N | 0.3 | 16 | 77.70 | 74.07 | 74.89 | 77.29
N | 0.3 | 32 | 76.29 | 73.27 | 74.00 | 76.33
N | 0.3 | 64 | 80.58 | 73.42 | 74.42 | 77.78
Y | 0.1 | 16 | 81.60 | 76.07 | 77.17 | 79.71
Y | 0.1 | 32 | 76.74 | 73.88 | 74.61 | 76.81
Y | 0.1 | 64 | 47.97 | 48.06 | 47.92 | 51.21
Y | 0.2 | 16 | 77.01 | 75.94 | 76.34 | 77.78
Y | 0.2 | 32 | 76.07 | 73.48 | 74.16 | 76.33
Y | 0.2 | 64 | 78.86 | 75.00 | 75.96 | 78.26
Y | 0.3 | 16 | 79.63 | 77.33 | 78.05 | 79.71
Y | 0.3 | 32 | 80.41 | 70.56 | 71.22 | 75.85
Y | 0.3 | 64 | 75.00 | 73.71 | 74.15 | 75.85
The highest accuracy is 79.71%, obtained using 16 units for the LSTM, CNN, and Dense layers after applying dropout and regularizers. With the same parameters applied, the model with a dropout value of 0.1 had the same accuracy as the model with a dropout value of 0.3; however, the recall and F1-Score of the model with a dropout of 0.3 are higher. The results in Table 2 show that the number of units and the dropout value are not directly proportional to the model's performance.
However, in these results, smaller unit counts work better for the model because the texts in the dataset are generally short. On this dataset, dropout can increase performance by tackling overfitting: since some units are randomly dropped out, the remaining units do not depend too much on one another.
Table 3. Comparison of Loss
Dropout | Units | Loss (No Reg.) | Val Loss (No Reg.) | Loss (Reg.) | Val Loss (Reg.)
0.1 | 16 | 0.2237 | 0.5588 | 0.4772 | 0.6517
0.1 | 32 | 0.2026 | 0.4729 | 0.5753 | 0.6993
0.1 | 64 | 0.1169 | 0.5880 | 0.3891 | 0.7177
0.2 | 16 | 0.2397 | 0.4968 | 0.4994 | 0.6385
0.2 | 32 | 0.1919 | 0.5584 | 0.5381 | 0.6833
0.2 | 64 | 0.1305 | 0.7221 | 0.6068 | 0.8045
0.3 | 16 | 0.2411 | 0.5811 | 0.5059 | 0.6472
0.3 | 32 | 0.1993 | 0.5553 | 0.5511 | 0.6836
0.3 | 64 | 0.2228 | 0.4883 | 0.6668 | 0.8102
Furthermore, Table 3 shows the loss on the last training epoch for each scenario, with and without regularizers. Although the model without a regularizer has smaller loss values, the model with regularizers has a smaller gap between loss and validation loss.
Figure 3. Accuracy and Loss of the Model without Regularizer
Figure 3 shows the plot of accuracy and loss for the model without a regularizer. The training loss is decreasing, but the validation loss is increasing, which shows that the model without a regularizer tends to overfit. An overfit model learns from the noise, which reduces its performance when it is tested on other data.
Figure 4. Accuracy and Loss of the Model with Regularizer
Figure 4 shows the loss and accuracy of the model with regularizers. Even though the loss is higher than for the model without a regularizer, both the training loss and the validation loss are decreasing. This is a good fit: the model performs well on both training data and validation data.
After determining the best parameters for the model, we applied the same parameters to a standalone LSTM model and a standalone CNN model. The results, shown in Table 4, indicate that the performance of LSTM-CNN surpasses both LSTM and CNN.
Table 4. Comparison of Model Performance
Model | Precision | Recall | F1-Score | Accuracy
LSTM | 79.97 | 75.27 | 76.26 | 78.74
CNN | 79.61 | 75.48 | 76.42 | 78.74
LSTM-CNN | 79.63 | 77.33 | 78.05 | 79.71
Based on Table 4, the combined LSTM-CNN obtains the best accuracy: its recall, F1-score, and accuracy are higher than those of LSTM and CNN. More than 1000 tweets were used, and in general the more data used, the better the accuracy. In this test there are also signs of overfitting in the precision scores, because the hoax and non-hoax data are still not varied enough, which causes the inconsistent results visible in the plots. Table 5 shows sample data on which hoaxes in tweets were successfully detected using LSTM-CNN with Word2Vec.
Table 5. Prediction
Tweet ID | Label | Prediction Result
01 | 1 (hoax) | 1 (hoax)
02 | 1 (hoax) | 1 (hoax)
03 | 0 (non-hoax) | 0 (non-hoax)
4. CONCLUSION
This research ran several scenarios to search for the best parameters for LSTM-CNN in detecting hoaxes on the COVID-19 topic. The results show that LSTM-CNN can achieve 79.71% accuracy using 16 units in each layer combined with dropout and regularizers. The combination of LSTM and CNN obtained higher accuracy than models using only LSTM or only CNN. LSTM-CNN with Word2Vec can process large amounts of data to detect hoaxes. At the dataset collection stage, more varied hoax and non-hoax data are needed. For future work, performance can be improved by trying other methods and by using a larger dataset from other social media, such as Facebook.
REFERENCES
[1] L. Rizkinaswara, "Kominfo Temukan 1.819 Isu Hoaks Seputar Covid-19," Kominfo. https://aptika.kominfo.go.id/2021/08/kominfo-temukan-1-819-isu-hoaks-seputar-covid-19/ (accessed Oct. 26, 2021).
[2] K. Azizah, "Hoax adalah Berita Bohong, Kenali Ciri-Ciri, Jenis, dan Cara Mengatasinya," Merdeka. https://www.merdeka.com/trending/hoax-adalah-berita-bohong-kenali-ciri-ciri-jenis-dan-cara-mengatasinya-kln (accessed Oct. 26, 2021).
[3] C. Olah, "Understanding LSTM Networks," colah.github.io. http://colah.github.io/posts/2015-08-Understanding-LSTMs (accessed Dec. 5, 2021).
[4] I. Y. R. Pratiwi, R. A. Asmara, and F. Rahutomo, "Study of hoax news detection using naïve bayes classifier in Indonesian language," in 2017 11th International Conference on Information & Communication Technology and System (ICTS), Surabaya, Indonesia, Oct. 2017, pp. 73-78, doi: 10.1109/ICTS.2017.8265649.
[5] B. P. Nayoga, R. Adipradana, R. Suryadi, and D. Suhartono, "Hoax Analyzer for Indonesian News Using Deep Learning Models," Procedia Comput. Sci., vol. 179, pp. 704-712, 2021, doi: 10.1016/j.procs.2021.01.059.
[6] P. Reddy, D. Roy, P. Manoj, M. Keerthana, and P. Tijare, "A Study on Fake News Detection Using Naïve Bayes, SVM, Neural Networks and LSTM," J. Adv. Res. Dyn. Control Syst., vol. 1, pp. 942-947, 2019.
[7] H. Mustofa and A. A. Mahfudh, "Klasifikasi Berita Hoax Dengan Menggunakan Metode Naive Bayes," Walisongo J. Inf. Technol., vol. 1, no. 1, pp. 1-12, 2019.
[8] F. N. Rozi and D. H. Sulistyawati, "Klasifikasi Berita Hoax Pilpres Menggunakan Metode Modified K-Nearest Neighbor dan Pembobotan Menggunakan TF-IDF," KONVERGENSI, vol. 15, no. 1, Oct. 2019, doi: 10.30996/konv.v15i1.2828.
[9] P. M. Sosa, "Twitter sentiment analysis using combined LSTM-CNN models," Eprint Arxiv, pp. 1-9, 2017.
[10] A. K. Cotra, "Analysis On Tweets Using Python and TWINT," Towards Data Science (accessed Jun. 26, 2022).
[11] W. Kurniasih, "Pengertian Hoaks: Sejarah, Jenis, Contoh, Penyebab dan Cara Menghindarinya," Gramedia. https://www.gramedia.com/literasi/pengertian-hoaks/ (accessed Nov. 09, 2021).
[12] Z. Li, "A Beginner's Guide to Word Embedding with Gensim Word2Vec Model," Towards Data Science. https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 (accessed Jun. 27, 2022).
[13] W. Widayat, "Analisis Sentimen Movie Review menggunakan Word2Vec dan metode LSTM Deep Learning," J. MEDIA Inform. BUDIDARMA, vol. 5, no. 3, p. 1018, Jul. 2021, doi: 10.30865/mib.v5i3.3111.
[14] D. Karani, "Introduction to word embedding and word2vec," Data Sci., vol. 1, 2018.
[15] B. Jang, I. Kim, and J. W. Kim, "Word2vec convolutional neural networks for classification of news articles and tweets," PLoS One, vol. 14, no. 8, p. e0220976, 2019.
[16] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[17] M. Rajdev and K. Lee, "Fake and Spam Messages: Detecting Misinformation During Natural Disasters on Social Media," in 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, Dec. 2015, pp. 17-20, doi: 10.1109/WI-IAT.2015.102.