COVER PAGE
SENTIMENT ANALYSIS OF TWITTER DATA WITH A DEEP LEARNING METHOD USING THE CONVOLUTIONAL NEURAL NETWORK ALGORITHM ON THE 2019 PRESIDENTIAL ELECTION
FINAL PROJECT
Bayu Adi Wibowo 41515120061
INFORMATICS ENGINEERING STUDY PROGRAM, FACULTY OF COMPUTER SCIENCE, UNIVERSITAS MERCU BUANA
JAKARTA 2020
http://digilib.mercubuana.ac.id/
TITLE PAGE
SENTIMENT ANALYSIS OF TWITTER DATA WITH A DEEP LEARNING METHOD USING THE CONVOLUTIONAL NEURAL NETWORK ALGORITHM ON THE 2019 PRESIDENTIAL ELECTION
Final Project
Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Computer Science
By:
Bayu Adi Wibowo 41515120061
INFORMATICS ENGINEERING STUDY PROGRAM, FACULTY OF COMPUTER SCIENCE, UNIVERSITAS MERCU BUANA
JAKARTA
2020
ABSTRAK
Name : Bayu Adi Wibowo
Student Number : 41515120061
Final Project Advisor : Ir. Dr. Eliyani
Title : Sentiment Analysis of Twitter Data with a Deep Learning Method using the Convolutional Neural Network Algorithm in the 2019 Presidential Election

Sentiment analysis on Twitter data can describe the public's view of an entity. An entity can be an organization, an individual, a product, or an activity. The model built here provides a way to group entity-related data into two classes, positive and negative. This study uses one of the Deep Learning methods, the Convolutional Neural Network (CNN). The CNN model uses a one-dimensional Convolutional Layer with Global Max Pooling as the pooling method, and produces a model that can analyze the sentiment of Twitter data. The data consists of tweets with two sentiment classes, positive and negative. After several experiments on the CNN model, an accuracy of 92.62% was obtained. The model is used to classify Twitter data as positive or negative, and the results describe public opinion about the 2019 presidential election process. The results are stored in an Elasticsearch database; using Elasticsearch makes it easier to build visualizations with the help of Kibana.

Keywords:
sentiment analysis, nlp, convolutional neural network, deep learning, elasticsearch
ABSTRACT
Name : Bayu Adi Wibowo
Student Number : 41515120061
Advisor : Ir. Dr. Eliyani
Title : Sentiment Analysis with Deep Learning Method using CNN Algorithm in Twitter Data of 2019 Election
Sentiment analysis on Twitter data can illustrate the public's view of an entity. An entity can be an organization, an individual, a product, or an activity or event. The model built here provides a way to split entity-related data into two groups, positive and negative opinion. In this study, one of the Deep Learning methods is used, namely the Convolutional Neural Network (CNN). The CNN model uses a one-dimensional Convolutional Layer with Global Max Pooling as the pooling method. The CNN method produces a model that can analyze sentiment in Twitter data. The data used is Indonesian-language tweet data with two sentiment classes, positive and negative. After several experiments on the CNN model, the final accuracy is 92.62%. The model is used to classify Twitter data about the 2019 Indonesian election, and the results can reflect public opinion regarding the election process. The results are stored in an Elasticsearch database, which makes visualization easier with Kibana, the data analytics and visualization tool for Elasticsearch.
Keywords:
sentiment analysis, nlp, convolutional neural network, deep learning, elasticsearch
PREFACE
Praise and thanks to Allah SWT, by whose grace and blessing the author was able to complete this final project, entitled "Analisis Sentimen Data Twitter dengan Metode Deep Learning menggunakan Algoritma Convolutional Neural Network pada Pilpres 2019" (Sentiment Analysis of Twitter Data with a Deep Learning Method using the Convolutional Neural Network Algorithm in the 2019 Presidential Election).
The author realizes that without the help and guidance of the parties involved, this final project could not have been finished on time, above all that of Ir. Dr. Eliyani, who supervised the author throughout its preparation. Many parties provided facilities, help, and guidance to the author in completing this final project, in particular:
1. Ir. Dr. Eliyani, the final project supervisor, who devoted her time and thought to guiding the author through to completion.
2. Diky Firdaus, S.Kom, MM, the academic advisor, who has guided the author since the first semester and always gave support and motivation to graduate on time.
3. The author's parents, who always gave their support and prayers so that the journal, the final project, and its report could be completed smoothly.
4. The lecturers of the Faculty of Computer Science, Universitas Mercu Buana, who shared their knowledge and guidance so that the author could become a student who is useful to others.
5. All of the author's friends, who gave encouragement and support and helped in preparing this final project report.
The author realizes that this final project still has many shortcomings, so constructive criticism and suggestions are gladly welcomed. The author hopes that this final project proves useful and broadens our knowledge.
Jakarta, 18 January 2020
The Author
TABLE OF CONTENTS
COVER PAGE
TITLE PAGE
STATEMENT OF ORIGINALITY
STATEMENT OF APPROVAL FOR PUBLICATION OF THE FINAL PROJECT
STATEMENT OF FINAL PROJECT OUTPUT
APPROVAL SHEET
EXAMINER APPROVAL SHEET
VALIDATION SHEET
ABSTRAK
ABSTRACT
PREFACE
TABLE OF CONTENTS
JOURNAL MANUSCRIPT
WORKING PAPER
PART 1. LITERATURE REVIEW
PART 2. ANALYSIS AND DESIGN
PART 3. SOURCE CODE
PART 4. DATASET
PART 5. EXPERIMENT STAGES
PART 6. RESULTS OF ALL EXPERIMENTS
REFERENCES
APPENDICES
JOURNAL MANUSCRIPT
Sentiment Analysis with Deep Learning Method using CNN Algorithm in Twitter Data of 2019 Indonesian Election
Bayu Adi Wibowo, Eliyani*
Faculty of Computer Science, Universitas Mercu Buana, Jakarta, Indonesia
*E-mail: [email protected]

Abstract

Sentiment analysis on Twitter data can illustrate the public's view of an entity. An entity can be an organization, an individual, a product, or an activity or event. The model built here provides a way to split entity-related data into two groups, positive and negative opinion. In this study, one of the Deep Learning methods is used, namely the Convolutional Neural Network (CNN). The CNN model uses a one-dimensional Convolutional Layer with Global Max Pooling as the pooling method. The CNN method produces a model that can analyze sentiment in Twitter data. The data used is Indonesian-language tweet data with two sentiment classes, positive and negative. After several experiments on the CNN model, the final accuracy is 92.62%. The model is used to classify Twitter data about the 2019 Indonesian election, and the results can reflect public opinion regarding the election process. The results are stored in an Elasticsearch database, which makes visualization easier with Kibana, the data analytics and visualization tool for Elasticsearch.
Keywords: Sentiment Analysis, NLP, Convolutional Neural Network, Deep Learning, Elasticsearch
1 Introduction
Twitter has more than 319 million active users every month. This large user base generates huge amounts of data, which is valuable for groups or individuals in strong political, social, or economic positions who want to monitor and maintain their reputation [1]. Sentiment analysis, or opinion mining, is the computational study of people's opinions, behavior, and emotions toward something. It is one solution to the problem of grouping public opinion [2], [3].
Extracting information from unstructured text data is a challenging area of study that belongs to the field of text mining. Some of the challenges are the large data volume, language that keeps evolving, and the unstructured data format. Text mining combines statistical techniques, Machine Learning, and Natural Language Processing (NLP) [4]–[6].
Modern machine learning enables computers to solve perceptual problems such as image and sound recognition, which allows it to be applied in fields such as the biological sciences. The Deep Learning method uses several layers of processing to find patterns and structures in very large data sets. There are no rules that determine how many layers are needed to form Deep Learning, but most experts agree that more than two layers are needed [7]. The superior and reliable performance of Deep Learning methods has attracted the attention of researchers in every field of science who want to harness their power to solve problems [8].
Convolutional Neural Network (CNN) is part of Deep Networks for supervised learning models. Based on the class of the data, the network maps the input to the expected output [8]. A CNN consists of three types of layers: Convolutional, Pooling, and Fully Connected layers. The Convolutional Layer tries to learn significant features that appear in the data with the help of filters/kernels whose coefficients are determined during the training phase. The lower layers of a CNN learn basic features, and deeper in the network the kernels learn increasingly complex features [8], [9].
This study discusses how a CNN works on text data to build a model. The model determines whether a text carries positive or negative sentiment, which makes grouping public opinion from existing text data quick and easy; the process is more efficient than grouping texts manually one by one.
Elasticsearch is a database used for search-engine purposes: an engine that can perform word searches and analytics such as counts, sums, and averages over the data. Elasticsearch is built on Apache Lucene, so an indexing process makes searching fast. It can be applied to Big Data and is capable of answering a query over 25 million records in 0.2 seconds [10]–[12].
Elasticsearch has a web-based user interface (UI), known as Kibana, that makes it easy for users to work with the data and to create analytics or visualizations. With Kibana we can visualize data using charts, graphs, maps, and tables [10], and a dashboard can display multiple visualizations to make the data more informative [10], [13].
2 Methodology
The research is carried out in several separate stages. The whole flow contains many processes, so it is split into stages to make it easier to understand; each piece of work then focuses on one stage only. The whole flow can be seen in Figure 1.
Figure 1. Whole Stage
2.1 Model Building
The first step is to build a model using one of the Deep Learning algorithms, namely the Convolutional Neural Network. Several processes are needed at this stage to produce the best model. The input is labeled text data, which the model uses during the learning process; the output is a model that can classify the sentiment of a text.
The stages start from collecting the data that will be used. The data is then preprocessed so it is ready for the CNN modeling process, and finally the model is evaluated to judge whether it is optimal or not.
The whole stage can be seen in Figure 2.
Figure 2. Model Building Stage
2.1.1 Data Collection
The data collection stage takes tweet data that already carries a sentiment label: every record contains a label, or class, for each tweet, namely negative, positive, or neutral sentiment. The data is a CSV file with two columns, text and sentiment; the text column contains tweets from Twitter users and the sentiment column is the label of the data.
The experiment uses 76,066 records in total, composed of 41,459 negative, 8,865 neutral, and 25,742 positive records. In machine learning this data is known as a dataset. Not all of the dataset is used; only the negative and positive records are kept. Several processes then remove characters from the records to clean the data. Examples of the dataset used in this experiment can be seen in Table 1.
Table 1. Examples of the dataset

Tweet | Sentiment
oi it bukan snsd kalau tidak tahu mn snsd yang mn cabe tidak usah ngomong tukng fitnah memlukan indonesia saja u ih najis | Negative
gua geram ame wgl t t bhakk | Negative
pada akhirnya gua geram sama kelakuan gua | Negative
subhanallah berkah ramadhan luar biasa banget apa pun menu makanan buka sahur selalu diberikan kenikmatan :) | Positive
nonton acra hafidz indonesia di rcti subhanallah luar biasa dan smkin malu saya kalah sm anak kcil | Positive
The label of each record is in the sentiment column. CNN uses this label to learn the pattern and build the model. Labels were assigned by data analysts judging the opinion in the existing text: if the text contains positive words, the record is labeled positive; when it contains negative words, it is classified as negative.
2.1.2 Data Preprocessing
The training dataset is raw data and contains unwanted characters (dirty data), so it is not yet ready for the CNN modeling process. The dataset label is the word "Positive", "Negative", or "Neutral"; it is replaced with the numbers 0 and 1, where 0 stands for negative and 1 for positive data. The preprocessing flow is explained in Figure 3. This stage is needed so that the text becomes smaller and standardized [14].
Figure 3. Preprocessing Stage
The data obtained has three different labels: positive, negative, and neutral. Records with a neutral label are discarded because the CNN model in this experiment uses only two labels. This preprocessing stage uses the Python programming language: the CSV file is read and transformed into a dataframe, and the Pandas library makes the whole preprocessing stage easier.
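These dataframe steps can be sketched with Pandas as follows; the in-memory CSV and its sample rows are illustrative stand-ins for the real dataset file:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the real dataset file;
# the column names (text, sentiment) follow the thesis.
raw = io.StringIO(
    "text,sentiment\n"
    "subhanallah berkah ramadhan luar biasa,Positive\n"
    "gua geram ame wgl,Negative\n"
    "hari ini hujan,Neutral\n"
)
df = pd.read_csv(raw)

# Drop neutral rows and convert the label words to 0/1 as described.
df = df[df["sentiment"] != "Neutral"].copy()
df["label"] = df["sentiment"].map({"Negative": 0, "Positive": 1})
```

The numeric label column then feeds directly into the CNN training step.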
Regular expressions (regex) are used to clean the dataset. Many characters are deleted from the text because they do not affect the meaning of the tweet. Deleting them makes the model more accurate, since it then only processes the required data and some outliers are removed; it also reduces the length of the text, and the smaller dataset makes the modeling process faster.
A regex removes the word "RT" from the text; "RT" means the record is a retweet of a previous tweet. The entire text is then changed to lowercase, so the same word is not treated as different because of capitalization. Mentions are deleted to reduce the overall number of words used; a mention calls another user in a text and is marked by the at symbol (@). Numbers and symbols are also deleted. Text containing a new line is replaced with spaces, so every record becomes a single line. The last cleanup deletes multiple white spaces, that is, runs of more than one space and spaces at the beginning or end of the text.
The modeling process cannot run on data in the form of text or sentences; like a computer, the model only understands numbers. Therefore the text is converted with the help of a tokenization process, which breaks text into words, phrases, or even sentences, usually called tokens.
This experiment splits each record into parts based on space separators. Each part is a token, and each token is converted into a number. Each number represents one unique word, so one word and another get different numbers. An example of the tokenization process can be seen in Figure 4.
Figure 4. Tokenization Example
The library used for the tokenization process is the Tokenizer from the Keras preprocessing text module. The total number of words set by the parameter is 5,000, which means the data is assumed to contain no more than 5,000 unique words.
CNN accepts its input as a list of numbers (an array of numbers), so each text is converted into an array of numbers. CNN also only accepts inputs of the same length to form its layers. Therefore the length of each record is standardized to 100 words: if a text is shorter than 100 tokens, zeros are appended behind it until the length reaches 100.
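With the Keras Tokenizer named above, tokenization plus the zero padding described here looks roughly like this (the sample texts are illustrative):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    "jokowi lebih mudah atasi banjir",
    "awal debat yang keren",
]

# Vocabulary capped at 5000 words, as in the experiment.
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Standardize every record to 100 tokens; zeros are appended behind
# ("post" padding), matching the description above.
padded = pad_sequences(sequences, maxlen=100, padding="post")
```

The resulting padded array of shape (number of texts, 100) is the input format the CNN expects.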
2.1.3 CNN Model Building
Convolutional neural networks are a development of neural networks. The overall process is the same as in a neural network of interconnected neurons; the difference is that a CNN has a Convolution layer, as shown in Figure 5. This layer filters the data, so its output is a richer interpretation of the contents of the data. In the convolutional layer the data is divided according to a specified amount (the kernel size) and then pooled.

Figure 5. Convolutional Neural Network

Pooling is the process of calculating the input data according to the given kernel size. In this experiment max pooling is used, which means the largest number in the data is taken. Figure 6 shows how the pooling process is carried out. The experiment uses a 1-D Convolutional Layer because the data is a list of numbers, which is one-dimensional data.
Figure 6. Global Max-Pooling
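To make these two operations concrete, here is a minimal pure-Python sketch of a 1-D convolution and global max pooling (toy numbers, a single filter, no learned weights; the real model uses Keras layers):

```python
def conv1d(values, kernel):
    """Slide the kernel over a 1-D list, taking the dot product at each position."""
    k = len(kernel)
    return [
        sum(v * w for v, w in zip(values[i:i + k], kernel))
        for i in range(len(values) - k + 1)
    ]

def global_max_pool(values):
    """Global max pooling keeps only the largest value of the whole sequence."""
    return max(values)

feature_map = conv1d([1, 2, 3, 4], [1, 1])  # [3, 5, 7]
pooled = global_max_pool(feature_map)        # 7
```

Each Conv1D filter in the model produces such a feature map, and global max pooling reduces it to one number per filter.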
The next layer is the dense layer, which reduces the number of neurons according to the given parameter. In this experiment two dense layers are used. The first dense layer is set to 10 neurons, so its output is 10 neurons, and it uses the ReLU activation function. The last layer is set to 1 neuron with the sigmoid activation function.
In the first dense layer, the Rectified Linear Unit activation function, often called ReLU, is used. Its formula can be seen in Figure 7: ReLU only passes positive numbers, and if the input is negative it is replaced with 0. From the graph, the output of ReLU ranges from 0 to infinity.
The last dense layer has 1 neuron and uses the sigmoid activation function. As seen in the sigmoid part of Figure 7, the output of the calculation is a number from 0 to 1. Sigmoid is very helpful when building a model that has two targets or labels.
Figure 7. Sigmoid and ReLU
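The two activation functions can be written directly from their formulas; a tiny Python sketch:

```python
import math

def relu(x):
    """ReLU replaces negative inputs with 0; outputs range from 0 to infinity."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid squashes any input into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))
```

For instance, relu(-3) gives 0.0 and sigmoid(0) gives exactly 0.5, the decision boundary used below.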
The last layer produces an output between 0 and 1, which can be mapped back to a positive or negative label. Following the label conversion in the preprocessing stage (0 for negative data and 1 for positive data), if the output is below 0.5 the record is marked as negative sentiment; if it is above 0.5 it is marked as positive sentiment.
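Putting the described layers together, a Keras sketch of the architecture could look like the following. The embedding dimension and the Conv1D filter count and kernel size are not stated in the text, so those values are placeholders; the 10-neuron ReLU dense layer and 1-neuron sigmoid output follow the description above.

```python
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Conv1D, Dense, Embedding, GlobalMaxPooling1D

model = Sequential([
    Input(shape=(100,)),               # padded sequences of 100 tokens
    Embedding(5000, 64),               # vocab 5000; embedding dim 64 is an assumption
    Conv1D(64, 5, activation="relu"),  # filter count/kernel size are assumptions
    GlobalMaxPooling1D(),              # keep the max of each feature map
    Dense(10, activation="relu"),      # first dense layer: 10 neurons, ReLU
    Dense(1, activation="sigmoid"),    # output: one score in 0..1
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

The single sigmoid output per input is then thresholded at 0.5 to decide the sentiment label.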
2.1.4 Model Evaluation
There are many parameters that can be used to build a model, and the data preprocessing can be combined from various processes. Therefore each model built is evaluated on its accuracy in classifying the data. If the accuracy is weak, the previous processes are analyzed and the model is remade to get better results. This is repeated until the most optimal model is produced.
2.2 Election Data Classification
At this stage, Twitter data about the 2019 presidential election in Indonesia is processed. As shown in Figure 6, the process starts from data collection and ends with storing data in the database. Several processes run concurrently with the CNN model building of Figure 2; this is done because building the CNN model takes some time, so splitting the processes makes the experiment more efficient. The whole process uses the Python programming language.
Figure 6. Election Data Classification Stage
2.2.1 Election Data Collection
Data is taken from Twitter directly using a program written in the Python programming language, via the API provided by Twitter. The data collected contains hashtags related to the 2019 presidential election, and the collection is divided into two parts: data mentioning Jokowi and data mentioning Prabowo, the two presidential candidates in the 2019 election.
The data is fetched with the requests library, which is used to access the HTTP API provided by Twitter. Requesting data through this API requires a token for authentication; the token is obtained by registering a Twitter account as a Twitter Developer account.
import requests

# b64_encoded_key is the Base64-encoded "API key:API secret" pair
base_url = 'https://api.twitter.com/'
auth_url = '{}oauth2/token'.format(base_url)

auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'
}

auth_data = {
    'grant_type': 'client_credentials'
}

# Exchange the application key for a bearer token
auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
auth_resp.status_code
auth_resp.json().keys()
access_token = auth_resp.json()['access_token']

search_headers = {
    'Authorization': 'Bearer {}'.format(access_token)
}

search_params = {
    'query': 'Pilpres2019'
}

# Full-archive search endpoint for the "dev" environment label
search_url = '{}1.1/tweets/search/fullarchive/dev.json'.format(base_url)

tweet_data = requests.get(search_url, headers=search_headers,
                          params=search_params).json()
2.2.2 Data Mapping
The output of the program is a JSON file. This file still contains raw data with many unneeded fields, so it is processed so that only the needed fields are mapped. The fields to be retrieved are created_at, text, and username.
{
  "results": [
    {
      "created_at": "Fri Apr 26 14:46:53 +0000 2019",
      "id": 1177823455686746000,
      "id_str": "1177823455686746113",
      "text": "RT @covesiacom: @infobencana Jokowi Lebih Mudah Atasi Banjir kalau Jadi Presiden| http://t.co/D2oCQ5WUJI http://t.co/ob8wlFNmgu lewat @shar…",
      "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"
      ...
      ...
    }
  ]
}
The text will be used later in the classification process that generates the sentiment of each record. The process used is ETL (Extract, Transform, Load) in the Python programming language [15]: each JSON file is extracted (read), the data is transformed as needed, and after the classification process the clean data is loaded (ingested) into a database or stored in a file.
{
  "created_at": "2019-04-26 14:46:53",
  "text": "RT @covesiacom: @infobencana Jokowi Lebih Mudah Atasi Banjir kalau Jadi Presiden| http://t.co/D2oCQ5WUJI http://t.co/ob8wlFNmgu lewat @shar…",
  "clean_text": "jokowi lebih mudah atasi banjir kalau jadi presiden httptcodocqwuji httptcoobwlfnmgu lewat",
  "sentiment": "positive",
  "retweeted_screen_name": "covesiacom",
  "user_screen_name": "JaenudinDown"
}
2.2.3 Data Preprocessing
Data taken from Twitter has raw text containing many unnecessary characters. The text is preprocessed until it is ready for the classification process: after cleaning by removing the characters that are not needed, the data is turned into tokens. The process is the same as the preprocessing stage in Figure 3. The clean text is stored in a separate field to facilitate the classification process.
Tables 2 and 3 describe the preprocessing in detail, from the raw data produced by the mapping process to the clean data used for classification.
Table 2. Preprocessing Raw Data

No | created_at | text | retweeted_screen_name | user_screen_name
1 | Mar 30, 2019 @ 13:50:24.000 | RT @yusuf_dumdum: Pak @prabowo bilang Pancasila sudah final. Kalau begitu mungkin beliau bisa melarang bendera terlarang HTI saat kampanyen… | yusuf_dumdum | jcklung
2 | Mar 30, 2019 @ 13:50:24.000 | RT @Gerindra: Prabowo: Bagi kami, Pancasila adalah ideologi final. Kami bertekad untuk mempertahankan Pancasila sampai titik darah yang ter… | Gerindra | 98rahmatnur
3 | Mar 30, 2019 @ 13:50:24.000 | RT @MataNajwa: @narasitv @matakitaid @bumnbersatu @beneran_id @prabowo @jokowi "Ibu saya Nasrani, saya sejak 18 tahun sudah membela Pancasi… | MataNajwa | mhmmdbacht
4 | Mar 30, 2019 @ 13:50:24.000 | RT @zarazettirazr: Saya ingin bertanya apakah pak @jokowi sadar dan tau mengerti bahwa diantara pendukung2 pak jkw ada yg memfitnah saya se… | zarazettirazr | rillamaria_
5 | Mar 30, 2019 @ 13:50:24.000 | RT @putrabanten80: Awal Debat Yang Keren Dari Pak Prabowo...Tegas, Lugas Mudah Dimengerti Dan Sesuai Keadaan Dilapangan. #PrabowoBentengNK… | putrabanten80 | andarz_efyu

Table 3. Preprocessing Clean Data

No | created_at | text | retweeted_screen_name | user_screen_name
1 | 2019-03-30 13:50:24 | pak bilang pancasila sudah final kalau begitu mungkin beliau bisa melarang bendera terlarang hti saat kampanyen | yusuf_dumdum | jcklung
2 | 2019-03-30 13:50:24 | prabowo bagi kami pancasila adalah ideologi final kami bertekad untuk mempertahankan pancasila sampai titik darah yang ter | Gerindra | 98rahmatnur
3 | 2019-03-30 13:50:24 | ibu saya nasrani saya sejak tahun sudah membela pancasi | MataNajwa | mhmmdbacht
4 | 2019-03-30 13:50:24 | saya ingin bertanya apakah pak sadar dan tau mengerti bahwa diantara pendukung pak jkw ada yg memfitnah saya | zarazettirazr | rillamaria_
5 | 2019-03-30 13:50:24 | awal debat yang keren dari pak prabowotegas lugas mudah dimengerti dan sesuai keadaan dilapangan prabowobentengnk | putrabanten80 | andarz_efyu
2.2.4 Election Data Classification
Similar to the modeling process, the data is transformed into arrays of numbers (tokens). The clean text, in token form, is then classified using the model that the CNN process produced; the field used to determine sentiment is the clean text field. After classification, the output is a sentiment as a word (positive or negative), which is stored in the sentiment field.
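The word-label conversion described here amounts to a simple threshold on the model's sigmoid score; the helper name below is ours, not from the thesis:

```python
def score_to_sentiment(score):
    """Map a sigmoid output in 0..1 to the word stored in the sentiment field."""
    # 0 = negative, 1 = positive, following the preprocessing label conversion
    return "negative" if score < 0.5 else "positive"
```

For example, a model output of 0.2 becomes "negative" and 0.9 becomes "positive".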
2.2.5 Data Ingestion
The Elasticsearch database stores data as JSON documents. After the classification process the data is converted into JSON form; in Python the dictionary data type can be used, since a dictionary has the same structure as JSON.
When the data is ready, the next process is to ingest, or insert, the data into the Elasticsearch database. The library used for this process is the Elasticsearch client for the Python programming language.
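A sketch of this step: the Python dict mirrors the mapped JSON document shown earlier, and the commented client calls (including the index name) are assumptions that require the elasticsearch library and a running server:

```python
import json

# Dict mirroring the mapped JSON document from the Data Mapping stage.
doc = {
    "created_at": "2019-04-26 14:46:53",
    "clean_text": "jokowi lebih mudah atasi banjir kalau jadi presiden",
    "sentiment": "positive",
    "user_screen_name": "JaenudinDown",
}
payload = json.dumps(doc)  # dicts serialize directly to JSON

# With the elasticsearch-py client and a local server (index name is hypothetical):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.index(index="pilpres2019", document=doc)
```

Once indexed, the documents are immediately searchable and available to Kibana for visualization.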
3 Experiments and Results
In the process of making the model, several experiments were carried out to get the most optimal accuracy from the CNN model. Many parameters can be set when building a model, among them the epoch and batch size, and their values can be combined to reach an optimal result. There is no exact reference for obtaining an optimal model; each case differs depending on the data used. Details of the experiments can be seen in Table 4.

Table 4. Experiment Result

No | Epoch | Batch | Training Accuracy | Testing Accuracy
1 | 5 | 10 | 0.9958 | 0.9216
2 | 10 | 10 | 0.9984 | 0.9169
3 | 1 | 10 | 0.9564 | 0.9262
From the table, the model with epoch 1 and batch 10 is used because it has the best testing accuracy. The model is saved as a file with a .model extension; the pickle library is used to store it. This model is used in the classification process for the Twitter data, determining the sentiment of each text.
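The winning configuration (epoch 1, batch 10) can be sketched as a training call; the tiny model and random data below are stand-ins, not the thesis's actual architecture or dataset:

```python
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalMaxPooling1D

# Minimal stand-in model; toy random data replaces the real dataset.
model = Sequential([
    Input(shape=(100,)),
    Embedding(5000, 16),
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

x = np.random.randint(0, 5000, size=(40, 100))
y = np.random.randint(0, 2, size=(40,))

# Epoch 1 and batch 10: the best combination from the table.
history = model.fit(x, y, epochs=1, batch_size=10, verbose=0)
```

For persisting the trained model, Keras's own model.save is the more common route; the thesis uses the pickle library instead.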
The data stored in Elasticsearch is then visualized based on sentiment and text content. The content is filtered on the words jokowi and prabowo, and from this filter the data is visualized using a bar or pie chart. The chart shows the amount of data split by the existing sentiments. Figure 7 shows the process of making a chart in Kibana.

Figure 7. Creating Kibana Diagram

The overall data can be described with several visualizations. These visualizations are based on Twitter data, so the scope of the data is very limited: they describe the opinions of Twitter users and do not represent the entire community in Indonesia, and due to limitations in the collection process the data does not even cover all Twitter users in Indonesia. Visualizations of the overall data can be seen in Figures 8 and 9.
Figure 8. All Data Count by Sentiment, Jokowi Data Count by Sentiment

Figure 9. Comparison of Jokowi and Prabowo Sentiment, Prabowo Data Count by Sentiment

As shown in Figure 8, the data contains more negative sentiment: both the data containing the word jokowi and the data containing prabowo are dominated by records with negative sentiment. Viewed as a bar chart, the number of tweets can be seen more clearly; more data means higher popularity. However, a high level of popularity does not reflect positive opinion from Twitter users if the contents of the texts have negative sentiment.
4 Conclusion
Sentiment analysis can be used to see the polarity of people's views about something. It can be used to monitor the reputation of products, public figures, and anyone with a major influence on society.
The Convolutional Neural Network algorithm can be used for sentiment analysis of text data (Twitter) with an accuracy of 0.9262. The data used in this experiment concerns the 2019 Indonesian election and is written in Indonesian.
The Elasticsearch database can be used for data analytics and visualization; Kibana helps the visualization process and facilitates the use of Elasticsearch.
The collected data shows that public opinion about the election is dominated by negative opinions, illustrating that there are many pros and cons toward the presidential candidates in the election process. An entity with more data is more popular than one with less, but this does not guarantee that the entity has a good reputation in society; for example, when an entity's data carries more negative sentiment, the entity has a bad reputation.
References
[1] Z. Jianqiang, G. Xiaolin, and Z. Xuejun, "Deep Convolution Neural Networks for Twitter Sentiment Analysis," IEEE Access, vol. 6, pp. 23253–23260, 2018.
[2] E. Indrayuni, "Komparasi Algoritma Naive Bayes Dan Support Vector Machine Untuk Analisa Sentimen Review Film," J. Pilar Nusa Mandiri, vol. 14, no. 2, p. 175, 2018.
[3] A. W. Attabi, L. Muflikhah, and M. A. Fauzi, "Penerapan Analisis Sentimen untuk Menilai Suatu Produk pada Twitter Berbahasa Indonesia dengan Metode Naïve Bayes Classifier dan Information Gain," J. Pengemb. Teknol. Inf. dan Ilmu Komput. Univ. Brawijaya, vol. 2, no. 11, pp. 4548–4554, 2018.
[4] M. Sadikin, M. I. Fanany, and T. Basaruddin, "A New Data Representation Based on Training Data Characteristics to Extract Drug Name Entity in Medical Text," Comput. Intell. Neurosci., vol. 2016, 2016.
[5] C. Rangu, S. Chatterjee, and S. R. Valluru, "Text Mining Approach for Product Quality Enhancement," 2017.
[6] V. A. and S. S. Sonawane, "Sentiment Analysis of Twitter Data: A Survey of Techniques," Int. J. Comput. Appl., vol. 139, no. 11, pp. 5–15, 2016.
[7] N. Rusk, "Deep learning," Nature Methods, 2015.
[8] J. Ahmad, H. Farman, and Z. Jan, "Deep Learning Methods and Applications," in SpringerBriefs in Computer Science, 2019.
[9] W. Yin, K. Kann, M. Yu, and H. Schütze, "Comparative Study of CNN and RNN for Natural Language Processing," 2017.
[10] V. Sharma, Beginning Elastic Stack, 2016.
[11] D. Chen et al., "Real-time or near real-time persisting daily healthcare data into HDFS and Elasticsearch index inside a big data platform," IEEE Trans. Ind. Informatics, vol. 13, no. 2, pp. 595–606, 2017.
[12] U. Thacker, M. Pandey, and S. S. Rautaray, "Review of Elasticsearch Performance Variating the Indexing Methods," pp. 281–286, 2018.
[13] M. Bajer, "Building an IoT data hub with Elasticsearch, Logstash and Kibana," Proc. 2017 5th Int. Conf. Future Internet of Things and Cloud Workshops (W-FiCloud 2017), pp. 63–68, 2017.
[14] D. Virmani and S. Taneja, A Text Preprocessing Approach for Efficacious Information Retrieval, vol. 669, Springer Singapore, 2019.
[15] I. M. S. Putra and D. K. T. Adhitya Putra, "Rancang Bangun Engine ETL Data Warehouse dengan Menggunakan Bahasa Python," J. RESTI (Rekayasa Sist. dan Teknol. Informasi), vol. 3, no. 2, pp. 113–123, 2019.