
DOI: 10.30865/mib.v7i2.5637

Aspect-Based Sentiment Analysis on Twitter Using Long Short-Term Memory Method

Siti Inayah Putri, Erwin Budi Setiawan*, Yuliant Sibaroni

School of Computing, Informatics Study Program, Telkom University, Bandung, Indonesia
Email: 1sitiinayahputri@student.telkomuniversity.ac.id, 2,*erwinbudisetiawan@telkomuniversity.ac.id, 3yuliant@telkomuniversity.ac.id

Corresponding Author Email: erwinbudisetiawan@telkomuniversity.ac.id

Abstract−Twitter is one of the most popular social media platforms among Indonesian people. Due to the high number of users and the intensity of their use, Twitter can also be used to dig up information related to a topic or product through sentiment analysis. One of the most frequently discussed topics on Twitter is movie reviews. Everyone's opinion of a movie review can refer to different aspects, so aspect-based sentiment analysis can be applied to movie reviews to get more optimal results. Aspect-based sentiment analysis is a solution to find out the opinions of Twitter users on movie reviews based on those aspects. In this study, a system for aspect-based sentiment analysis was built with a dataset of Indonesian language movie reviews consisting of 3 aspects: plot, acting, and director. The classification model uses the Long Short-Term Memory (LSTM) method with the application of TF-IDF feature extraction, fastText feature expansion, and handling of imbalanced data using SMOTE. The results of this study for the plot aspect obtained an accuracy score of 74.86% and an F1-score of 74.74%, the acting aspect obtained an accuracy score of 94.80% and an F1-score of 94.74%, and the director aspect obtained an accuracy score of 94.02% and an F1-score of 93.89%.

Keywords: Aspect-Based Sentiment Analysis; Movie Review; LSTM; fastText; TF-IDF; SMOTE

1. INTRODUCTION

Technological developments produce many new things that can simplify human life, including social media. Social media is a place where a person can interact with other people online and share opinions or information. One of the popular social media platforms among Indonesian people is Twitter. Twitter is used by its users to share stories of their moments and exchange opinions on various topics [1]. The stories or opinions they share in the form of tweets sometimes contain abbreviations, sarcasm, or slang that are difficult to identify, so sentiment analysis is needed to extract meaningful information from these tweets.

Sentiment analysis is the process of extracting and identifying opinions and emotions from text and then classifying them as positive, neutral, or negative. Sentiment analysis can be used to evaluate a product or service and to support decision-making [2]. Aspect-based sentiment analysis is a technique that focuses on identifying aspects of the text. By considering aspects, the results of sentiment analysis can be better and more detailed [3]. On the topic of movie reviews, there are several aspects that viewers can comment on, such as plot, acting, and director.

Consumers or viewers can use the results of sentiment analysis on movie reviews to obtain more specific information about a movie before they watch it, and these results play an important role in selecting the movie they will watch [4]. Meanwhile, movie producers can use sentiment analysis to understand consumer or viewer views and evaluate the success of their movie production.

Sentiment analysis has developed using deep learning as its processing method. Deep learning is an innovation from machine learning that allows computers to understand more complex concepts by breaking them down into simpler ones to produce a broader understanding [5]. Sentiment analysis using deep learning has been widely done before. Lei Zhang et al. [6] and Li-Chen Cheng et al. [7] carried out sentiment analysis by applying several deep learning models. They concluded that deep learning models can effectively solve sentiment analysis problems with high accuracy values.

In 2019, Ashima Yadav and Dinesh Kumar Vishwakarma [8] carried out sentiment analysis research using several deep learning models, namely CNNs, RecNNs, RNNs, LSTM, GRU, and Deep Belief Networks and their architectures. They used several datasets with different topics, such as consumer reviews on Google Play; reviews of books, electronic devices, and kitchen equipment; and reviews of restaurants, online products, hotels, and places. Through the sentiment analysis that was carried out, they concluded that the LSTM model provides better results compared to other deep learning models.

Fenna Miedema [9] carried out a sentiment analysis study on movie reviews using the LSTM method with a dataset of 50,000 movie reviews from IMDb. Reviews were labeled with positive and negative labels, then trained and evaluated. The evaluated model produced an accuracy value of 86.75%.

Another study by Saeed Mian Qaisar [10] discussed sentiment analysis of IMDb movie reviews using the LSTM method. The study used 50,000 movie reviews, which were split into 25,000 reviews for training and 25,000 reviews for testing the classifier; each subset contains 12,500 negative and 12,500 positive reviews. Using the Adam optimizer and Doc2Vec as word embedding in the preprocessing stage, this research produced the highest accuracy value of 89.9%.

In research by Ravinder Ahuja et al. [11], the impact of using TF-IDF and N-gram feature extraction on sentiment analysis was analyzed using data originating from Twitter. By applying both feature extractions to six classification algorithms, they found that TF-IDF feature extraction performs 3-4% better than N-gram feature extraction. In addition, research by Robert Dzisevič and Dmitrij Šešok [12] also classified text using three different feature extraction techniques in a neural network and found that plain TF-IDF feature extraction works better than TF-IDF LSA and TF-IDF LDA when working with large datasets.

Emmanuella Anggi [13] compared the fastText and GloVe word embedding methods in text classification using the LSTM method. The experiment shows that fastText achieves an accuracy of 83%, higher than GloVe with an accuracy of 81%. Hanif Reangga Alhakiem and Erwin Budi Setiawan [14] have used fastText in aspect-based sentiment analysis with the logistic regression method, also using TF-IDF as feature extraction, and obtained a best F1-score of 96.48%.

Suja A. Alex et al. [15] proposed using SMOTE with the LSTM deep learning method to predict diabetes. SMOTE is used to handle imbalanced classes in datasets. The research compared the proposed method with other methods, namely CNN, CNN-LSTM, ConvLSTM, and the deep 1D-convolutional neural network (DCNN) technique. The models were analyzed using machine learning and deep learning approaches. The proposed model produced the highest prediction accuracy, 99.64%.

Based on those articles, this study aims to obtain the best performance value by using the LSTM classification method on an Indonesian language movie review dataset. In the system design, feature extraction is applied using the TF-IDF method by comparing the max feature parameters, fastText feature expansion is applied to get better results, and the SMOTE technique is applied to overcome the problem of imbalanced data.

To the author's knowledge, no previous research on aspect-based sentiment analysis of movie reviews has implemented the LSTM method with TF-IDF as feature extraction, fastText as feature expansion, and SMOTE. It is hoped that implementing all these methods produces the best model for an aspect-based sentiment analysis system using the LSTM classification method.

The dataset used in this study consists of 17,247 Indonesian language tweets with three categories of sentiment, namely positive, neutral, and negative. The dataset comes from Twitter and was collected using several keywords adjusted to the predetermined aspects of movie reviews, namely plot, acting, and director.

2. RESEARCH METHODOLOGY

2.1 System Design

The system design of this study has several processes, as shown in Figure 1. The system starts with crawling data, which is the process of collecting data from Twitter. The data is then labeled as positive, neutral, or negative, followed by the preprocessing stage, feature extraction with TF-IDF, feature expansion with fastText, the LSTM model, and performance evaluation using the confusion matrix.

Figure 1. Flowchart of Aspect Based Sentiment Analysis System

2.2 Crawling Data

The process of crawling data uses the snscrape module available in Python. The data collected are tweets in Indonesian, obtained using several keywords according to the aspects that have been determined, as shown in Table 1.


Table 1. Keywords for Every Aspect

Category Aspect   Keywords
Plot              plot, cerita, alur, jalan cerita, dialog, ending, skrip
Acting            acting, akting, aktris, aktor, pemeran, pemain, karakter, performansi
Director          sinematografi, director, pembuatan film, penyutradaraan, sinematik, direktor
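To make the crawling step concrete, a minimal sketch using the snscrape Python module is shown below. The query strings, per-keyword limit, and output file name are illustrative assumptions rather than the exact settings of this study, and depending on the snscrape version the tweet text attribute may be named content or rawContent.

```python
# Minimal crawling sketch with snscrape (illustrative query, limit, and file name).
import snscrape.modules.twitter as sntwitter
import pandas as pd

keywords = ["alur", "akting", "sinematografi"]   # a subset of the Table 1 keywords
rows = []
for keyword in keywords:
    scraper = sntwitter.TwitterSearchScraper(f"{keyword} film lang:id")  # Indonesian tweets
    for i, tweet in enumerate(scraper.get_items()):
        if i >= 1000:                            # illustrative per-keyword limit
            break
        rows.append({"date": tweet.date, "text": tweet.content})

pd.DataFrame(rows).drop_duplicates(subset="text").to_csv("movie_review_tweets.csv", index=False)
```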

2.3 Data Labelling

After the data is collected, it is labeled manually for the 3 aspects: plot, acting, and director. Each aspect is classified into positive, neutral, and negative. If an aspect is discussed, it is labeled "1" for positive, "0" for neutral, and "-1" for negative. If an aspect is not discussed, it is labeled "0". The number of positive, neutral, and negative labels in each aspect is shown in Table 2.

Table 2. Number of Sentiment Labels for Each Aspect

Category Aspect   Positive   Neutral   Negative
Plot              6999       6566      3682
Acting            2907       13770     570
Director          1907       14753     587

2.4 Preprocessing Data

Preprocessing is a stage to prepare data before the classification process. The preprocessing stage in this study is divided into several stages: data cleaning, case folding, tokenizing, data normalization, stop word removal and stemming.

2.4.1 Data Cleaning

Data cleaning is the first preprocessing stage that removes punctuation marks, emojis, symbols, numbers, links, hashtags, mentions, tabs, and extra spaces.

2.4.2 Case Folding

Case folding is the stage of changing all letters to lowercase. For example, "SERU Banget Deh" will change to "seru banget deh" after going through the case folding stage.

2.4.3 Tokenizing

Tokenizing is the stage for breaking sentences into groups of words called tokens. For example, the sentence "seru banget deh" will turn into tokens: "seru", "banget", "deh".

2.4.4 Data Normalization

Data normalization is a step to identify words written with excess letters or non-standard spelling and change them to writing according to KBBI. For example, "gw" is changed to "saya", "oiyaaa" is changed to "oh iya", and "tyda" is changed to "tidak".

2.4.5 Stop Word Removal

Stop word removal is the process of removing words that are considered unimportant. These unimportant words are words that do not have a specific meaning, for example, "yang", "ya", "di", and "itu".

2.4.6 Stemming

Stemming is the stage of changing words that have affixes (suffixes and prefixes) into their base words. For example, "filmnya" is changed to "film" and "melihat" is changed to "lihat".
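A condensed sketch of this preprocessing pipeline is given below, assuming the Sastrawi library for Indonesian stemming; the normalization dictionary and stop-word list are small illustrative samples, not the full lists used in this study.

```python
# Preprocessing sketch for Section 2.4 (illustrative dictionaries; Sastrawi assumed installed).
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stemmer = StemmerFactory().create_stemmer()
norm_dict = {"gw": "saya", "tyda": "tidak"}        # data normalization (sample entries)
stopwords = {"yang", "ya", "di", "itu"}            # stop word removal (sample entries)

def preprocess(text):
    text = re.sub(r"http\S+|@\w+|#\w+|[^A-Za-z\s]", " ", text)   # data cleaning
    text = text.lower()                                           # case folding
    tokens = text.split()                                         # tokenizing
    tokens = [norm_dict.get(tok, tok) for tok in tokens]          # normalization
    tokens = [tok for tok in tokens if tok not in stopwords]      # stop word removal
    return [stemmer.stem(tok) for tok in tokens]                  # stemming

print(preprocess("SERU Banget Deh filmnya!"))   # ['seru', 'banget', 'deh', 'film']
```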

2.5 Feature Extraction

Feature extraction is used to represent words in numeric vector form. Feature extraction is carried out using the TF-IDF method, which gives a weight to each word in the document. The TF (Term Frequency) of a term t is calculated as how often the term appears in a document relative to the total number of words in the document [11]. IDF (Inverse Document Frequency) is used to calculate the importance of a term from the number of documents in the dataset (D) divided by the number of documents containing the term t (df).


$IDF(t) = \log \frac{D}{df}$  (1)

So, the TF-IDF weight is formulated as follows.

$TF\text{-}IDF(t) = TF(t) \times IDF(t)$  (2)
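As a sketch, this weighting can be obtained with scikit-learn's TfidfVectorizer; the example documents are illustrative, the max_features value shown is one of the settings compared in Scenario 2, and note that scikit-learn applies a smoothed variant of equation (1).

```python
# TF-IDF feature extraction sketch (illustrative documents; max_features as in Scenario 2).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["seru banget film", "alur cerita lambat", "akting pemeran bagus"]  # preprocessed tweets
vectorizer = TfidfVectorizer(max_features=5000)   # 1000, 5000, and 10000 are compared later
X_tfidf = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(X_tfidf.shape)                              # (number of documents, number of features)
```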

2.6 Handling Imbalanced Data

From the previous data labeling results in Table 2, it is known that the data labels in each aspect are imbalanced. SMOTE (Synthetic Minority Oversampling Technique) is used to handle the imbalanced data. The basis of SMOTE is to interpolate between adjacent minority class examples, so it can increase the number of minority class samples by introducing new minority class examples in the surrounding neighborhood and help the classification process by increasing its generalization capacity [16]. Handling imbalanced data is expected to increase the performance value of the model created.
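A minimal sketch of this balancing step with the imbalanced-learn library is shown below; X_tfidf and y stand for the feature matrix and label vector from the earlier steps and are assumptions of this example.

```python
# SMOTE balancing sketch (X_tfidf and y assumed from the previous steps).
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_tfidf, y)
print(Counter(y), "->", Counter(y_balanced))      # minority classes are oversampled
```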

2.7 Feature Expansion

FastText is a word embedding method developed from Word2Vec. FastText overcomes Word2Vec's inability to learn representations of rare or non-standard words [17]. FastText explicitly uses information from subwords so that rare words can be appropriately embedded, representing words with character N-grams [18]. As a result, fastText can understand shorter words as well as suffixes and prefixes.

The fastText feature expansion process requires a corpus containing a collection of words with similar values as a dictionary. The corpus is built with the Gensim library in Python using the fastText model. Three corpora are created: the Twitter corpus, the News corpus, and the Twitter + News corpus. The Twitter corpus is derived from the Indonesian movie review dataset from Twitter, the News corpus is derived from an Indonesian news dataset from Twitter, and the Twitter + News corpus is derived from the combination of both. The total data in each fastText corpus is shown in Table 3, and a corpus-building sketch is given below.
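The following sketch builds one such similarity corpus with Gensim; the tokenized sentences and parameter values are illustrative, not the exact training configuration of this study.

```python
# fastText similarity corpus sketch with Gensim (illustrative sentences and parameters).
from gensim.models import FastText

sentences = [["alur", "cerita", "film", "seru"],
             ["akting", "pemeran", "utama", "bagus"]]             # tokenized tweets / news
ft_model = FastText(sentences=sentences, vector_size=100,
                    window=5, min_count=1, epochs=10)

# The Top-N most similar words are later used to expand features whose TF-IDF weight is 0.
print(ft_model.wv.most_similar("cerita", topn=5))
```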

Table 3. Total Corpus fastText

Corpus            Total
Twitter           7296
News              86853
Twitter + News    89119

2.8 Long Short-Term Memory Classification

Long Short-Term Memory (LSTM) is one of the deep learning models and is a development of the Recurrent Neural Network (RNN) model. The RNN has some memory but fails when the data has longer dependencies. Meanwhile, LSTM uses loops with the addition of gates to maintain relevance and keep related data from being lost in very long sequences [19]. So, the advantage of LSTM is its ability to learn long-term dependencies. The LSTM has a memory cell and gate units for each neuron.

Figure 2. LSTM Model Schematic

Like neural networks in general, LSTM has a repeating neural network structure, but the repeating module consists of four gates, as shown in Figure 2: the forget gate, input gate, cell gate, and output gate. The gates are paths that optionally let information pass; within them are sigmoid function layers and pointwise multiplication operations. The cell gate is the core of the LSTM structure, while the other three gates protect and control it [20].

In forget gates, data information is processed and selected to be stored or discarded in memory cells using the sigmoid function. This process serves to ensure that the update weight is manageable. This process results in a value of "1" if all data is stored and a value of "0" if all data is discarded.
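The forget-gate formula itself, equation (3), appears to have been lost during text extraction; the standard form, consistent with equations (4) to (8) below, is:

$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$  (3)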


The input gate decides which values to update using the sigmoid function. Then a new candidate vector to be stored in the memory cell is created using the tanh function.

$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$  (4)

$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$  (5)

In the cell gate, the old memory cell value is replaced with a new one, which combines the outputs of the forget gate and the input gate.

$c_t = f_t \times c_{t-1} + i_t \times \tilde{c}_t$  (6)

The last process is the output gate. Here, the sigmoid function decides which part of the memory cell value will be output, and the memory cell value is then passed through the tanh function to produce the hidden state.

$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$  (7)

$h_t = o_t \times \tanh(c_t)$  (8)
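A hedged Keras sketch of such a classifier is shown below. The layer sizes, the reshaping of TF-IDF vectors into single-step sequences, and the variable names X_balanced and y_encoded (labels mapped to 0, 1, 2) are assumptions of this example, not the exact architecture reported in this paper.

```python
# LSTM classifier sketch (assumed shapes and hyperparameters; not the paper's exact model).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_features = 5000                                        # TF-IDF max_features
X_seq = X_balanced.toarray().reshape(-1, 1, n_features)  # (samples, timesteps=1, features)

model = Sequential([
    LSTM(128, input_shape=(1, n_features)),              # gated memory cell described above
    Dense(3, activation="softmax"),                       # positive / neutral / negative
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_seq, y_encoded, epochs=10, batch_size=64, validation_split=0.1)
```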

2.9 Performance Evaluation

The evaluation stage is carried out to measure the performance value using the confusion matrix. The confusion matrix is useful for displaying the classification results in the form of actual values and predicted values. Table 4 represents the confusion matrix.

Table 4. Confusion Matrix

                               Actual Values
                               Positive    Negative
Predicted Values   Positive    TP          FP
                   Negative    FN          TN

There are four terms in Table 4: True Positive (TP) when the prediction is positive and the actual value is positive; False Positive (FP) when the prediction is positive but the actual value is negative; False Negative (FN) when the prediction is negative but the actual value is positive; and True Negative (TN) when the prediction is negative and the actual value is negative [21]. The performance value can then be calculated using the accuracy, precision, recall, and F1-score formulas.

a. Accuracy is the value of the ratio of correctly predicted data compared to all data. The accuracy value is formulated as follows:

$accuracy = \frac{TP + TN}{TP + FP + TN + FN}$  (9)

b. Precision is the comparison value between positive values that are correctly predicted and all data that are predicted to be positive. The precision value is formulated as follows:

$precision = \frac{TP}{TP + FP}$  (10)

c. Recall is the value of the ratio of true positive prediction data compared to all existing positive actual data. The recall value is formulated as follows:

$recall = \frac{TP}{TP + FN}$  (11)

d. F1-Score is a performance metric that takes recall and precision into account. F1-score is formulated as follows:

$F1\text{-}score = \frac{2 \times (recall \times precision)}{recall + precision}$  (12)
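These metrics can be computed directly with scikit-learn, as in the sketch below; y_test and y_pred are assumed to come from the trained model, and weighted averaging is one reasonable choice for the three-class setting.

```python
# Evaluation sketch for equations (9)-(12) (y_test and y_pred assumed available).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

print(confusion_matrix(y_test, y_pred))
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="weighted"))
print("recall   :", recall_score(y_test, y_pred, average="weighted"))
print("f1-score :", f1_score(y_test, y_pred, average="weighted"))
```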

3. RESULT AND DISCUSSION

This study uses a dataset of Indonesian movie reviews with 17,247 data. Each data point covers three aspects, namely plot, acting, and director, and each aspect has a positive, neutral, or negative label depending on the sentiment in the data. This study has four test scenarios. The first scenario determines the baseline for the LSTM model by comparing the ratio of train data and test data. The second scenario implements TF-IDF feature extraction on the baseline. The third scenario implements the fastText feature expansion. The fourth scenario handles imbalanced data using the SMOTE technique. The accuracy value and F1-score in each scenario are the averages of five test runs.


3.1 Scenario 1

Scenario 1 testing was carried out by comparing the ratio of train data and test data in the LSTM classification model. Each aspect was tested with a ratio of train data and test data of 90:10, 80:20, and 70:30 to determine which ratio has the highest accuracy for each aspect. The results obtained from scenario 1 are shown in Table 5.
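As a sketch, the three splits compared here can be produced with scikit-learn's train_test_split; X_tfidf and y follow the names used in the earlier sketches and are assumptions of this example.

```python
# Scenario 1 ratio comparison sketch (X_tfidf and y assumed from earlier steps).
from sklearn.model_selection import train_test_split

for test_size in (0.1, 0.2, 0.3):                  # 90:10, 80:20, and 70:30 splits
    X_train, X_test, y_train, y_test = train_test_split(
        X_tfidf, y, test_size=test_size, random_state=42)
    print(test_size, X_train.shape, X_test.shape)  # train the LSTM on each split and compare
```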

Table 5. Testing Results of Scenario 1

Test Size   Plot (Accuracy / F1-Score)   Acting (Accuracy / F1-Score)   Director (Accuracy / F1-Score)
0.1         66.48% / 65.18%              87.99% / 64.61%                86.81% / 56.53%
0.2         64.99% / 63.49%              88.19% / 65.14%                85.46% / 55.07%
0.3         64.29% / 62.95%              87.28% / 63.59%                86.33% / 53.69%

In the plot aspect, the best results were obtained at a ratio of 90:10 with an accuracy value of 66.48% and an F1-score of 65.18%. In the acting aspect, the best results were obtained at a ratio of 80:20 with an accuracy value of 88.19% and an F1-score of 65.14%. In the director aspect, the best results were obtained at a ratio of 90:10 with an accuracy value of 86.81% and an F1-score of 56.53%. The best ratio value in each aspect becomes the baseline for the following scenario.

3.2 Scenario 2

Scenario 2 testing was carried out by comparing the max feature value of the TF-IDF feature extraction applied to the baseline. The max feature values used as comparisons are 1000, 5000, and 10000. The "Baseline" row applies a max feature value equal to the amount of training data in the baseline determined in scenario 1. The results obtained from scenario 2 are shown in Table 6.

Table 6. Testing Results of Scenario 2

Max Feature   Plot (Accuracy / F1-Score)          Acting (Accuracy / F1-Score)         Director (Accuracy / F1-Score)
Baseline      68.07% (+1.59%) / 66.62% (+1.44%)   88.09% (-0.1%) / 60.12% (-5.02%)     85.81% (-1.00%) / 50.95% (-5.58%)
1000          67.18% (+0.70%) / 65.15% (-0.03%)   89.32% (+1.33%) / 66.68% (+2.07%)    89.18% (+2.37%) / 57.70% (+1.17%)
5000          68.65% (+2.17%) / 67.21% (+2.03%)   87.77% (-0.22%) / 58.85% (-5.76%)    86.83% (+0.02%) / 51.75% (-4.78%)
10000         67.71% (+1.23%) / 66.51% (+1.33%)   87.68% (-0.31%) / 60.31% (-4.30%)    85.83% (-0.98%) / 50.54% (-5.99%)

In the plot aspect, the best max feature value is 5000, with an accuracy value of 68.65% and an F1-score of 67.21%. In the acting aspect, the best max feature value is 1000, with an accuracy value of 89.32% and an F1-score of 66.68%. In the director aspect, the best max feature value is 1000, with an accuracy value of 89.18% and an F1-score of 57.70%. The best max feature value in each aspect is used in the following scenario.

3.3 Scenario 3

Testing scenario 3 applies the fastText feature expansion to the best results from scenario 2. The corpora used in implementing the fastText feature expansion are the Twitter corpus, the News corpus, and the Twitter + News corpus, each containing a collection of words with similar values. Scenario 3 compares the Top 1, Top 5, Top 10, and Top 20 most similar words of each word, applied to all corpora.

Table 7. Testing Results of Scenario 3 for Plot Aspect

Top   Twitter Corpus (Acc / F1)           News Corpus (Acc / F1)              Twitter + News Corpus (Acc / F1)
1     70.25% (+1.60%) / 68.88% (+1.67%)   69.88% (+1.23%) / 68.25% (+1.04%)   70.17% (+1.52%) / 68.47% (+1.26%)
5     68.87% (+0.22%) / 67.41% (+0.20%)   69.89% (+1.24%) / 68.12% (+0.91%)   70.62% (+1.97%) / 69.02% (+1.81%)
10    67.12% (-1.53%) / 65.39% (-1.82%)   69.30% (+0.65%) / 67.54% (+0.33%)   70.21% (+1.56%) / 68.36% (+1.15%)
20    62.81% (-5.84%) / 61.01% (-6.20%)   70.18% (+1.53%) / 68.34% (+1.13%)   69.40% (+0.75%) / 67.12% (-0.09%)

In the plot aspect shown in Table 7, using the Twitter + News corpus with Top 5 gets the best results with an accuracy value of 70.62% and an F1-score of 69.02%.


Table 8. Testing Results of Scenario 3 for Acting Aspect

Top   Twitter Corpus (Acc / F1)           News Corpus (Acc / F1)              Twitter + News Corpus (Acc / F1)
1     89.46% (+0.14%) / 66.15% (-0.53%)   89.49% (+0.17%) / 66.83% (+0.15%)   89.45% (+0.13%) / 66.62% (-0.06%)
5     88.64% (-0.68%) / 64.84% (-1.84%)   89.34% (+0.02%) / 65.92% (-0.76%)   89.49% (+0.17%) / 66.31% (-0.37%)
10    88.61% (-0.71%) / 63.75% (-2.93%)   89.23% (-0.09%) / 65.39% (-1.29%)   89.33% (+0.01%) / 65.56% (-1.12%)
20    88.50% (-0.82%) / 60.82% (-5.86%)   89.11% (-0.21%) / 65.35% (-1.33%)   89.05% (-0.27%) / 65.19% (-1.49%)

In the acting aspect shown in Table 8, using the News corpus with Top 1 gets the best results with an accuracy value of 89.49% and an F1-score of 66.83%.

Table 9. Testing Results of Scenario 3 for Director Aspect

Top   Twitter Corpus (Acc / F1)           News Corpus (Acc / F1)              Twitter + News Corpus (Acc / F1)
1     89.18% (0.00%) / 58.34% (+0.64%)    89.36% (+0.18%) / 57.39% (-0.31%)   89.48% (+0.30%) / 58.46% (+0.76%)
5     88.81% (-0.37%) / 53.40% (-4.30%)   89.47% (+0.29%) / 56.69% (-1.01%)   89.40% (+0.22%) / 58.06% (+0.36%)
10    89.39% (+0.21%) / 56.73% (-0.97%)   89.48% (+0.30%) / 56.49% (-1.21%)   89.36% (+0.18%) / 56.64% (-1.06%)
20    89.17% (-0.01%) / 52.20% (-5.50%)   89.39% (+0.21%) / 56.10% (-1.60%)   89.01% (-0.17%) / 53.96% (-3.74%)

In the director aspect shown in Table 9, using the Twitter + News corpus with Top 1 gets the best results with an accuracy value of 89.48% and an F1-score of 58.46%.

3.4 Scenario 4

Scenario 4 testing is the application of SMOTE to the best results from scenario 3. SMOTE is used to overcome imbalanced data problems. The results of scenario 4 are shown in Table 10. The accuracy value and F1-score increased significantly in all aspects.

Table 10. Testing Results of Scenario 4

Model           Plot (Accuracy / F1-Score)          Acting (Accuracy / F1-Score)         Director (Accuracy / F1-Score)
Without SMOTE   70.62% / 69.02%                     89.49% / 66.83%                      89.48% / 58.46%
With SMOTE      74.86% (+4.24%) / 74.74% (+5.72%)   94.80% (+5.31%) / 94.74% (+27.91%)   94.02% (+4.54%) / 93.89% (+35.43%)

3.5 Discussion

In this research, four testing scenarios have been carried out using the Long Short-Term Memory (LSTM) classification model with the application of TF-IDF feature extraction, fastText feature expansion, and SMOTE.

Accuracy values and F1-scores in all scenarios are the average values of 5 tests.

Figure 3. Graph of Performance Improvement in Plot Aspects


After going through four scenarios, the plot aspect gets the best performance with a baseline ratio of 90:10, TF-IDF feature extraction with a max feature value of 5000, the fastText feature expansion with Top 5 on the Twitter + News corpus, and handling of imbalanced data with SMOTE. Figure 3 shows a graph of the increase in performance values from the plot aspect baseline for each scenario. The final performance value obtained on the plot aspect is an accuracy value of 74.86% and an F1-score of 74.74%.

Figure 4. Graph of Performance Improvement in Acting Aspects

After going through four scenarios, the acting aspect gets the best performance with a baseline ratio of 80:20, TF-IDF feature extraction with a max feature value of 1000, the fastText feature expansion with Top 1 in the News corpus, and the handling of unbalanced data with SMOTE. Figure 4 shows a graph of the increase in performance values from the acting aspect baseline for each scenario. The final performance value obtained in the acting aspect is an accuracy value of 94.80% and an F1-score of 94.74%.

Figure 5. Graph of Performance Improvement in Director Aspects

After going through four scenarios, the director aspect gets the best performance with a baseline ratio of 90:10, TF-IDF feature extraction with a max feature value of 1000, the fastText feature expansion with Top 1 on the Twitter + News corpus, and handling of imbalanced data with SMOTE. Figure 5 shows a graph of the increase in performance values from the director aspect baseline for each scenario. The final performance value obtained for the director aspect is an accuracy value of 94.02% and an F1-score of 93.89%.

Based on the test results, all scenarios affect the performance value of the LSTM classification model. Scenario 2 improves performance because all data is transformed into weighted vectors by the TF-IDF feature extraction process. The application of the fastText feature expansion in scenario 3 replaces words with a vector value of 0 with the value of their most similar word, which also affects performance, although not significantly. Moreover, scenario 4 gives a significant increase in performance due to the implementation of SMOTE, which balances the number of data labels by creating synthetic samples of minority labels. So, balanced data labels greatly affect the performance results of a model.

4. CONCLUSION

In this study, aspect-based sentiment analysis was carried out on Indonesian language movie reviews covering 3 aspects: plot, acting, and director. The dataset comes from Twitter, with 17,247 data labeled with positive, neutral, and negative labels. The dataset went through the preprocessing stage before entering the process of testing the LSTM classification model. The tests carried out consisted of four scenarios with the LSTM classification model. The baseline classification model is determined by comparing the ratio of train data and test data. Scenario 2 then applies TF-IDF feature extraction to the baseline with the max feature parameter. In scenario 3, the fastText feature expansion is implemented using three corpora, namely the Twitter corpus, the News corpus, and the Twitter + News corpus; the corpus is used as a dictionary to change words with a vector value of 0 into the values of similar words. Finally, scenario 4 applies SMOTE to handle the imbalanced data, which gives a significant increase in performance, so the balance of the data greatly influences the performance results of a model. For the final performance on the plot aspect, an accuracy value of 74.86% and an F1-score of 74.74% are obtained by using a baseline ratio of 90:10, TF-IDF feature extraction with a max feature value of 5000, fastText feature expansion with Top 5 on the Twitter + News corpus, and handling of imbalanced data with SMOTE. For the final performance in the acting aspect, an accuracy value of 94.80% and an F1-score of 94.74% were obtained using a baseline ratio of 80:20, TF-IDF feature extraction with a max feature value of 1000, the fastText feature expansion with Top 1 on the News corpus, and handling of imbalanced data with SMOTE. Moreover, for the final performance on the director aspect, an accuracy value of 94.02% and an F1-score of 93.89% were obtained by using a baseline ratio of 90:10, TF-IDF feature extraction with a max feature value of 1000, fastText feature expansion with Top 1 on the Twitter + News corpus, and handling of imbalanced data with SMOTE.

Suggestions for future research are to use more datasets with balanced target data labels and to use other variations of the LSTM model.

REFERENCES

[1] S. A. el Rahman, F. A. AlOtaibi, and W. A. AlShehri, “Sentiment Analysis of Twitter Data,” in 2019 International Conference on Computer and Information Sciences (ICCIS), IEEE, 2019, pp. 1–4.

[2] Z. Drus and H. Khalid, “Sentiment Analysis in Social Media and Its Application: Systematic Literature Review,” Procedia Comput Sci, vol. 161, pp. 707–714, 2019, doi: 10.1016/j.procs.2019.11.174.

[3] F. Hemmatian and M. K. Sohrabi, “A survey on classification techniques for opinion mining and sentiment analysis,” Artif Intell Rev, vol. 52, no. 3, pp. 1495–1545, Oct. 2019, doi: 10.1007/s10462-017-9599-6.

[4] N. S. Fathullah, Y. A. Sari, and P. P. Adikara, “Analisis Sentimen Terhadap Rating dan Ulasan Film dengan menggunakan Metode Klasifikasi Naïve Bayes dengan Fitur Lexicon-Based,” J. Pengemb. Teknol. Inf. dan Ilmu Komput., vol. 4, no. 2, pp. 590–593, 2020.

[5] B. N. Saha and A. Senapati, “Long Short Term Memory (LSTM) based Deep Learning for Sentiment Analysis of English and Spanish Data,” in 2020 International Conference on Computational Performance Evaluation (ComPE), IEEE, 2020, pp. 442–446.

[6] L. Zhang, S. Wang, and B. Liu, “Deep learning for sentiment analysis: A survey,” Wiley Interdiscip Rev Data Min Knowl Discov, vol. 8, no. 4, p. e1253, 2018.

[7] L. C. Cheng and S. L. Tsai, “Deep learning for automated sentiment analysis of social media,” in Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019, pp. 1001–1004.

[8] A. Yadav and D. K. Vishwakarma, “Sentiment analysis using deep learning architectures: a review,” Artif Intell Rev, vol. 53, no. 6, pp. 4335–4385, 2020.

[9] F. Miedema, “Sentiment Analysis with Long Short-Term Memory networks,” Vrije Universiteit Amsterdam, vol. 1, pp. 1–17, 2018.

[10] S. M. Qaisar, “Sentiment Analysis of IMDb Movie Reviews Using Long Short-Term Memory,” in 2020 2nd International Conference on Computer and Information Sciences (ICCIS), IEEE, 2020, pp. 1–4.

[11] R. Ahuja, A. Chug, S. Kohli, S. Gupta, and P. Ahuja, “The Impact of Features Extraction on the Sentiment Analysis,” Procedia Comput Sci, vol. 152, pp. 341–348, 2019, doi: 10.1016/j.procs.2019.05.008.

[12] R. Dzisevič and D. Šešok, “Text Classification using Different Feature Extraction Approaches,” in 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), 2019.

[13] E. Anggi, “Text Classification on Disaster Tweets with LSTM and Word Embedding,” Towards Data Science, 2020. https://towardsdatascience.com/text-classification-on-disaster-tweets-with-lstm-and-word-embedding-df35f039c1db (accessed May 23, 2022).

[14] H. R. Alhakiem and E. B. Setiawan, “Aspect-Based Sentiment Analysis on Twitter Using Logistic Regression with FastText Feature Expansion,” Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), vol. 6, no. 5, pp. 840–846, Nov. 2022, doi: 10.29207/resti.v6i5.4429.

[15] S. A. Alex, N. Z. Jhanjhi, M. Humayun, A. O. Ibrahim, and A. W. Abulfaraj, “Deep LSTM Model for Diabetes Prediction with Class Balancing by SMOTE,” Electronics (Switzerland), vol. 11, no. 17, Sep. 2022, doi: 10.3390/electronics11172737.

[16] A. Fernández, S. García, F. Herrera, and N. V. Chawla, “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary,” Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.

[17] B. Athiwaratkun, A. G. Wilson, and A. Anandkumar, “Probabilistic FastText for Multi-Sense Word Embeddings,” 2018.

[18] B. Wang, A. Wang, F. Chen, Y. Wang, and C.-C. J. Kuo, “Evaluating word embedding models: methods and experimental results,” APSIPA Trans Signal Inf Process, vol. 8, 2019.

[19] S. Seo, C. Kim, H. Kim, K. Mo, and P. Kang, “Comparative Study of Deep Learning-Based Sentiment Classification,” IEEE Access, vol. 8, pp. 6861–6875, 2020, doi: 10.1109/ACCESS.2019.2963426.

[20] F. Landi, L. Baraldi, M. Cornia, and R. Cucchiara, “Working Memory Connections for LSTM,” Neural Networks, vol. 144, pp. 334–341, Dec. 2021, doi: 10.1016/j.neunet.2021.08.030.

[21] A. Suresh, “What is a confusion matrix?,” Medium: Analytics Vidhya, 2020. https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5 (accessed May 15, 2022).
