Sentiment Analysis of Flip App Users on Google Play Using Naïve Bayes and SVM with SMOTE


Sentiment Analysis Review Flip App Users on Google Play Using Naïve Bayes Algorithm and Support Vector Machine with SMOTE Technique

Hermanto^1,a), Taufik Asra^2,b), Antonius Yadi Kuntoro^3,c), Riza Fahlapi^4,d), Lasman Effendi^1,e), and Ferry Syukmana^4,f)

1)Teknologi Komputer, Universitas Bina Sarana Informatika, Jakarta, Indonesia

2)Rekayasa Perangkat Lunak, Universitas Bina Sarana Informatika, Jakarta, Indonesia

3)Sistem Informasi, Universitas Nusa Mandiri, Jakarta, Indonesia

4)Teknologi Informasi, Universitas Bina Sarana Informatika, Jakarta, Indonesia

a)Corresponding Author: [email protected]

b)Electronic mail: [email protected]

c)Electronic mail: [email protected]

d)Electronic mail: [email protected]

e)Electronic mail: [email protected]

f)Electronic mail: [email protected]

Abstract. E-wallets have become increasingly sophisticated and let customers make transactions anytime and anywhere using only a smartphone. Among the available e-wallet products, this study takes as its case study FLIP, a product that is currently going viral, especially in Jakarta. Customers who are dissatisfied with a company’s services or products typically write their complaints on social media or in reviews on Google Play.

However, monitoring and organizing public opinion is not easy. Therefore, a method is required that can automatically categorize reviews as positive or negative. The algorithms used in this study were Naïve Bayes and Support Vector Machine (SVM) with the SMOTE technique. Naïve Bayes achieved an accuracy of 64.55% with an AUC of 0.502, while Naïve Bayes with SMOTE achieved an accuracy of 69.78% with an AUC of 0.506.

SVM achieved an accuracy of 65.00% with an AUC of 0.786, while SVM with SMOTE achieved an accuracy of 73.48% and an AUC of 0.836. The best-performing model is therefore SVM with SMOTE, which can provide a solution to the classification problem in sentiment analysis of FLIP app user reviews.

INTRODUCTION

The digital age has led people into a new lifestyle that cannot be separated from electronic devices. Technology has become a tool that serves human needs; with it, many things can be done more easily, and its role has been important enough to bring civilization into the digital age. Increasingly sophisticated technology offers conveniences in transportation, information, education, and shopping transactions, for example through e-wallets, which can now be used easily from a smartphone. An e-wallet can be understood as a digital wallet, or electronic money, that facilitates non-cash payment transactions. For mobile payments, and e-wallets in particular, to replace cash, users must be willing to use them in place of conventional payments. Prior work analyzing the adoption of mobile payments in relation to trust states that trust has a great influence on user adoption of mobile payments [2]. The influx of technology undeniably affects many facets of human life, including buying and selling and the use of cash. In Indonesia there are many digital financial services, including e-wallets, which let users make transactions for various purposes with the balance in the e-wallet. Google Play provides a feature containing user reviews of an application. User reviews are often used as an effective and efficient tool for finding information about a product or service. Recent research found that almost 50% of internet users rely on word-of-mouth recommendations before using a product, because reviews from other users can provide the latest information about a product from the perspective of people who already use it.

Customers or clients who are dissatisfied with the services or products offered by a company will usually write their complaints on social media or in reviews on Google Play. On the other hand, satisfied customers also express their positive attitude towards a product on social media or in Google Play reviews. Whether a company realizes it or not, customer opinions written on social media or in Google Play reviews have some impact, small or large, on potential customers. The opinions posted are too numerous to be processed manually.

Therefore, a method is required that can automatically categorize reviews as positive or negative. The volume of user review data for the Flip app, a digital wallet, on the Google Play site continues to grow over time. This makes it difficult for the company to obtain an overall picture from all reviews, because reading every incoming review on the Google Play page takes a long time.

There are many user reviews of the Flip app on Google Play. A good brand image forms a good opinion among consumers about a product or service and is expected to encourage purchases, and vice versa. The wide variety of responses on the Google Play site will naturally affect the image of FLIP. Negative or positive responses from users may be influenced by matters that FLIP has not yet addressed. Text mining makes it possible to see which topics users discuss most often. Sentiment analysis has the task of grouping the texts in a sentence or document and then determining whether the opinion stated in the document is positive, negative, or neutral [3].

Sentiment measurement in open source information is currently an active research area [4].

Machine learning methods are applied to classify the polarity of a news story from a very large data source. This can be done with one of the functions of text mining, namely document classification [5]. Turban et al. explain that text mining is similar to data mining: both aim to obtain information and knowledge from a very large set of data, which can take the form of a database. They differ, however, in the type of data: data mining takes structured data as input, whereas text mining starts from unstructured data [6]. Text mining applies data mining concepts and techniques to look for patterns in text, with the aim of finding useful information for a specific purpose [7]. Text mining can serve a variety of purposes, including summarization, text document search, and sentiment analysis. Previous research on sentiment analysis includes a study of Gojek and Grab app user reviews. The method used in that study is SVM with PSO as a feature update, in addition to the TF-IDF feature. The data obtained were labeled positive and negative and then corrected by a linguist. The study used 1,380 data items divided into two types of data with a ratio of 70%


analysis of sentiment towards online learning during the COVID-19 period using a support vector machine algorithm based on particle swarm optimization, with an accuracy value of 71.39%. The support vector machine (SVM) can be a solution to improve the accuracy and AUC of analyzing public sentiment regarding online learning during the COVID-19 period [8]. This study discusses the steps taken to carry out sentiment analysis of comments about the Flip application on Google Play, from the preprocessing stage to the sentiment analysis stage with the Naïve Bayes classifier and Support Vector Machine with the SMOTE technique.

METHOD

Sentiment analysis has the task of grouping the texts in a sentence or document and then determining whether the opinion stated in the document is positive, negative, or neutral [8].

Sentiment analysis faces many challenges, including identifying the subject or object that an assessment in a document addresses and whether the opinion expressed is positive or negative; in addition, the strength of an opinion and its target may have to be sought outside the given sentence [9].

Text Mining

Text mining is the process of searching for knowledge in the data contained in documents or text, with the aim of extracting and identifying information [10]. The text encountered in text mining usually has a complex and incomplete structure, unclear meaning, and non-standard, varied forms [11].

Support Vector Machine (SVM)

SVM is a classification method for linear and nonlinear data. In short, an SVM is an algorithm that uses a nonlinear mapping to transform the original training data into a higher dimension.

In this new dimension, it searches for the optimal linear separating hyperplane (that is, the “decision boundary” that separates tuples of one class from another). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.

The SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (determined by the support vectors) [13].
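The idea of mapping data to a higher dimension can be illustrated with a minimal sketch. The toy data, the mapping `feature_map`, and the hand-picked weights below are all assumptions made for illustration; they are not taken from the study, and no SVM training is performed here.

```python
# Toy 1-D data: class +1 lies outside [-1, 1], class -1 lies inside it.
# No single threshold on x separates the two classes.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]


def feature_map(x):
    """Hypothetical nonlinear mapping phi(x) = (x, x^2) into two dimensions."""
    return (x, x * x)


def decide(x, w=(0.0, 1.0), b=-2.0):
    """Sign of w . phi(x) + b. The values of w and b are hand-picked, not learned."""
    z = feature_map(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score >= 0 else -1


predictions = [decide(x) for x in points]
```

In the mapped space the hyperplane x^2 = 2 separates the two classes, although no hyperplane (threshold) does so in the original one-dimensional space. An actual SVM would additionally choose the hyperplane with the largest margin.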

Naïve Bayes

Naïve Bayes (NB) is a simple probabilistic classifier based on Bayes’ theorem. Bayes’ theorem is combined with the “naïve” assumption that each attribute or variable is independent.

The Naïve Bayes classifier can be trained efficiently in supervised learning. It assumes that the presence or absence of a feature in a class is unrelated to the presence or absence of other features in the same class. Because Naïve Bayes is a supervised learning method, the learning stage requires initial data in the form of training data in order to make decisions [14].
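As a rough illustration of the independence assumption, the following sketch implements a small multinomial Naïve Bayes classifier with Laplace smoothing. The smoothing choice, the function names, and the four toy reviews are assumptions for illustration only; the study itself used the Naïve Bayes operator in RapidMiner.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and words per class from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(tokens, model):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Each word contributes independently: the "naive" assumption.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy reviews, already tokenized.
train_docs = [
    ["transfer", "cepat", "mudah"],
    ["aplikasi", "bagus", "cepat"],
    ["transfer", "gagal", "lambat"],
    ["saldo", "hilang", "gagal"],
]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
```

Calling `predict_nb(["transfer", "cepat"], model)` scores each class by adding the log prior to the per-word log likelihoods, which is where the independence assumption enters.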


SMOTE Technique

SMOTE (Synthetic Minority Over-sampling Technique) balances the class distribution by adding sample data to the minority class until the number of minority samples matches the number of samples in the majority class [6]. The use of SMOTE can lead to overfitting, because the added minority-class data closely resemble the existing data, so nearly identical training data may occur. The SMOTE procedure starts by calculating the distances between samples in the minority class [15].
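The distance-based procedure can be sketched in simplified form as follows. This is an illustrative approximation of SMOTE, not the implementation used in the study: the function names, the value of `k`, the random choice of the base sample, and the toy points are all assumptions.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Step 1: distances from the base sample to the other minority samples.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Step 2: interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Invented toy minority class with 3 samples; assume the majority class has 6.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
new_points = smote(minority, n_new=3)
balanced_minority = minority + new_points
```

Because every synthetic point is an interpolation between two existing minority samples, the new data stay close to the original ones, which is one way to see why oversampling can encourage overfitting.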

In this study, the Flip application user review data obtained by scraping are complete user comments consisting of a title and comment content; incomplete comments are deleted. In addition, the number of words in the comments ranges from 100 to 1000. A total of 440 user reviews were taken from the last 3 months. Each document is then labeled manually by linguists as objective or subjective, in order to classify whether the Flip application user reviews are positive or negative.

RESULT AND DISCUSSION

This stage determines the object of research. In this study, the dataset is a collection of text comments on the Flip application from Google Play. The object of research is understood by examining several user reviews of the Flip application. The motivation in this phase is that the news presented usually takes the form of text on digital media, grouped by the discussion title of each comment category.

Sentiment analysis is carried out to find a classification method that can help determine whether the titles of news articles are positive or negative. This stage also involves understanding the algorithms to be used, namely Support Vector Machine (SVM) and Naïve Bayes (NB), with the SMOTE technique added to the classification method applied to the local dataset.

Data understanding

At the data understanding stage, raw data are collected according to the required attributes. User comment data are taken from Google Play. A total of 1000 records are taken and cleaned by selecting only the important attribute, namely the comment text. One status attribute is added to hold the label. All review data are grouped as either positive or negative and stored in a file with the .xlsx extension. On this basis, the classification model for Flip application user comments uses the Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms, with the SMOTE technique added to improve the accuracy of the classification method on the Flip application review text dataset.

Data Preparation

Data preparation readies the data for the text mining process applied to user reviews of the Flip application. The data taken from Instagram consist of several columns, namely the account name and the comment text. Within the scope of this problem, only the comment text is used, together with a new field, status, which serves as the class or label. Before the proposed model is applied, the data are cleaned by eliminating noisy and inconsistent data so that they suit the calculation method.

Transform Cases

This step changes all characters (letters) to lowercase. After the transform cases stage, the entire content of each document is in lowercase.

The documents are then processed at the tokenization stage.

Tokenization

Tokenization is carried out after transform cases. All unnecessary characters are discarded, including redundant white space and all punctuation. This process is applied to every document in the collection, yielding the distinct words that represent each document.
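A minimal sketch of the transform cases and tokenization steps is shown below. It assumes that tokens are split on any run of non-letter characters, so digits are dropped along with punctuation; the function names and the sample review are invented for illustration.

```python
import re


def transform_case(text):
    """Convert every letter in the text to lowercase."""
    return text.lower()


def tokenize(text):
    """Split on runs of non-letter characters, discarding punctuation,
    digits and redundant white space (assumes lowercase input)."""
    return [tok for tok in re.split(r"[^a-z]+", text) if tok]


# Invented sample review.
review = "Aplikasi FLIP sangat membantu!!  Transfer cepat, gratis."
tokens = tokenize(transform_case(review))
```

The order matters in this sketch: `tokenize` only recognizes lowercase letters, so it relies on `transform_case` having been applied first, mirroring the order of the stages described above.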

Filter Token (By Length)

The output of the stopword filter (dictionary) is passed to the Filter Tokens (by Length) process. This process deletes words that are shorter than 4 or longer than 25 characters, such as di, ada, and by, which have no meaning of their own when separated from other words and are unrelated to sentiment-bearing adjectives.
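The length filter can be sketched as a single list comprehension. The function name and the sample tokens are assumptions; the bounds of 4 and 25 characters follow the description above.

```python
def filter_by_length(tokens, min_chars=4, max_chars=25):
    """Keep only tokens whose length lies between min_chars and max_chars."""
    return [tok for tok in tokens if min_chars <= len(tok) <= max_chars]


# Invented sample tokens, including the short words mentioned in the text.
sample_tokens = ["di", "ada", "by", "aplikasi", "flip", "transfer", "cepat"]
kept_tokens = filter_by_length(sample_tokens)
```

Here `di`, `ada`, and `by` are removed because they have fewer than 4 characters, while a four-letter token such as `flip` is kept.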

Filter Stopword (Dictionary)

The stopword stage complements the filter tokens (by length) stage. Words that consist of more than 3 letters and appear in the stopword list are discarded, because such words do not reflect the content of a document even though they occur often.
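A dictionary-based stopword filter can be sketched as a set lookup. The stopword list below is a small hypothetical example, not the dictionary used in the study, and the function name is likewise an assumption.

```python
# Hypothetical stopword dictionary; the study's actual list is not given.
STOPWORDS = {"yang", "dan", "untuk", "dengan", "sangat"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Discard every token that appears in the stopword dictionary."""
    return [tok for tok in tokens if tok not in stopwords]


# Invented sample tokens.
filtered = remove_stopwords(["aplikasi", "sangat", "membantu", "untuk", "transfer"])
```

Using a set for the dictionary keeps each lookup at constant time, which matters little for a few hundred reviews but is the usual choice.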

Stemming

All words selected as tokens in the previous stages are converted to their root (base) form.

Modeling Stages

At this stage, the algorithms are chosen and the data are analyzed with them. In this study, the Support Vector Machine and Naïve Bayes algorithms are compared in terms of accuracy and AUC value.

Testing Stages

Based on the dataset obtained from pre-processing, a process design is used in this study. The classification model uses the Naïve Bayes algorithm as one of the classifiers. The model is implemented in RapidMiner 8.1 with the following process design:

FIGURE 1. Testing Stages

Figure 1 shows the process design with the SVM and NB cross-validation operators. The test uses clean data that have gone through preprocessing. The data are loaded with the Read Excel operator because the dataset is stored in Excel format (.xlsx). The Process Documents from Files operator converts files into documents. The validation process consists of training data and testing data.
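The split into training and testing data inside cross-validation can be sketched as follows. The number of folds (`k=10`) is an assumption, since the text does not state it, and the sketch neither shuffles nor stratifies the data; the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so that every sample is used
    for testing exactly once across the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 440 matches the number of filtered comments reported in the paper.
folds = list(k_fold_indices(440, k=10))
```

In each round a model would be trained on `train_idx` and evaluated on `test_idx`, and the per-fold results averaged.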

Accuracy and AUC value

The NB and SVM algorithms, with and without the SMOTE technique, were evaluated on 440 filtered comments by comparing their accuracy and AUC values. The evaluation results are shown in the following table:

TABLE 1. Accuracy and AUC

Model                  Accuracy   AUC
Naïve Bayes            64.55%     0.502
SVM                    65.00%     0.786
Naïve Bayes + SMOTE    69.78%     0.506
SVM + SMOTE            73.48%     0.836


Based on the evaluation of the SVM and NB models without SMOTE and of SVM and NB with the SMOTE technique, the best test result among all models is obtained by SVM with SMOTE. Therefore, the weights used in modeling the application are based on the test results of the SVM with the SMOTE technique.
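For reference, the two metrics compared above can be computed as in the following sketch. The labels and scores are invented toy values, not the study's data, and the 0.5 decision threshold and function names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a randomly chosen positive receives a higher score
    than a randomly chosen negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy labels (1 = positive review) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_accuracy = accuracy(y_true, y_pred)
toy_auc = auc(y_true, scores)
```

Accuracy depends on the chosen threshold, while AUC measures only how well the scores rank positives above negatives. An AUC close to 0.5 corresponds to a ranking no better than chance.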

CONCLUSION

This research applied two data mining classifiers, the NB algorithm and SVM, to Flip application user review data and evaluated them by accuracy and AUC (Area Under the Curve). NB has an accuracy of 64.55% with an AUC of 0.502, while NB with the SMOTE technique obtains an accuracy of 69.78% with an AUC of 0.506. SVM has an accuracy of 65.00% with an AUC of 0.786, while SVM with the SMOTE technique has an accuracy of 73.48% and an AUC of 0.836. These results come from testing NB, SVM, NB+SMOTE, and SVM+SMOTE. SVM with the SMOTE technique achieves the highest accuracy and AUC, exceeding both variants of Naïve Bayes. It can therefore be concluded that applying the SMOTE technique to the SVM model can improve the accuracy and AUC of sentiment analysis of Flip application user comments compared with the NB algorithm.

REFERENCES

1. Lu Y Yang S Chau P Y K and Cao Y, 2011 Dynamics between the trust transfer process and intention to use mobile payment services: A cross-environment perspective Inf. Manag. 48, 8 p. 393–403

2. Buntoro G A, 2017 Analisis Sentimen Calon Gubernur DKI Jakarta 2017 Di Twitter INTEGER J. Inf. Technol. 1, 1 p. 32–41.

3. Dhande L L and Patnaik G K, 2014 Analyzing Sentiment of Movie Review Data using Naive Bayes Neural Classifier Int. J. Emerg. Trends Technol. Comput. Sci. 3, 4 p. 313–320.

4. Nurhuda F Widya Sihwi S and Doewes A, 2016 Analisis Sentimen Masyarakat terhadap Calon Presiden Indonesia 2014 berdasarkan Opini dari Twitter Menggunakan Metode Naive Bayes Classifier J. Teknol. Inf. ITSmart 2, 2 p. 35.

5. Aditia Rakhmat Sentiaji A M B Sarjana P S Statistika D Matematika F Ilmu D A N and Alam P, 2014 Analisis Sentimen Terhadap Acara Televisi Berdasarkan Opini Publik J. Ilm. Komput. dan Inform.

6. Anjani D, 2015 Analisis Kemiripan Dokumen Tugas Akhir Untuk Penilaian Originalitas Bandung 30, 3 p. 243–250.

7. Hermanto Kuntoro A Y Asra T Pratama E B Effendi L and Ocanitra R, 2020 Gojek and Grab User Sentiment Analysis on Google Play Using Naive Bayes Algorithm and Support Vector Machine Based Smote Technique J. Phys. Conf. Ser. 1641, 1.

8. Saputra N Adji T B and Permanasari A E, 2015 Analisis Sentimen Data Presiden Jokowi dengan Preprocessing Normalisasi dan Stemming Menggunakan Metode Naive Bayes dan SVM J. Din. Inform. 5, November p. 12.

9. Hermanto and Noviriandini A, 2021 ANALISA SENTIMEN TERHADAP BELAJAR ONLINE PADA MASA COVID-19 MENGGUNAKAN ALGORITMA SUPPORT VECTOR 5, 1 p. 129–136.

10. Slamet C Atmadja A R Maylawati D S Lestari R S Darmalaksana W and Ramdhani M A, 2018 Automated Text Summarization for Indonesian Article Using Vector Space Model IOP Conf. Ser. Mater. Sci. Eng. 288, 1.


11. Puspitasari A M Ratnawati D E and Widodo A W, 2018 Klasifikasi Penyakit Gigi Dan Mulut Menggunakan Metode Support Vector Machine J. Pengemb. Teknol. Inf. dan Ilmu Komput. 2, 2 p. 802–810.

12. Santoso V I Virginia G and Lukito Y, 2017 Penerapan Sentiment Analysis Pada Hasil Evaluasi Dosen Dengan Metode Support Vector Machine J. Transform. 14, 2 p. 72.

13. Anggraini R A Widagdo G Budi A S and Qomaruddin M, 2019 Penerapan Data Mining Classification untuk Data Blogger Menggunakan Metode Naïve Bayes J. Sist. dan Teknol. Inf. 7, 1 p. 47.