Sentiment and Discussion Topic Analysis on Social Media Group using Support Vector Machine

Salsabila Putri Adityani*, Donni Richasdy, Widi Astuti

Fakultas Informatika, Program Studi Informatika, Universitas Telkom, Bandung, Indonesia

Email: 1,*salsaadityani@student.telkomuniversity.ac.id, 2donnir@telkomuniversity.ac.id, 3widiwdu@telkomuniversity.ac.id
Corresponding author's email: salsaadityani@gmail.com

Abstract−The growth of social media in this modern era is increasingly rapid, and people are highly active in interacting with each other digitally. People who share a common interest, or who simply like being part of a community, often gather in online groups, especially on Facebook. Alumni of Telkom University are no exception: they actively discuss and share information in the Telkom University Alumni Forum (FAST) Facebook group. Using statuses from that group, sentiment analysis can be performed to determine whether the polarity of each status is positive, neutral, or negative. In addition, topic modeling extracts the topics that are most often discussed in the group. In this research, sentiment analysis was performed using the Support Vector Machine (SVM) method, with TF-IDF for word weighting and a confusion matrix for performance measurement. Several testing scenarios were carried out to obtain the best accuracy. Based on the tests performed on the preprocessing techniques and the addition of n-gram feature extraction, the highest accuracy obtained is 80.56%. This result indicates that the best performance is achieved by combining the preprocessing techniques without the stopword removal step with unigram feature extraction. Moreover, the topics discussed, based on the topic modeling results, relate to telecommunications and Telkom, Indonesia, alumni, and FAST.

Keywords: Telkom University; Facebook; Sentiment Analysis; Support Vector Machine; Topic Modeling

1. INTRODUCTION

In today's modern era, social media is growing rapidly along with the development of the internet and technology.

People often share opinions or responses about an event, product, activity, or person through social media, which results in interaction between users. Those who share common interests, or who simply like being part of a community, often come together and create groups online, especially on Facebook. The reason is that Facebook makes it easy for its users to create groups with features that can be customized to their needs.

Facebook is a popular social network that is widely used both in Indonesia and worldwide. According to various online sources, among them Statista [1], as of January 2022 Facebook still ranked first as the most popular social network in the world, as well as in Indonesia, with the highest number of active users. Relatedly, people often share information, criticism, and opinions about something or someone with the public through social media. As a result, Facebook holds a large and diverse dataset that can be used to overcome the limitations of lab-based studies by providing access to records of user behavior expressed in a natural environment [2], so researchers no longer need to rely on traditional survey methods.

Referring to the Ministry of Education and Culture policy regarding the Tracer Study program, every university is required to conduct an alumni survey that measures how well the college prepares its students to enter the working world and that serves as input for evaluation. For this reason, sentiment analysis can be used to conduct the survey without having to interact directly with alumni. Sentiment analysis is an important area of research in social media analysis because it concentrates on detecting the polarity of opinions or emotions in texts on social media [3].

Research on sentiment analysis and topic modeling has been carried out by several researchers. In a study conducted by Handayani et al in 2020, the Support Vector Machine algorithm was used to classify positive and negative sentiment in comments from BNI Mobile Banking application users. By applying the K-Fold Cross-Validation method, the highest accuracy increased from 78.19% to 78.45%. 10-Fold Cross-Validation was used because it had become the standard validation method in previous research [4].

In another study, conducted by Jaman et al in 2019, sentiment analysis was performed on tweets discussing online motorcycle taxi services, collected from Twitter, using the SVM method with TF-IDF feature selection.

The dataset is divided into three classes: positive, neutral, and negative sentiment. The classification process is carried out using several train/test split scenarios: 50:50, 60:40, 70:30, 80:20, and 90:10. The four kernels used in the classification process are linear, RBF, sigmoid, and polynomial. The highest accuracy, more than 80%, was obtained in the scenario with 90% train data and 10% test data using the linear and sigmoid kernels [5].

Moreover, in the research conducted by Kumari et al in 2017, SVM was applied to a dataset of smartphone product reviews to determine the polarity of the sentiment, whether positive or negative. The highest accuracy obtained was 90.99%, and it was also noted that SVM is a better and more robust method [6].

DOI: 10.30865/mib.v6i3.4233

In another study, conducted by Najadat et al in 2018, sentiment analysis was performed on customer statuses on the official Facebook pages of three Jordanian telecommunications companies, using and comparing several supervised learning methods: K-Nearest Neighbors, Support Vector Machine, Naïve Bayes, and Decision Tree. The results show that SVM was superior to the other three methods in terms of accuracy and F-measure in every experimental scenario [7].

Furthermore, in the research performed by Rahmadan et al in 2020, sentiment analysis and topic modeling were carried out on a collection of Indonesian-language tweets discussing the flood disaster in Jakarta. Researchers implemented a lexicon-based sentiment analysis that resulted in 10% positive sentiment, 11% neutral sentiment, and 79% negative sentiment. Then, topic modeling is carried out using the LDA method. Nine topics were obtained consisting of the distribution of words that generally contained information about flood areas, the impact caused by floods, conditions when the disaster occurred, and input from the people to the government regarding flood disaster management [8].

In this study, a sentiment analysis model was built to classify the polarity of discussions among alumni in the Telkom University Alumni Forum (FAST) Facebook group using the Support Vector Machine (SVM) method, because the method has proven efficient in overcoming the problem of imbalanced data in sentiment analysis [3]. In addition, based on previous studies, the SVM algorithm can produce the highest confusion-matrix values compared to other classification algorithms; the research by Jaman et al [5] showed that SVM can reach an accuracy of more than 80%. The topics discussed are modeled with the Latent Dirichlet Allocation (LDA) method. It is hoped that the model will classify alumni sentiments correctly and accurately, so that it can serve as input for the university to evaluate and improve the quality of its services and maintain its title as the best private university.

2. RESEARCH METHODOLOGY

2.1 System Flow Design

The alumni’s sentiment and discussion topics analysis system has several stages:

a. Data collecting aims to gather a set of data for training and testing the machine learning model.

b. Data labeling, where each data item is manually labeled positive, neutral, or negative.

c. Text preprocessing, consisting of data cleaning to remove marks, symbols, numbers, etc.; case folding to change all letters to lowercase; tokenizing to break sentences into words; normalization to turn non-standard words into their standard forms; stemming to remove affixes from words; and stopword removal to erase words considered meaningless.

d. Splitting data into training and testing data.

e. Word weighting using TF-IDF and N-gram to extract the features.

f. Training the model with the Support Vector Machine algorithm.

g. Evaluating the performance of the model with confusion matrix.

h. The topic modeling is carried out with the Latent Dirichlet Allocation to extract topics from the data.

i. Lastly, topic visualization is performed to visualize the topics for further analysis.

The design of the overall system is shown in Figure 1.

Figure 1. Overall system flow

2.2 Dataset

The dataset used in this study consists of 481 statuses containing discussions of Telkom University alumni, collected from the Telkom University Alumni Forum (FAST) Facebook group between August 27, 2018, and February 19, 2022. Data collection was performed through crawling using Selenium tools. Each of these statuses was then labeled with one of three classes: positive (1), neutral (0), or negative (-1). The labeling process involved three people labeling each item in the dataset to reduce bias, and the majority label was kept. For example, if the first person labels an item positive, the second negative, and the third positive, the item is labeled positive. The purpose of this process is to provide ground truth for training the machine learning model to make predictions in the classification process. An example of labeled data can be seen in Table 1.
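The majority-vote labeling described above can be sketched as follows (a minimal illustration; how ties between three different labels would be resolved is not stated in the paper):

```python
from collections import Counter

def majority_label(annotations):
    """Return the label chosen by most annotators, e.g. [1, -1, 1] -> 1."""
    # Note: if all three annotators disagree there is no majority;
    # the paper does not state how such ties were resolved.
    return Counter(annotations).most_common(1)[0][0]
```

For instance, `majority_label([1, -1, 1])` returns 1, matching the example in the text.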

Table 1. Labeled data example

No | Status | Label
1. | Halo kakak, rekan, teman-teman. Apakah kita grup whatsapp atau telegram? | 0
2. | Jadi cuma begini ya Pemilu yg megah itu ☺ | -1
3. | Alhamdulillah.. turut bahagia dan berbangga atas pengukuhan gubes Prof Dr Suyanto, ST, M.Sc | 1

2.3 Preprocessing Text

Data preprocessing is the first step in text processing [4]. This step aims to prepare the text on the dataset before entering the following process by changing the text into a better form so that the resulting information has good quality [9]. Preprocessing techniques applied in this study include data cleaning, case folding, tokenizing, normalization, stemming, and stopword removal [10].

a. Data cleaning is the process of cleaning text by removing punctuation marks, numbers, symbols, emoticons or emojis, and URL links.

b. Case Folding is the next process that aims to convert each letter into the same form, i.e., lowercase.

c. In the tokenizing process, sentences are broken down into parts of words, called tokens [11]. This process aims to simplify the text into concise input for the classification process [12].

d. Then, normalization turns non-standard words into standard ones and expands abbreviations or acronyms into their original words [9].

e. The next step is stemming, whose purpose is to remove affixes from words and return them to their base words [4]. This process is carried out using Sastrawi, an Indonesian stemming library.

f. The last preprocessing step is stopword removal: the removal of common words that are considered meaningless despite occurring frequently in the text [12].
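The steps above can be sketched in plain Python (a simplified illustration: the word lists below are just large enough to reproduce the example in Table 2, whereas the study uses a full normalization dictionary, the Sastrawi stemmer, and an Indonesian stopword list, none of which are reproduced here):

```python
import re

# Illustrative lexicons only, sized for the Table 2 example.
NORMALIZATION = {"yg": "yang", "ya": "iya", "pemilu": "pemilihanumum"}
STOPWORDS = {"jadi", "cuma", "begini", "iya", "yang", "itu"}

def preprocess(text, remove_stopwords=True):
    text = re.sub(r"https?://\S+", " ", text)       # a. data cleaning: URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # a. punctuation, numbers, emoji
    text = text.lower()                             # b. case folding
    tokens = text.split()                           # c. tokenizing
    tokens = [NORMALIZATION.get(t, t) for t in tokens]  # d. normalization
    # e. stemming (Sastrawi) would be applied here; omitted to stay dependency-free
    if remove_stopwords:                            # f. stopword removal
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens
```

Running `preprocess("Jadi Cuma begini ya Pemilu yg megah itu ☺")` yields `['pemilihanumum', 'megah']`, matching the final row of Table 2.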

An overview of this preprocessing stage is shown in Table 2.

Table 2. Preprocessing stage

Preprocessing | Result
Raw data | Jadi Cuma begini ya Pemilu yg megah itu ☺
Data cleaning | Jadi Cuma begini ya Pemilu yg megah itu
Case folding | jadi cuma begini ya pemilu yg megah itu
Tokenizing | ‘jadi’, ‘cuma’, ‘begini’, ‘ya’, ‘pemilu’, ‘yg’, ‘megah’, ‘itu’
Normalization | ‘jadi’, ‘cuma’, ‘begini’, ‘iya’, ‘pemilihanumum’, ‘yang’, ‘megah’, ‘itu’
Stemming | ‘jadi’, ‘cuma’, ‘begini’, ‘iya’, ‘pemilihanumum’, ‘yang’, ‘megah’, ‘itu’
Stopword removal | ‘pemilihanumum’, ‘megah’

2.4. Term Frequency-Inverse Document Frequency (TF-IDF) Word Weighting

After preprocessing, the data is ready for the next stage, weighting with TF-IDF. Term Frequency-Inverse Document Frequency (TF-IDF) is a method of determining the weight of a word by giving different weights to each word in a document based on the frequency of the word in that document and its frequency across all documents [13]. The first step in this process is to calculate the frequency of occurrence of a word in a document (TF) with equation (1).

    tf_t = 1 + log(tf_t)    (1)

where tf_t is the number of occurrences of the word t.

Next, the number of documents containing a given word is counted, and its inverse (IDF) is calculated [13] with equation (2).

    idf_t = log(D / df_t)    (2)

Note:
idf_t : inverse document frequency of the word t
D : the number of all existing documents
df_t : the number of documents containing the word t

The last step in this process is to calculate the TF-IDF value by multiplying the TF result by the IDF result [13] with equation (3).

    W_t,d = tf_t × idf_t    (3)

Note:
W_t,d : weight of the word t in document d
tf_t : number of occurrences of the word t
idf_t : inverse document frequency of the word t

2.5 N-gram Feature Extraction

An n-gram is a contiguous sequence of n words taken from a string, where the words are separated based on their order in a particular sentence [14]. In this study, the n-grams used were n=1 (unigram) and n=2 (bigram). An example of the application of unigrams and bigrams can be seen in Table 3.

Table 3. N-gram application example

N-gram | Result
Unigram | “turut”, “bahagia”, “dan”, “berbangga”
Bigram | “turut bahagia”, “bahagia dan”, “dan berbangga”
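The word weighting of Section 2.4 and the n-gram extraction above can be sketched in a few lines of plain Python (a simplified illustration of equations (1)–(3); log base 10 is assumed, and the paper does not name the implementation it actually used):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of n words over the token list (unigram: n=1, bigram: n=2).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    D = len(docs)
    # df_t: number of documents containing the term t (for eq. (2))
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        w = {}
        for t, f in Counter(doc).items():
            tf = 1 + math.log10(f)        # eq. (1)
            idf = math.log10(D / df[t])   # eq. (2)
            w[t] = tf * idf               # eq. (3)
        weights.append(w)
    return weights
```

Note that a term occurring in every document gets idf = log(D/D) = 0 and therefore zero weight, which is why frequent but uninformative words contribute little to classification.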

2.6 Classification with Support Vector Machine (SVM)

Data that has been weighted is then classified with the Support Vector Machine (SVM) algorithm. The concept of this classification method is to find the best hyperplane by measuring the margin around candidate hyperplanes so that the maximum margin is found [15]. A hyperplane is a subspace one dimension smaller than the surrounding space, used to separate data in three or more dimensions [16]. SVM can act as a non-linear classifier by operating in a vector space whose dimension is larger than the original feature space of the given dataset [3]. For this purpose, SVM provides kernel functions: linear, polynomial, RBF, and sigmoid [17]. This is one reason the method was chosen: in some sentiment analysis cases, the data being processed is not linearly separable. The equations of the kernel functions are as follows [17].

a. Linear

    F(x, x′) = x · x′    (4)

b. Polynomial

    F(x, x′) = [γ(x · x′) + r]^d    (5)

c. RBF

    F(x, x′) = exp(−γ ‖x − x′‖²)    (6)

d. Sigmoid

    F(x, x′) = tanh[γ(x · x′) + r]    (7)

Note:
x, x′ : the data to be classified
γ : gamma, with values from 0 to 1
d : degree, in the polynomial kernel
r : a constant
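Equations (4)–(7) translate directly into plain Python (a sketch; the parameter defaults below are arbitrary, not the values used in the study):

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)                                   # eq. (4)

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    return (gamma * dot(x, z) + r) ** d                # eq. (5)

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))  # ||x - z||^2
    return math.exp(-gamma * sq_dist)                  # eq. (6)

def sigmoid_kernel(x, z, gamma=0.5, r=0.0):
    return math.tanh(gamma * dot(x, z) + r)            # eq. (7)
```

In practice a kernel is selected through the `kernel` argument of an SVM implementation (e.g. scikit-learn's `SVC`), as in the hyperparameter search of Section 3.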

2.7 Evaluation

The performance of the built model can be evaluated with several parameters, such as accuracy, precision, recall, and F-measure, derived from the confusion matrix [18]. This evaluation method refers to True Positive (TP), data correctly predicted as positive; False Positive (FP), data incorrectly predicted as positive; True Negative (TN), data correctly predicted as negative; and False Negative (FN), data incorrectly predicted as negative. From these four terms, the confusion-matrix metrics can be calculated with the following equations [18]:

a. Accuracy is a statistical measurement of how well the model classifies correctly.

    Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100%    (8)

b. Precision is the ratio of the amount of data correctly labeled positive to the total amount of data predicted positive.

    Precision = TP / (TP + FP)    (9)

c. Recall is the ratio of the amount of data classified as positive to the total amount of data that is actually positive.

    Recall = TP / (TP + FN)    (10)

d. F-Measure is the harmonic mean of precision and recall.

    F-Measure = (2 × Precision × Recall) / (Precision + Recall)    (11)
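Equations (8)–(11) can be coded as follows (a minimal sketch for the binary case; for the three-class problem in this study, the metrics would be computed per class and averaged):

```python
def confusion_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100           # eq. (8), percent
    precision = tp / (tp + fp)                                 # eq. (9)
    recall = tp / (tp + fn)                                    # eq. (10)
    f_measure = 2 * precision * recall / (precision + recall)  # eq. (11)
    return accuracy, precision, recall, f_measure
```

For example, with TP=40, FP=10, TN=40, FN=10 the accuracy is 80% and precision, recall, and F-measure are all 0.8.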

2.8 Latent Dirichlet Allocation (LDA) Modeling

The Latent Dirichlet Allocation (LDA) method was used to model the discussion topics in this study. LDA is an unsupervised learning method that can explore and produce topics from a large number of documents, making it possible to identify which topic composition best represents the content of each document [8]. The concept of this method is that a document consists of several topics, and each topic consists of a distribution of words [19]. Once the LDA model is built, the topics are visualized using the pyLDAvis library to make analysis easier.

3. RESULT AND DISCUSSION

In this study, classification of sentiment in Indonesian texts was tested using the Support Vector Machine (SVM) algorithm with a linear kernel. The dataset contains 481 items grouped into three classes, -1, 0, and 1, representing negative, neutral, and positive sentiment respectively. From the data labeling process, 369 items with neutral sentiment, 72 with positive sentiment, and 40 with negative sentiment were obtained. The large number of neutral sentiments is due to the majority of statuses in the FAST group being about job vacancy information, invitations to events, and news intended for alumni. The test scenarios focused on the preprocessing stage and the addition of n-gram feature extraction. The first scenario aims to find the combination of preprocessing techniques that produces the best accuracy. The second scenario aims to determine the influence of unigrams and bigrams on the performance of the classification model. In each test scenario, hyperparameter tuning was performed using the GridSearchCV method with 10-Fold Cross-Validation, which aims to find the best combination of SVM hyperparameters so that the results obtained are maximal.
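The tuning step can be sketched as follows (illustrative synthetic data; the parameter grid is an assumption inferred from the best parameters reported in Tables 5 and 8, not the exact grid used in the study):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF feature matrix of the 481 statuses.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Assumed grid: linear kernel varies C only; RBF also varies gamma,
# mirroring the kinds of best parameters reported in Tables 5 and 8.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100]},
    {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.1, 0.4]},
]
search = GridSearchCV(SVC(), param_grid, cv=10)  # 10-fold cross-validation
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_params_` is the combination reported in the tables, and `search.best_estimator_` is the tuned model evaluated on the held-out test split.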

3.1 The Effect of Preprocessing Testing Result

In this scenario, testing is carried out at the preprocessing stage by experimenting with the effect of each preprocessing technique. The complete preprocessing pipeline (data cleaning, case folding, tokenizing, normalization, stemming, and stopword removal) was used for the first test; then testing without stopword removal; then with only data cleaning, case folding, and tokenizing (eliminating normalization, stemming, and stopword removal); and finally testing without any preprocessing. The model uses two split sizes, 70% train data with 30% test data and 80% train data with 20% test data, in order to observe the effect of the split ratio on the confusion-matrix values. In addition, TF-IDF and unigram feature extraction are applied in this sentiment classification model. The results of the first scenario can be seen in Table 4.

Table 4. The result of the preprocessing test

Preprocessing | Split size (train:test) | Accuracy | Precision | Recall | F1-Measure
Complete preprocessing | 70:30 | 78.47 | 81.59 | 44.82 | 48.26
Complete preprocessing | 80:20 | 79.17 | 59.78 | 38.15 | 38.41
Without stopword removal | 70:30 | 78.47 | 81.59 | 44.82 | 48.26
Without stopword removal | 80:20 | 79.17 | 59.78 | 38.15 | 38.41
Only data cleaning, case folding, and tokenizing | 70:30 | 75.17 | 47.40 | 40.12 | 39.57
Only data cleaning, case folding, and tokenizing | 80:20 | 75.26 | 58.25 | 39.39 | 38.77
No preprocessing | 70:30 | 75.86 | 80.92 | 41.78 | 42.86
No preprocessing | 80:20 | 76.29 | 91.84 | 41.62 | 43.11

Based on the experimental results in Table 4, it can be concluded that the split size and the different preprocessing techniques have little influence on accuracy, but a noticeable influence on the other confusion-matrix parameters: precision, recall, and F1-measure. Next, hyperparameter tuning was applied to search for the best combination of SVM parameters to obtain maximum results. The best parameters are shown in Table 5.

Table 5. Best SVM parameters based on GridSearchCV result

Preprocessing | Split size (train:test) | Best SVM parameters
Complete preprocessing | 70:30 | C=10, gamma=0.4, kernel=’rbf’
Complete preprocessing | 80:20 | C=10, gamma=0.1, kernel=’rbf’
Without stopword removal | 70:30 | C=10, kernel=’linear’
Without stopword removal | 80:20 | C=10, kernel=’linear’
Only data cleaning, case folding, and tokenizing | 70:30 | C=10, kernel=’linear’
Only data cleaning, case folding, and tokenizing | 80:20 | C=100, kernel=’linear’
No preprocessing | 70:30 | C=10, kernel=’linear’
No preprocessing | 80:20 | C=10, gamma=0.1, kernel=’rbf’

The effect of the best SVM parameters application to the performance measurement with confusion matrix can be seen in Table 6.

Table 6. The result of the preprocessing test after hyperparameter tuning

Preprocessing | Split size (train:test) | Accuracy | Precision | Recall | F1-Measure
Complete preprocessing | 70:30 | 78.47 | 77.84 | 47.84 | 51.97
Complete preprocessing | 80:20 | 80.21 | 71.41 | 50.56 | 52.15
Without stopword removal | 70:30 | 80.56 | 78.31 | 54.72 | 59.80
Without stopword removal | 80:20 | 79.17 | 62.63 | 62.08 | 56.93
Only data cleaning, case folding, and tokenizing | 70:30 | 75.86 | 63.66 | 47.20 | 50.69
Only data cleaning, case folding, and tokenizing | 80:20 | 78.35 | 73.66 | 50.37 | 55.15
No preprocessing | 70:30 | 75.17 | 61.61 | 45.54 | 48.60
No preprocessing | 80:20 | 77.32 | 73.15 | 46.40 | 49.73

Based on the experimental results in Table 6, hyperparameter tuning using GridSearchCV with 10-Fold Cross-Validation was able to increase the confusion-matrix values. Where a value decreased, the cause is that the parameter grid defined in the code does not cover every possible SVM hyperparameter combination; as a result, the selected combination can be worse than the default hyperparameters. Overall, however, this method increased the accuracy, precision, recall, and F1-measure values.

In this test of the effect of preprocessing techniques, the highest accuracy was obtained by the experiment without stopword removal, with a split of 70% train data and 30% test data, which resulted in a value of 80.56%. In contrast, the lowest accuracy occurred in the experiment without any preprocessing with the same 70:30 split, with a value of 75.17%. This shows the influence of each preprocessing technique, especially normalization and stemming, which are essential; stopword removal, however, is better left out. In normalization, words such as "tdk", "yg", and "nggak" are changed to "tidak", "yang", and "tidak". If these words were not normalized, they would be treated as different words even though they have the same meaning ("yg" would be treated differently from "yang"), causing the system to misclassify the sentence. With stemming, each word in a sentence is reduced to its base form, so an affixed word is treated the same as its base word; for example, "mendaftarkan" is returned to its base word, "daftar". Without stemming, "mendaftarkan" and "daftar" would be treated as different words and, as with skipping normalization, the system would misclassify. Stopword removal, on the other hand, is better not applied because it removes information from a sentence that affects the classification. For example, in the phrase "tidak kreatif", whose meaning is negative, stopword removal eliminates the word "tidak", and the remaining word "kreatif" tends to be positive, so the system classifies the sentence as positive instead of negative. These results are also supported by the lower accuracy of the tests that apply only data cleaning, case folding, and tokenizing compared to the tests using complete preprocessing.

3.2 The Effect of N-gram Feature Extraction Testing Result

In the second scenario, the test was performed by adding n-gram feature extraction, with the complete preprocessing pipeline and classification by the SVM method using a linear kernel. The purpose of this experiment is to determine the influence of unigrams and bigrams on the confusion-matrix values generated by the model. The results of the tests in this scenario can be seen in Table 7.

Table 7. The result of the n-gram test

N-Gram | Split size (train:test) | Accuracy | Precision | Recall | F1-Measure
Unigram | 70:30 | 78.47 | 81.59 | 44.82 | 48.26
Unigram | 80:20 | 79.17 | 59.78 | 38.15 | 38.41
Bigram | 70:30 | 77.08 | 81.22 | 42.26 | 43.90
Bigram | 80:20 | 78.12 | 59.50 | 36.39 | 35.49

Before further analysis, as in the previous scenario, the best SVM parameters were obtained by applying the GridSearchCV algorithm. The result of this process can be seen in Table 8.

Table 8. Best SVM parameters based on hyperparameter tuning result

N-Gram | Split size (train:test) | Best SVM parameters
Unigram | 70:30 | C=10, gamma=0.4, kernel=’rbf’
Unigram | 80:20 | C=10, gamma=0.1, kernel=’rbf’
Bigram | 70:30 | C=10, kernel=’linear’
Bigram | 80:20 | C=10, kernel=’linear’

The test results after applying the best SVM parameters based on 10-Fold Cross-Validation are shown in Table 9.

Table 9. The result of the n-gram test after hyperparameter tuning

N-Gram | Split size (train:test) | Accuracy | Precision | Recall | F1-Measure
Unigram | 70:30 | 78.47 | 77.84 | 47.84 | 51.97
Unigram | 80:20 | 80.21 | 71.41 | 50.56 | 52.15
Bigram | 70:30 | 77.08 | 67.48 | 50.25 | 53.48
Bigram | 80:20 | 78.12 | 60.15 | 49.66 | 50.76

Based on the experimental results in Table 9, it can be concluded that feature extraction using unigrams resulted in higher accuracy than bigrams, in both the 70:30 and 80:20 split scenarios. This is because far more single-word features are shared with the training data, whereas with bigrams the probability of the same two-word sequence appearing in the training data is lower. By applying unigrams in the scenario with 80% train data and 20% test data, with complete preprocessing (data cleaning, case folding, tokenizing, normalization, stemming, and stopword removal) and TF-IDF word weighting, the highest accuracy obtained is 80.21%, which means the model classifies the data quite well. However, based on the recall value, the ability of the model to find all the positive data is still low.

3.3 LDA Topic Modelling Result

The data that has passed through the complete preprocessing stage then continues to the topic modeling stage, applying the LDA algorithm. With this method, the dataset is analyzed and 10 topics are obtained; each topic has 10 related words. The results of the topic extraction can be seen in Table 10.

Table 10. Topic extraction result

Topic no. | Word distribution
0 | “indonesia”, “telco”, “telekomunikasi”, “iya”, “ismir”, “network”, “online”, “informasi”, “kerja”, “group”
1 | “telekomunikasi”, “alumni”, “daftar”, “indonesia”, “online”, “basalamah”, “fast”, “telco”, “riza”, “syafiq”
2 | “telekomunikasi”, “indonesia”, “wa”, “telkomsel”, “kerja”, “ajar”, “usaha”, “alumni”, “online”, “bantu”
3 | “indonesia”, “telekomunikasi”, “alumni”, “fast”, “telco”, “program”, “telkomsel”, “university”, “daftar”, “donasi”
4 | “telkomuniversity”, “telekomunikasi”, “fast”, “orang”, “alumni”, “indonesia”, “program”, “lomba”, “university”, “mahasiswa”
5 | “telekomunikasi”, “data”, “fast”, “alumni”, “university”, “science”, “telkomuniversity”, “link”, “acara”, “program”
6 | “telekomunikasi”, “alumni”, “daftar”, “indonesia”, “data”, “universitas”, “fast”, “kerja”, “meeting”, “join”
7 | “fast”, “telekomunikasi”, “alumni”, “indonesia”, “universitas”, “join”, “added”, “telcoindonesia”, “telkomuniversity”, “zoom”
8 | “fast”, “telekomunikasi”, “indonesia”, “daftar”, “added”, “votes”, “bantu”, “air”, “jakarta”, “mustel”
9 | “alumni”, “telekomunikasi”, “indonesia”, “informasi”, “program”, “bangun”, “university”, “kerja”, “digital”, “data”

Based on the extraction results, several words are mentioned often across topics. The most frequent is the word "telekomunikasi", mentioned in all 10 topics; it combines occurrences of "telekomunikasi" itself and of "telkom", merged by the normalization process. The next most frequent word is "indonesia", mentioned in 9 topics, followed by "alumni" in 8 topics and "fast" in 7 topics. The word "fast" here refers to the Telkom University Alumni Forum group. The results of the topic modeling are further visualized with an intertopic distance map that describes the relationship between topics and terms; the topic visualization is presented in Figure 2.

Figure 2. Topic Visualization

4. CONCLUSION

From the test scenarios carried out for sentiment and discussion topic analysis in the Telkom University alumni Facebook group, it can be concluded that the best performance of the classification model with the Support Vector Machine (SVM) method is obtained when using the combination of preprocessing techniques without stopword removal, with a split of 70% train data and 30% test data, after hyperparameter tuning using 10-Fold Cross-Validation, producing an accuracy of 80.56%. This is caused by the influence of the stopword removal process, which can remove information from the data by deleting certain words that are important for the system to learn, with the result that data is misclassified.

On the other hand, the best accuracy in testing the addition of n-gram feature extraction was obtained by applying unigrams, namely 80.21% in the experiment with 80% train data and 20% test data; with bigrams, each feature is a combination of two sequential words, so the probability of the same combination appearing in the training data is very small. The other confusion-matrix parameters, precision, recall, and F1-measure, provide additional insight into model performance because, given the class imbalance in this study, accuracy alone is not enough. Furthermore, topic modeling using Latent Dirichlet Allocation (LDA) produced the topics discussed by alumni within the group based on the extracted word distributions. In general, the topics discussed relate to telecommunications and Telkom, Indonesia, alumni, and FAST.

REFERENCES

[1] “Most used social media 2021 | Statista.” https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/ (accessed May 10, 2022).

[2] I. Dragan and R. Zota, “Collecting Facebook data for big data research,” 16th Networking in Education and Research RoEduNet International Conference, RoEduNet 2017 - Proceedings, Nov. 2017, doi: 10.1109/ROEDUNET.2017.8123757.

[3] D. N. Sotiropoulos, G. M. Giaglis, and D. E. Pournarakis, “SVM-based sentiment classification: a comparative study against state-of-the-art classifiers,” International Journal of Computational Intelligence Studies, vol. 6, no. 1, p. 52, 2017, doi: 10.1504/IJCISTUDIES.2017.10007054.

[4] Y. Handayani, A. R. Hakim, and Muljono, “Sentiment analysis of Bank BNI user comments using the support vector machine method,” Proceedings - 2020 International Seminar on Application for Technology of Information and Communication: IT Challenges for Sustainability, Scalability, and Security in the Age of Digital Disruption, iSemantic 2020, pp. 202–207, Sep. 2020, doi: 10.1109/ISEMANTIC50169.2020.9234230.

[5] J. H. Jaman and R. Abdulrohman, “Sentiment Analysis of Customers on Utilizing Online Motorcycle Taxi Service at Twitter with the Support Vector Machine,” ICECOS 2019 - 3rd International Conference on Electrical Engineering and Computer Science, Proceeding, pp. 231–234, Oct. 2019, doi: 10.1109/ICECOS47637.2019.8984483.

[6] U. Kumari, A. K. Sharma, and D. Soni, “Sentiment analysis of smart phone product review using SVM classification technique,” 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing, ICECDS 2017, pp. 1469–1474, Jun. 2018, doi: 10.1109/ICECDS.2017.8389689.

[7] H. Najadat, A. Al-Abdi, and Y. Sayaheen, “Model-based sentiment analysis of customer satisfaction for the Jordanian telecommunication companies,” 2018 9th International Conference on Information and Communication Systems, ICICS 2018, vol. 2018-January, pp. 233–237, May 2018, doi: 10.1109/IACS.2018.8355429.

[8] M. Choirul Rahmadan, A. Nizar Hidayanto, D. Swadani Ekasari, B. Purwandari, and Theresiawati, “Sentiment Analysis and Topic Modelling Using the LDA Method related to the Flood Disaster in Jakarta on Twitter,” Proceedings - 2nd International Conference on Informatics, Multimedia, Cyber, and Information System, ICIMCIS 2020, pp. 126–130, Nov. 2020, doi: 10.1109/ICIMCIS51567.2020.9354320.

[9] S. Khairunnisa, A. Adiwijaya, and S. al Faraby, “Pengaruh Text Preprocessing terhadap Analisis Sentimen Komentar Masyarakat pada Media Sosial Twitter (Studi Kasus Pandemi COVID-19),” JURNAL MEDIA INFORMATIKA BUDIDARMA, vol. 5, no. 2, pp. 406–414, Apr. 2021, doi: 10.30865/MIB.V5I2.2835.

[10] L. G. Irham, A. Adiwijaya, and U. N. Wisesty, “Klasifikasi Berita Bahasa Indonesia Menggunakan Mutual Information dan Support Vector Machine,” JURNAL MEDIA INFORMATIKA BUDIDARMA, vol. 3, no. 4, pp. 284–292, Oct. 2019, doi: 10.30865/MIB.V3I4.1410.

[11] M. A. Rosid, A. S. Fitrani, I. R. I. Astutik, N. I. Mulloh, and H. A. Gozali, “Improving Text Preprocessing For Student Complaint Document Classification Using Sastrawi,” IOP Conference Series: Materials Science and Engineering, vol. 874, no. 1, p. 012017, Jun. 2020, doi: 10.1088/1757-899X/874/1/012017.

[12] S. Fahmi, L. Purnamawati, G. F. Shidik, M. Muljono, and A. Z. Fanani, “Sentiment analysis of student review in learning management system based on sastrawi stemmer and SVM-PSO,” Proceedings - 2020 International Seminar on Application for Technology of Information and Communication: IT Challenges for Sustainability, Scalability, and Security in the Age of Digital Disruption, iSemantic 2020, pp. 643–648, Sep. 2020, doi: 10.1109/ISEMANTIC50169.2020.9234291.

[13] D. E. Cahyani and I. Patasik, “Performance comparison of TF-IDF and Word2Vec models for emotion text classification,” Bulletin of Electrical Engineering and Informatics, vol. 10, no. 5, pp. 2780–2788, Oct. 2021, doi: 10.11591/EEI.V10I5.3157.

[14] R. Kustiawan, A. Adiwijaya, and M. D. Purbolaksono, “A Multi-label Classification on Topic of Hadith Verses in Indonesian Translation using CART and Bagging,” JURNAL MEDIA INFORMATIKA BUDIDARMA, vol. 6, no. 2, pp. 868–875, Apr. 2022, doi: 10.30865/MIB.V6I2.3787.

[15] E. F. Saraswita, D. P. Rini, and A. Abdiansah, “Analisis Sentimen E-Wallet di Twitter Menggunakan Support Vector Machine dan Recursive Feature Elimination,” JURNAL MEDIA INFORMATIKA BUDIDARMA, vol. 5, no. 4, pp. 1195–1200, Oct. 2021, doi: 10.30865/MIB.V5I4.3118.

[16] A. Kowalczyk, Support Vector Machines Succinctly. 2017. Accessed: Jun. 20, 2022. [Online]. Available: https://www.syncfusion.com/succinctly-free-ebooks/support-vector-machines-succinctly

[17] “1.4. Support Vector Machines - scikit-learn 1.0.2 documentation.” https://scikit-learn.org/stable/modules/svm.html#kernel-functions (accessed May 12, 2022).

[18] E. Tyagi and A. K. Sharma, “Sentiment Analysis of Product Reviews using Support Vector Machine Learning Algorithm,” Indian Journal of Science and Technology, vol. 10, no. 35, pp. 1–9, Jun. 2017, doi: 10.17485/IJST/2017/V10I35/118965.

[19] “Latent Dirichlet Allocation (LDA).” https://socs.binus.ac.id/2018/11/29/latent-dirichlet-allocation-lda/ (accessed May 18, 2022).
