IMPLEMENTATION AND RESULTS


5.1. Implementation

5.1.1. Data Collection and Import Library

1. from sklearn.feature_extraction.text import TfidfVectorizer
2. import pandas as pd
3. import numpy as np
4. from sklearn.metrics import classification_report, confusion_matrix
5. import nltk
6. from nltk.corpus import stopwords
7. from sklearn.model_selection import train_test_split
8. from sklearn.svm import LinearSVC
9. from sklearn.metrics import accuracy_score
10. df = pd.read_csv('data/tripadvisor_hotel_reviews.csv', encoding="ISO-8859-1")

In line 1, I import TfidfVectorizer from sklearn to perform the TF-IDF calculation. In lines 2 and 3, I import pandas and NumPy, which are used to read the CSV data, reshape data, and create arrays. In line 4, I import classification_report and confusion_matrix from sklearn to compute precision, recall, accuracy, and F1-score. In lines 5 and 6, I import nltk and its stopwords corpus, which provides the list of stopwords. In line 7, I import train_test_split from sklearn, which divides the dataset into training data and testing data in a chosen proportion. In line 8, I import LinearSVC, which I use to classify the processed data. In line 9, I import accuracy_score from sklearn to calculate the accuracy of the classification. In line 10, I create the df variable, which holds the dataset read from the CSV file with the pandas read_csv function.
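The effect of the encoding argument of read_csv can be sketched on a small in-memory file. The file contents below are hypothetical and are not taken from the dataset; the sketch only illustrates how UTF-8 bytes read as ISO-8859-1 can produce stray characters such as "Â", which may be why that character appears in the punctuation list of the next section.

```python
import io

import pandas as pd

# Hypothetical two-column file standing in for the review dataset.
# The non-breaking space (U+00A0) is stored in UTF-8 as the bytes C2 A0.
raw_bytes = 'Review,Rating\n"nice\u00a0hotel",4\n'.encode("utf-8")

# Decoding those bytes as ISO-8859-1 maps C2 to "Â" and A0 to a non-breaking space.
df_demo = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
review = df_demo.loc[0, "Review"]
print(repr(review))
```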

5.1.2. Preprocessing Data

11. nltk.download('stopwords')
12. stopwords_list = set(stopwords.words("english"))
13. punctuations = """!()-![]{};:,+'"\,<>./?@#$%^&*_~Â"""
14. def splitReviews(review):
15.     splitReview = review.split()
16.     parsedReview = " ".join([word.translate(str.maketrans('', '', punctuations)) + " " for word in splitReview])
17.     return parsedReview
18. def clean_review(review):
19.     clean_words = []
20.     splitReview = review.split()
21.     for w in splitReview:
22.         if w.isalpha() and w not in stopwords_list:
23.             clean_words.append(w.lower())
24.     clean_review = " ".join(clean_words)
25.     return clean_review
26. df["Review"] = df["Review"].apply(splitReviews).apply(clean_review)
27. print(df.head())

In line 11, I download the stopword list with the nltk library, which I later use for stopword removal. In line 12, I create the stopwords_list variable, which holds the English stopwords converted to a set. In line 13, I create the punctuations variable, which holds the punctuation characters to be removed. In lines 14 to 17, I define a function called splitReviews, which is applied to every review in the dataset. It splits the review into words, removes from each word the characters listed in the punctuations variable, and returns the rejoined text.

In lines 18 to 25, I define a function called clean_review, whose review parameter is a review sentence already processed by splitReviews. Inside the function, I create an empty list named clean_words to store the words that pass the cleaning step. The splitReview variable holds all the words in the review sentence, and a for loop iterates over them. Inside the loop, an if condition checks whether the word consists only of alphabetic characters and is not in stopwords_list, which contains words that carry little meaning. Words that satisfy the condition are converted to lowercase and appended to clean_words.

The words are then joined with a space (" ") into the clean_review variable, which the function returns. In line 26, I apply the two functions, splitReviews and clean_review, to the review column and store the result in df["Review"]. In line 27, I display the first few rows of the cleaned data.
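The two cleaning steps can be illustrated on a single sentence. This is a minimal sketch, not the listing itself: the function names, the sample sentence, and the small hand-written stopword set are assumptions, and string.punctuation stands in for the punctuations variable. Unlike the listing, the sketch lowercases each word before the stopword check, so capitalized stopwords such as "The" are also removed.

```python
import string

# Assumption: a tiny hand-written stopword set instead of nltk's English list.
STOPWORDS = {"the", "was", "and", "a", "very"}

def strip_punctuation(review):
    """Remove ASCII punctuation from every whitespace-separated token."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(token.translate(table) for token in review.split())

def remove_stopwords(review):
    """Lowercase first, then keep alphabetic tokens that are not stopwords."""
    tokens = (token.lower() for token in review.split())
    return " ".join(t for t in tokens if t.isalpha() and t not in STOPWORDS)

sample = "The room was clean, and the staff very friendly!"
cleaned = remove_stopwords(strip_punctuation(sample))
print(cleaned)
```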


28. df['Rating'] = df['Rating'].astype(int)
29. conditions = [
30.     df.Rating >= 4,
31.     df.Rating == 3,
32.     df.Rating <= 2,
33. ]
34. values = ['2', '1', '0']
35. df['label'] = np.select(conditions, values)
36. print(df['label'].value_counts())
37. df = df.sample(frac=1).reset_index(drop=True)

At this stage, I divide the reviews into positive, neutral, and negative sentiment, in accordance with previous research [1]. In line 28, I convert the rating column to integers. In lines 29 to 33, I define the sentiment conditions, following previous research [1]: a rating of 4 or higher is positive, 3 is neutral, and 2 or lower is negative. The conditions are stored as a list in the conditions variable. In line 34, I create the values variable, a list that holds the codes for the positive (2), neutral (1), and negative (0) categories. In line 35, I map each rating to its code and store the result in df["label"]. In line 36, I display the number of reviews per label. In line 37, I shuffle the rows of the processed data.
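The rating-to-label mapping can be sketched on a handful of ratings. The data frame below is hypothetical. The sketch uses integer codes and an explicit default of -1 (both assumptions, not taken from the listing) so that the choices and the default value passed to np.select share one dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; the real dataset has one rating per review.
ratings = pd.DataFrame({"Rating": [5, 4, 3, 2, 1]})

conditions = [
    ratings["Rating"] >= 4,   # positive
    ratings["Rating"] == 3,   # neutral
    ratings["Rating"] <= 2,   # negative
]
# Integer codes (an assumption) keep the choices and the default in one dtype.
labels = [2, 1, 0]

# The first matching condition decides the label; -1 would mark an unmatched row.
ratings["label"] = np.select(conditions, labels, default=-1)
print(ratings["label"].tolist())
```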

5.1.3. Data Extraction

38. print("TFIDF Vectorizer……")
39. vectorizer = TfidfVectorizer()
40. X = vectorizer.fit_transform(df['Review'])
41. print(pd.DataFrame(X))
42. y = df['label']

At this stage, I perform feature extraction, which turns each review into a numeric vector of term weights.

In line 38, I print the words "TFIDF Vectorizer" to mark this section in the output. In line 39, I create a TfidfVectorizer object in the vectorizer variable, which I use to carry out feature extraction. In line 40, I create the variable X to hold the result of the TF-IDF vectorization of the preprocessed reviews. In line 41, I display the result of the TF-IDF vectorization as a pandas DataFrame. In line 42, I assign the label column, in which the ratings have been grouped into positive (2), neutral (1), and negative (0), to the variable y.
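The shape of the TF-IDF output can be illustrated on a toy corpus. The three reviews below are hypothetical; the sketch shows that fit_transform returns one row per review and one column per vocabulary term, and that a term occurring in every review receives the lowest inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned reviews.
reviews = [
    "room clean staff friendly",
    "room dirty staff rude",
    "great location clean room",
]

vectorizer = TfidfVectorizer()
X_demo = vectorizer.fit_transform(reviews)   # sparse matrix: reviews x vocabulary

print(X_demo.shape)
print(sorted(vectorizer.vocabulary_))

# Map each term to its inverse document frequency.
idf = {term: vectorizer.idf_[index] for term, index in vectorizer.vocabulary_.items()}
# "room" occurs in all three reviews, so its IDF is the lowest.
print(min(idf, key=idf.get))
```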

5.1.4. Data Selection (Chi – Square)

43. from sklearn.feature_selection import SelectKBest
44. from sklearn.feature_selection import chi2
45. chi2_features = SelectKBest(chi2, k = 5000)
46. X_kbest_features = chi2_features.fit_transform(X, y)
47. print("CHI SQUARE RESULT")
48. print(X_kbest_features)

For this feature selection, I use chi-square through the sklearn.feature_selection module, importing SelectKBest and chi2 in lines 43 and 44. SelectKBest ranks the features in the dataset by their score with respect to the target variable and keeps the highest-scoring ones. In line 45, I create the chi2_features variable, which holds a SelectKBest object configured with chi2 as the scoring function. The k parameter sets the number of features kept after selection. The listing above shows k = 5000; I tried five values of k: 1000, 2000, 3000, 4000, and 5000. In line 46, I create the X_kbest_features variable, which holds the result of applying chi2_features through fit_transform. Its arguments are X, the review data after preprocessing and feature extraction, and y, the rating data grouped into positive (2), neutral (1), and negative (0). In lines 47 and 48, I display the result of the chi-square selection.
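How SelectKBest with chi2 keeps the k highest-scoring columns can be sketched on a small matrix. The data and the value k = 2 are hypothetical and chosen so that the outcome is easy to check by hand: two columns depend on the class and one is constant.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative feature matrix (chi2 requires values >= 0):
# column 0 is high only for class 1, column 1 only for class 0,
# column 2 is constant and carries no class information.
X_demo = np.array([
    [0.0, 5.0, 1.0],
    [0.0, 4.0, 1.0],
    [5.0, 0.0, 1.0],
    [4.0, 0.0, 1.0],
])
y_demo = np.array([0, 0, 1, 1])

selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

print(selector.scores_)                    # one chi-square score per column
print(selector.get_support(indices=True))  # indices of the k columns kept
```

The constant column scores zero and is dropped, which is the same mechanism that reduces the TF-IDF vocabulary to k terms.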

5.1.5. Data Selection (Information Gain)

49. from sklearn.feature_selection import mutual_info_classif
50. mi_score = mutual_info_classif(X, y)
51. print("INFORMATION GAIN")
52. print(mi_score)
53. mi_score_selected_index = np.where(mi_score > -0.3)[0]
54. Xinfo = X[:, mi_score_selected_index]


For the second feature selection, I use information gain, again with the review data and label data produced by preprocessing. In line 49, I import mutual_info_classif, which I use to compute information gain. In line 50, I create the mi_score variable, which holds the result of mutual_info_classif, where X is the review data and y is the label data categorized as positive, neutral, or negative. In line 51, I print the words "INFORMATION GAIN" to mark this section in the output. In line 52, I display the scores computed in line 50. In line 53, I create the mi_score_selected_index variable, which holds the indices of the features whose information gain is greater than -0.3. Apart from -0.3, I also tried other thresholds to find the best accuracy: -0.1, -0.2, -0.4, and -0.5. In line 54, I store the review data for the features with an information gain greater than -0.3 in the variable Xinfo.
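The threshold-based selection can be sketched on a toy matrix. The data, the variable names, and the positive cutoff of 0.1 are hypothetical. Because mutual information is non-negative, the negative cutoff keeps every column of this toy matrix; the positive cutoff is included only to show a column being dropped.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical binary term-presence matrix: column 0 mirrors the label,
# column 1 is unrelated to it.
X_demo = np.array([
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 1],
])
y_demo = np.array([1, 1, 0, 0])

mi_demo = mutual_info_classif(X_demo, y_demo, discrete_features=True, random_state=0)

keep_negative_cutoff = np.where(mi_demo > -0.3)[0]   # cutoff used in the text
keep_positive_cutoff = np.where(mi_demo > 0.1)[0]    # hypothetical positive cutoff

print(mi_demo)
print(keep_negative_cutoff, keep_positive_cutoff)
```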

5.1.6. Classification

55. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 2)
56. print("Train: ", X_train.shape, y_train.shape, "Test: ", (X_test.shape, y_test.shape))
57. clf = LinearSVC(random_state=0)
58. clf.fit(X_train, y_train)
59. y_pred = clf.predict(X_test)

In this section, I classify the data. The first step is to divide the data into several train/test compositions, as in previous research [1]; here I explain only one of them. In line 55, I divide the data into training and testing sets with the train_test_split function. Its first argument is X, the processed review data, and its second is y, the label data, that is, the ratings converted to positive, neutral, and negative sentiment categories. The test_size parameter gives the share of the data used for testing as a fraction between 0 and 1. In the example above, test_size is 0.20, so 20% of the dataset is used for testing and 80% for training. The random_state parameter is a seed, so the same split is produced on every run. In line 56, I display the shapes of the sets produced by train_test_split. In line 57, I create the clf variable, which holds the LinearSVC classifier.

In line 58, I train clf with the fit function, where X_train is the training portion of the review data and y_train is the training portion of the label data. In line 59, I create the y_pred variable, which holds the predictions of clf for X_test, the testing portion of the review data.
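The split, fit, and predict sequence can be sketched on synthetic data. The two Gaussian clusters below are hypothetical and only stand in for the TF-IDF matrix; the test_size and random_state values are the ones used in the listing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical, well-separated data: 50 samples per class, 2 features.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=2.0, scale=0.5, size=(50, 2)),
])
y_demo = np.array([0] * 50 + [1] * 50)

# 20% of the rows go to the test set; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=2
)

clf = LinearSVC(random_state=0)
clf.fit(X_train, y_train)        # learn from the training rows and labels
y_pred = clf.predict(X_test)     # predict labels for the held-out rows

print("Train:", X_train.shape, "Test:", X_test.shape)
```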

5.1.7. Confusion Matrix

60. print("Confusion Matrix SVM")
61. print(confusion_matrix(y_test, y_pred))
62. print(classification_report(y_test, y_pred))
63. acu_svm = accuracy_score(y_test, y_pred)
64. print('AKURASI SVM: %.3f' % acu_svm)

In this section, I calculate accuracy, precision, recall, and F1-score. In line 60, I print the words "Confusion Matrix SVM" to mark this section in the output.

In line 61, I display the confusion matrix of the classification with the confusion_matrix function. In line 62, I call the classification_report function with y_test, the true labels of the testing data, and y_pred, the predicted labels of the testing data; it displays the precision, recall, F1-score, and accuracy of the classification. In line 63, I create the acu_svm variable, which stores the accuracy computed by accuracy_score with the same arguments as classification_report. In line 64, I display the accuracy stored in acu_svm.
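The per-label TP, TN, FP, and FN counts reported in the result tables can be derived from a multiclass confusion matrix. The sketch below shows one common way to do this on hypothetical labels; the source does not state how its counts were computed, so this derivation is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = negative, 1 = neutral, 2 = positive).
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_hat = [0, 2, 1, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])  # rows: true, columns: predicted

tp = np.diag(cm)                 # correctly predicted samples of each class
fp = cm.sum(axis=0) - tp         # predicted as the class, but actually another
fn = cm.sum(axis=1) - tp         # actually the class, but predicted as another
tn = cm.sum() - tp - fp - fn     # all remaining samples

print(cm)
print(tp, tn, fp, fn)
```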

5.2. Results

I tested several hyperparameter values for both feature selection methods, using 20% of the data for testing and 80% for training. First, I tested the chi-square method with k = 1000, 2000, 3000, 4000, and 5000, with the results below:


Table 5.1. Hyperparameter Test on Chi-Square Results

Hyperparameter  Label     TP    TN    FP   FN   Precision  Recall  F1-Score  Accuracy
1000            Negative  518   3318  133  130  0.80       0.80    0.80      86.6%
                Neutral   71    3625  64   339  0.53       0.17    0.26
                Positive  2960  705   353  81   0.89       0.97    0.93
2000            Negative  513   3317  134  135  0.79       0.79    0.79      86.8%
                Neutral   83    3626  63   327  0.57       0.20    0.30
                Positive  2963  715   343  78   0.90       0.97    0.93
3000            Negative  509   3319  136  135  0.79       0.79    0.79      87.1%
                Neutral   86    3629  58   326  0.60       0.21    0.31
                Positive  2975  721   335  68   0.90       0.98    0.94
4000            Negative  547   3285  135  132  0.80       0.81    0.80      86.5%
                Neutral   91    3592  72   344  0.56       0.21    0.30
                Positive  2909  769   345  76   0.89       0.97    0.93
5000            Negative  509   3323  128  139  0.80       0.79    0.79      86.4%
                Neutral   72    3621  68   338  0.51       0.18    0.26
                Positive  2963  699   359  78   0.89       0.97    0.93

The table above shows that the chi-square method with k = 3000 achieves the highest accuracy, 87.1%. Therefore, k = 3000 is used for the chi-square method in the remaining experiments. Next, I tested the hyperparameter of the Information Gain method, using thresholds of -0.1, -0.2, -0.3, -0.4, and -0.5 and the same data composition as in the chi-square test, 20% testing data and 80% training data, with the results below:

Table 5.2. Hyperparameter Test on Information Gain Results

Hyperparameter  Label     TP    TN    FP   FN   Precision  Recall  F1-Score  Accuracy
>= -0.1         Negative  531   3279  141  148  0.79       0.78    0.79      85.7%
                Neutral   99    3549  115  336  0.46       0.23    0.31
                Positive  2882  783   331  103  0.90       0.97    0.93
>= -0.2         Negative  505   3314  142  138  0.78       0.79    0.78      85.9%
                Neutral   84    3560  102  353  0.45       0.19    0.27
                Positive  2933  747   333  86   0.90       0.97    0.93
>= -0.3         Negative  499   3344  126  130  0.80       0.79    0.80      86.2%
                Neutral   92    3567  107  333  0.46       0.22    0.29
                Positive  2942  721   333  103  0.90       0.97    0.93
>= -0.4         Negative  505   3314  142  138  0.78       0.79    0.78      85.8%
                Neutral   84    3560  102  353  0.45       0.19    0.27
                Positive  2933  747   333  86   0.90       0.97    0.93
>= -0.5         Negative  499   3316  135  149  0.79       0.77    0.78      85.6%
                Neutral   78    3583  106  332  0.42       0.19    0.26
                Positive  2930  707   351  111  0.89       0.96    0.93

The table above shows that the Information Gain method with a threshold of -0.3 achieves the highest accuracy, 86.2%. Therefore, a threshold of -0.3 is used for the Information Gain feature selection method in the remaining experiments.
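The hyperparameter tests above can be sketched as a loop over candidate values of k. This is a minimal sketch under stated assumptions: the synthetic data from make_classification stands in for the TF-IDF matrix, and the grid of k values is hypothetical and far smaller than the 1000 to 5000 used in the text. As in the listing, the selector is fitted on the full matrix before the split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the TF-IDF matrix: 300 samples, 40 features, 3 classes.
X_raw, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=8, n_classes=3, random_state=0
)
X_demo = np.abs(X_raw)   # chi2 needs non-negative values

accuracy_by_k = {}
for k in (5, 10, 20, 40):   # hypothetical grid of feature counts
    X_k = SelectKBest(chi2, k=k).fit_transform(X_demo, y_demo)
    X_train, X_test, y_train, y_test = train_test_split(
        X_k, y_demo, test_size=0.20, random_state=2
    )
    clf = LinearSVC(random_state=0).fit(X_train, y_train)
    accuracy_by_k[k] = accuracy_score(y_test, clf.predict(X_test))

# The value of k with the highest test accuracy is carried forward.
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, best_k)
```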

The experiments above use the Support Vector Machine algorithm with Information Gain and Chi-Square feature selection, with the aim of improving on the accuracy of previous research [1]. As a baseline, the results of sentiment analysis with the Support Vector Machine alone, without feature selection, as in previous research [1], are as follows:

Table 5.3. Support Vector Machine Only Results

Test  Label     TP    TN     FP    FN    Precision  Recall  F1-Score  Accuracy
1     Negative  497   3307   155   140   0.76       0.78    0.77      85.6%
      Neutral   90    3573   86    350   0.51       0.20    0.29
      Positive  2922  728    349   100   0.89       0.97    0.93
2     Negative  778   4952   194   224   0.80       0.78    0.79      85.5%
      Neutral   131   5343   149   525   0.47       0.20    0.28
      Positive  4349  1111   547   141   0.89       0.97    0.93
3     Negative  981   6635   290   291   0.77       0.77    0.77      85.3%
      Neutral   157   7132   175   733   0.47       0.18    0.26
      Positive  5853  1421   741   182   0.89       0.97    0.93
4     Negative  1203  8265   370   408   0.76       0.75    0.76      85.0%
      Neutral   191   8929   234   892   0.45       0.18    0.25
      Positive  7310  1756   938   242   0.89       0.97    0.93
5     Negative  1470  9928   408   489   0.78       0.75    0.77      84.8%
      Neutral   229   10688  281   1097  0.45       0.17    0.25
      Positive  8733  2111   1174  277   0.88       0.97    0.92

The table above shows that the Support Vector Machine produces fairly good accuracy, ranging from 84.8% to 85.6%. The Support Vector Machine algorithm without feature selection, as in previous research [1], therefore obtains an average accuracy of 85.24%. The results also show that accuracy rises as the share of training data increases. Precision and recall on the neutral label are low because only a few reviews are neutral after labeling; because this class is small, it does not affect the overall findings of this study.

The five experiments above use the same dataset distribution as previous research [1], described in the previous section. Next, I present the results of sentiment analysis using the Support Vector Machine with Chi-Square feature selection:

Table 5.4. Support Vector Machine with Chi-Square Feature Selection Results

Test  Label     TP    TN     FP    FN    Precision  Recall  F1-Score  Accuracy
1     Negative  509   3319   136   135   0.79       0.79    0.79      87.1%
      Neutral   86    3629   58    326   0.60       0.21    0.31
      Positive  2975  721    335   68    0.90       0.98    0.94
2     Negative  779   4961   194   214   0.80       0.78    0.79      87.0%
      Neutral   130   5434   99    485   0.57       0.21    0.31
      Positive  4437  1099   509   103   0.90       0.98    0.94
3     Negative  1032  6630   262   273   0.80       0.79    0.79      86.7%
      Neutral   173   7211   134   679   0.56       0.20    0.30
      Positive  5900  1461   696   140   0.89       0.98    0.93
4     Negative  1230  8351   337   328   0.78       0.79    0.79      86.5%
      Neutral   227   8986   166   867   0.58       0.21    0.31
      Positive  7409  1777   877   185   0.89       0.98    0.93
5     Negative  1534  9912   395   454   0.80       0.77    0.78      86.1%
      Neutral   241   10808  178   1068  0.58       0.18    0.28
      Positive  8808  2158   1139  190   0.89       0.98    0.93

The table above shows the results of sentiment analysis using the Support Vector Machine algorithm with Chi-Square feature selection. Compared with the first process, the Support Vector Machine without feature selection (Table 5.3), accuracy increases. The accuracy ranges from 86.1% to 87.1%, with an average of 86.68%. As with the first method, precision and recall on the neutral label are low because only a few reviews are neutral after labeling, so this does not affect the overall findings of this study. As with the Support Vector Machine without feature selection, accuracy rises as the share of training data increases. Next, I present the results of the Support Vector Machine algorithm with Information Gain feature selection:


Table 5.5. Support Vector Machine with Information Gain Results

Test  Label     TP    TN     FP    FN    Precision  Recall  F1-Score  Accuracy
1     Negative  499   3344   126   130   0.80       0.79    0.80      86.2%
      Neutral   92    3567   107   333   0.46       0.22    0.29
      Positive  2942  721    333   103   0.90       0.97    0.93
2     Negative  750   4998   186   214   0.80       0.78    0.79      86.0%
      Neutral   146   5340   151   511   0.49       0.22    0.31
      Positive  4391  1097   524   136   0.89       0.97    0.93
3     Negative  1014  6624   283   276   0.78       0.79    0.78      85.9%
      Neutral   167   7143   177   710   0.49       0.19    0.27
      Positive  5857  1468   699   173   0.89       0.97    0.93
4     Negative  1217  8289   321   419   0.79       0.74    0.77      85.5%
      Neutral   107   8956   229   854   0.47       0.20    0.28
      Positive  7335  1760   937   214   0.89       0.97    0.93
5     Negative  1473  9968   406   448   0.78       0.77    0.78      85.3%
      Neutral   230   10714  246   1105  0.48       0.17    0.25
      Positive  8785  2101   1155  254   0.88       0.97    0.93

The table above shows the classification results of the sentiment analysis using the Support Vector Machine algorithm with Information Gain feature selection. Compared with Table 5.3, the classification with the Support Vector Machine algorithm without feature selection, the accuracy increases. The accuracy ranges from 85.3% to 86.2%, with an average of 85.78%. As with the first and second methods, precision and recall on the neutral label are low because only a few reviews are neutral after labeling, so this does not affect the overall findings of this study. In this process too, accuracy rises as the share of training data increases.

Table 5.6. Accuracy Comparison of All Methods

Test  No Feature Selection  Chi-Square  Information Gain
1     85.6%                 87.1%       86.2%
2     85.5%                 87.0%       86.0%
3     85.3%                 86.7%       85.9%
4     85.0%                 86.5%       85.5%
5     84.8%                 86.1%       85.3%
MEAN  85.24%                86.68%      85.78%
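The MEAN row can be reproduced from the five test accuracies of each method. The sketch below copies the values from Table 5.6; the dictionary layout and the gain_over_baseline name are illustrative assumptions.

```python
# Accuracy values (in percent) copied from Table 5.6.
accuracy = {
    "No Feature Selection": [85.6, 85.5, 85.3, 85.0, 84.8],
    "Chi-Square": [87.1, 87.0, 86.7, 86.5, 86.1],
    "Information Gain": [86.2, 86.0, 85.9, 85.5, 85.3],
}

# Arithmetic mean over the five tests, rounded to two decimals as in the table.
mean_accuracy = {name: round(sum(v) / len(v), 2) for name, v in accuracy.items()}

# Difference in mean accuracy relative to the run without feature selection.
gain_over_baseline = {
    name: round(mean_accuracy[name] - mean_accuracy["No Feature Selection"], 2)
    for name in ("Chi-Square", "Information Gain")
}
print(mean_accuracy)
print(gain_over_baseline)
```

On these figures, chi-square selection raises the mean accuracy by 1.44 percentage points and information gain by 0.54 percentage points.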


Table 5.7. F1-Score Comparison of All Methods

Test  No Feature Selection  Chi-Square  Information Gain
1     0.66                  0.68        0.67
2     0.67                  0.68        0.68
3     0.65                  0.67        0.66
4     0.64                  0.67        0.66
5     0.64                  0.66        0.65
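The source does not state how the F1-scores in Table 5.7 are aggregated over the three labels. The Test 1 values are consistent with an unweighted (macro) average of the per-label F1-scores, which the sketch below checks; treating the table as a macro average is therefore an assumption.

```python
# Per-label F1-scores for Test 1, copied from Tables 5.3, 5.4, and 5.5
# in the order negative, neutral, positive.
f1_per_class = {
    "No Feature Selection": [0.77, 0.29, 0.93],
    "Chi-Square": [0.79, 0.31, 0.94],
    "Information Gain": [0.80, 0.29, 0.93],
}

# Macro average: every label counts equally, regardless of how many reviews it has.
macro_f1 = {name: round(sum(v) / len(v), 2) for name, v in f1_per_class.items()}
print(macro_f1)
```

Under a macro average the small neutral class weighs as much as the large positive class, which would explain why these scores are far below the accuracies in Table 5.6.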

The comparison above shows that sentiment classification with the Support Vector Machine algorithm without feature selection has the lowest accuracy, compared with chi-square and information gain feature selection. The F1-score on the neutral label is low for all methods because few reviews carry the neutral label after labeling. Precision is the number of true positives divided by the sum of true positives and false positives; because the number of true positives on the neutral label is low, the precision of the neutral label is low.
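The formulas can be checked against one row of the results. The sketch below recomputes precision, recall, F1-score, and accuracy from the Test 1 counts in Table 5.4; the function name and dictionary layout are illustrative assumptions.

```python
# Counts for Test 1 with chi-square selection, copied from Table 5.4.
counts = {
    "negative": {"tp": 509, "fp": 136, "fn": 135},
    "neutral": {"tp": 86, "fp": 58, "fn": 326},
    "positive": {"tp": 2975, "fp": 335, "fn": 68},
}

def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1-score for one label."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label, c in counts.items():
    p, r, f1 = precision_recall_f1(**c)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Accuracy: correctly classified reviews over all reviews in the test set.
total = sum(c["tp"] + c["fn"] for c in counts.values())
accuracy = sum(c["tp"] for c in counts.values()) / total
print(f"accuracy={accuracy:.3f}")
```

For the neutral label this gives a precision of 0.60, a recall of 0.21, and an F1-score of 0.31, matching the table. The 326 false negatives lower recall much more than the 58 false positives lower precision, so the low neutral F1-score is driven mainly by recall.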

In this study, chi-square feature selection is the best method, with the highest accuracy of the three. Comparing the three configurations, no feature selection, chi-square, and information gain, I conclude that chi-square achieves the highest accuracy, followed by information gain and then no feature selection. This shows that feature selection improves on the accuracy of previous research [1].