i
PROJECT REPORT
CHI - SQUARE AND INFORMATION GAIN FEATURE SELECTION FOR HOTEL REVIEW SENTIMENT ANALYSIS USING SUPPORT VECTOR MACHINE
Faculty of Computer Science Soegijapranata Catholic University
2022
NATHANAEL KARUNIA
18.K1.0027
HALAMAN PENGESAHAN
Judul Tugas Akhir: : CHI - SQUARE AND INFORMATION GAIN FEATURE SELECTION FOR HOTEL REVIEW SENTIMENT ANALYSIS USING SUPPORT VECTOR MACHINE
Diajukan oleh : Nathanael Karunia Wibowo
NIM : 18.K1.0027
Tanggal disetujui : 24 Mei 2022 Telah setujui oleh
Pembimbing : Yonathan Purbo Santosa S.Kom., M.Sc Penguji 1 : Yonathan Purbo Santosa S.Kom., M.Sc Penguji 2 : Yulianto Tejo Putranto S.T., M.T.
Penguji 3 : R. Setiawan Aji Nugroho S.T., MCompIT., Ph.D Penguji 4 : Y.b. Dwi Setianto S.T., M.Cs.
Penguji 5 : Rosita Herawati S.T., M.I.T.
Penguji 6 : Hironimus Leong S.Kom., M.Kom.
Ketua Program Studi : Rosita Herawati S.T., M.I.T.
Dekan : Dr. Bernardinus Harnadi S.T., M.T.
Halaman ini merupakan halaman yang sah dan dapat diverifikasi melalui alamat di bawah ini.
sintak.unika.ac.id/skripsi/verifikasi/?id=18.K1.0027
iv
STATEMENT PAGE OF SCIENTIFIC PUBLICATIONS FOR ACADEMIC INTEREST
I, the undersigned:
Name : NATHANAEL KARUNIA
Study Program : Informatics Engineering
Faculty : Computer Science
Type of Work : Thesis
Approve to grant Soegijapranata Catholic University Semarang Non-exclusive Royalty- Free Rights for the scientific works entitled “CHI – SQUARE AND INFORMATION GAIN FEATURE SELECTION FOR HOTEL REVIEW SENTIMENT ANALYSIS USING SUPPORT VECTOR MACHINE” along with existing tools (if needed). With these rights, database form, maintain, and publish this final project as long as it includes my name as the writer/creator and as the copyright owner.
This statement was made truthfully.
Semarang, Juni, 4, 2022
NATHANAEL KARUNIA 18.K1.0027
v
ACKNOWLEDGMENT
All praise and gratitude to God Almighty who has given His grace and grace to the author, so that the author can complete this Final Project Report. The writing of the Final Project Report with the title "Chi Square And Information Gain Feature Selection For Hotel Review Sentiment Analysis Using Support Vector Machine" is intended to achieve a Bachelor's Degree in Computer Science at the Informatics Engineering Study Program, Faculty of Computer Science, Soegijapranata Catholic University, Semarang.
The author realizes that in the preparation of this final report, not only from his efforts but from the guidance of various parties. Therefore, the authors would like to thank all parties who helped in the process of writing this Final Project.
1. My Parents and my family who give the best support during lecture.
2. My Mentor Mr. Yonathan that very kindness and helpful on my thesis.
3. All Lectures of Soegijapranata Catholic University IT 4. My friends who push me to do this thesis.
The author realizes that there may still be shortcomings in this Final Project Report.
Therefore, criticism and suggestions from readers are very useful for the author. Hopefully, this report can be useful for all those who read it.
vi ABSTRACT
In the current era, it has become a trend for people to order tickets online through online booking sites and applications, both in terms of transportation such as planes, vacations such as tours, and also lodging such as hotels. To get a good hotel, you need a review from people who have booked it. With the reviews written by visitors to the site or mobile application, they will then be analyzed so that an output can be produced that can be useful. One of the analytical models that can be done is sentiment analysis. The purpose of this study is to find the best method in analyzing sentiment based on the preprocessing of the data and hopefully it can produce knowledge in the form of sentiment analysis classification methods in order to determine a good method devoted to the data preprocessing section.
The algorithm used to make this sentiment classification analysis is the Support Vector Machine using 3 feature selection methods, namely not using the selection feature, using the chi square selection feature, and using the information gain selection feature. The process consists of five steps in this study, which include several activities. namely data collection, preprocessing, feature extraction, feature selection, classification, and calculating accuracy. In the process of calculating accuracy, I used the Confusion Matrix method to find the best method of the three based on the accuracy results obtained.
The results of the 3 uses of the feature selection method that were carried out were using the chi square feature selection method, the highest results were obtained, namely with an average accuracy of 86.68% which was followed by the use of the information gain selection feature which obtained an average accuracy of 85.78% and the last one was followed by the method not using the selection feature which got an average accuracy of 85.24%. From the results of the three methods, it can be concluded that the use of the chi square feature selection method in the case of sentiment analysis on hotel reviews is the best compared to the other two.
Keyword: feature selection, sentiment analysis, hotel review
vii
TABLE OF CONTENTS
COVER ... i
APPROVAL AND RATIFICATION PAGE ... ii
DECLARATION OF AUTHORSHIP ... iii
STATEMENT PAGE OF SCIENTIFIC PUBLICATIONS FOR ACADEMIC INTEREST iv ACKNOWLEDGMENT ... v
ABSTRACT ... vi
TABLE OF CONTENTS ... vii
LIST OF FIGURE ... x
LIST OF TABLE ... xi
INTRODUCTION ... 1
1.1. Background ... 1
1.2. Problem Formulation ... 2
1.3. Scope ... 2
1.4. Objective ... 3
LITERATURE STUDY ... 4
RESEARCH METHODOLOGY ... 10
3.1. Data Collection ... 10
3.2. Preprocessing Data ... 10
3.2.1. Document Cleaning Process (Cleansing) ... 10
3.2.2. Case Folding ... 10
3.2.3. Tokenizing ... 10
3.2.4. Filtering / Stopword Removal ... 10
3.2.5. Stemming ... 11
3.3. Feature Extraction ... 11
viii
3.4. Feature Selection ... 11
3.4.1. Chi - Square ... 11
3.4.2. Information Gain ... 11
3.5. Classification ... 12
3.6. Analysis ... 12
RESEARCH METHODOLOGY ... 14
4.1. Data Collection ... 14
4.2. Data Preprocessing ... 15
4.2.1. Document Cleaning Process (Cleansing) ... 15
4.2.2. Case Folding ... 15
4.2.3. Tokenizing ... 16
4.2.4. Stopword Removal ... 16
4.2.5. Stemming ... 16
4.3. Feature Extraction ... 17
4.3.1. TF (Term Frequency) ... 17
4.3.2. IDF (Inverse Document Frequency) ... 17
4.3.3. TF-IDF (Term Frequency - Inverse Document Frequency) ... 17
4.4. Feature Selection ... 18
4.4.1. Chi - Square ... 18
4.4.2. Information Gain ... 19
4.5. Classification ... 21
4.6. Confusion Matrix ... 25
IMPLEMENTATION AND RESULTS ... 26
5.1. Implementation ... 26
5.1.1. Data Collection and Import Library ... 26
ix
5.1.2. Preprocessing Data ... 26
5.1.3. Data Extraction ... 28
5.1.4. Data Selection (Chi – Square) ... 29
5.1.5. Data Selection (Information Gain) ... 29
5.1.6. Classification ... 30
5.1.7. Confusion Matrix ... 31
5.2. Results ... 31
CONCLUSION ... 37
REFERENCES ... 39 APPENDIX ... a
x
LIST OF FIGURE
Figure 4.1 Workflow ... 14
Figure 4.2 Cleansing ... 15
Figure 4.3 Case Folding ... 16
Figure 4.4 Tokenizing ... 146
Figure 4.5 Stemming ... 146
Figure 4.6 Support Vector Machine ... 214
Figure 4.7 Hyperplane... 23
xi
LIST OF TABLE
Table 3.1. Analysis ... 12
Table 4.1. Dataset ... 14
Table 4.2. Tf – IDF Result ... 18
Table 4.3. Chi - Square Example ... 19
Table 4.4. Entropy Label Data for Example ... 20
Table 4.5. Entropy A for Example ... 20
Table 4.6. Data Composition... 22
Table 5.1. Hyperparameter Test on Chi Square Results ... 32
Table 5.2. Hyperparameter Test on Information Gain Results ... 32
Table 5.3. Support Vector Machine only Results ... 33
Table 5.4. Support Vector Machine feature Selection Chi - Square Results ... 34
Table 5.5. Support Vector Machine with Information Gain Results ... 35
Table 5.6. Compare Accuracy of All Methods Results ... 35
Table 5.7. Compare F1-Score of All Methods Results ... 36