Comparative Analysis of Personality Detection using Random Forest and Multinomial Naive Bayes

Azka Zainur Azifa*, Warih Maharani, Prati Hutari Gani

School of Computing, Informatics Study Program, Telkom University, Bandung, Indonesia
Email: 1,*azkazainur@students.telkomuniversity.ac.id, 2wmaharani@telkomuniversity.ac.id, 3pratihutarigani@telkomuniversity.ac.id
Corresponding Author Email: azkazainur@students.telkomuniversity.ac.id

Abstract−Personality is the difference among individuals in thinking, feeling, and behaving. It is an individual characteristic formed by heredity and environmental influences, and personality type is one determinant of the kind of work a person is suited to. The Big Five personality model is one method used to detect personality; it divides characteristics into five dimensions, namely Openness, Conscientiousness, Extraversion, Neuroticism, and Agreeableness. Several studies have shown that personality can be identified through social media, one of which is Twitter. Much research on personality detection has been carried out using machine learning, but it typically focuses on a single model. In text detection, Multinomial Naive Bayes has more stable performance than Random Forest, while Random Forest achieves better accuracy than Multinomial Naive Bayes. This study therefore conducts a comparative analysis of Random Forest and Multinomial Naive Bayes. The best accuracy, 60.71%, is produced by the system with the Random Forest model, with a precision of 62% for the Openness personality and 57% for the Agreeableness personality.

Keywords: Twitter; Personality Detection; Big Five Personality; Random Forest; Multinomial Naïve Bayes

1. INTRODUCTION

Personality is the difference among individuals in thinking, feeling, and behaving. According to Gordon Allport, personality is "something" in the individual concerned [1]. Various theories are used to determine personality, such as the MBTI (Myers-Briggs Type Indicator) and the Big Five Personality. Personality can be seen in how individuals speak, share opinions, and communicate in daily life. Personality is also one determinant of the type of work that suits a person, because it involves taste and comfort [2].

People now share their social activities through online platforms called social media. Based on data collected in February 2022 by DataReportal [3], Twitter is one of the social media platforms most often used by Indonesians, with a share of 58.3%. Twitter users can share their social activities through tweets. Therefore, the Twitter platform can be used to detect personality, primarily through users' tweets.

Personality detection requires various labels as output to classify each individual. The personality theory used as a reference here is the Big Five. Psychologists currently regard the Big Five as a good description of trait structure [4]. The Big Five has five dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism [2]. The Openness dimension describes individuals with high curiosity, Conscientiousness describes very alert individuals, Extraversion describes individuals who socialize easily, Agreeableness describes individuals who tend to avoid conflict, and Neuroticism describes individuals who can control their emotions [2], [5]. This case requires an algorithm that can produce labeled output, namely a supervised learning algorithm.

Research on personality detection has been carried out with different supervised learning models. Rendy Putra Pratama [5] used the Random Forest model. Random Forest can produce good accuracy on imbalanced data because it uses Decision Trees as the basis of the classifier it builds, allowing it to detect correlations between data; that research achieved an accuracy of 69.23%. Nanda Yonda Hutama [6] used the Multinomial Naïve Bayes and Decision Tree models. Naïve Bayes follows an independence principle: it treats the attributes of the data as uncorrelated, so balanced data is needed to maximize detection. That research achieved an accuracy of 55% for Naïve Bayes and 33% for the Decision Tree, and concluded that Naïve Bayes is a stable model for detection.

Another study on Big Five personality detection was conducted by S. V. Therik [7] using the C4.5 model, obtaining an accuracy of 62.02%. This accuracy was achieved by weighting the data with TF-IDF and LIWC, an increase of 17.24% over the baseline. In addition, S. V. Therik implemented SMOTE, raising the accuracy to 76.92%, an increase of 32.1% over the baseline. Finally, the research by Y. A. I. Nanda [8] predicted Personality Disorder using Naïve Bayes, validated with a confusion matrix, and obtained an accuracy of 88.2%. These studies show good performance.

Random Forest and Multinomial Naïve Bayes differ in performance. In text detection, Multinomial Naïve Bayes has more stable performance because it relies on probability theory [6], while Random Forest tends to achieve better accuracy because it exploits more features than Multinomial Naïve Bayes.


Based on this background, this study compares the Random Forest and Multinomial Naive Bayes models to determine which algorithm is more suitable for personality detection. The personality theory focuses on the Big Five, and the data used are tweets from Twitter, labeled according to the Big Five rules. The program classifies the data and builds models using Random Forest and Multinomial Naive Bayes. Three tests were carried out to determine the best algorithm in this study, namely the algorithm with the highest accuracy and precision.

The study has the following limitations: the dataset consists of Indonesian-language tweets from Twitter, retrieved by crawling with the Twitter API; the system cannot handle typos; and the personality theory focuses on the Big Five.

2. RESEARCH METHODOLOGY

2.1 Research Stages

The system built in this study aims to detect personality based on the Big Five Personality using Random Forest and Naïve Bayes. The data used are obtained through the social media platform Twitter. Detection goes through several stages: data collection, data pre-processing, data splitting, model training, and evaluation, as shown in Figure 1.

Figure 1. System Flow Chart

2.2 Collecting Data

This study took data from Twitter users who were willing to be respondents and had filled out the Big Five personality questionnaire. The data collected are Indonesian-language tweets together with the username. Retrieval is carried out using the Twitter API. The crawling results are stored in .csv documents, which are then given a personality label based on the Big Five.
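As an illustration only (not the authors' actual script), the following sketch shows how such crawling could look with the Tweepy library; the credentials, usernames, and tweet count are placeholders.

```python
# Hypothetical crawling sketch with Tweepy; credentials and usernames are
# placeholders, and the per-user item count is illustrative only.
import csv
import tweepy

auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET"
)
api = tweepy.API(auth, wait_on_rate_limit=True)

with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["username", "created_at", "tweet_id", "text"])
    for username in ["respondent_1", "respondent_2"]:  # hypothetical respondents
        # Paginate through the user's timeline with a Cursor.
        for tweet in tweepy.Cursor(
            api.user_timeline, screen_name=username, tweet_mode="extended"
        ).items(1000):
            writer.writerow([username, tweet.created_at, tweet.id, tweet.full_text])
```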

The Big Five Personality Theory is often used in recruitment, so it is widely known to the public. The Big Five divides personality into five dimensions: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. Openness to Experience describes individuals interested in doing something new; the opposite trait is anxiety when given a new challenge [2]. Conscientiousness describes individuals with a high level of alertness, known for discipline and reliability. Extraversion describes individuals who like and are comfortable communicating with others. Agreeableness describes individuals who tend to avoid conflict [2]. Lastly, Neuroticism describes individuals who are able to control their emotions under stress or pressure [9].

2.3 Preprocessing Data

Preprocessing is done to remove noise from the data. It consists of data cleaning, case folding, stopword removal, stemming, and tokenization. Data cleaning removes symbols, numbers, and punctuation; case folding converts capital letters to lowercase; stopword removal deletes words that carry no meaning; stemming reduces each word to its root form [10]; and tokenization splits sentences into a series of words [10].
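A minimal sketch of this pipeline, assuming the PySastrawi package for Indonesian stopword removal and stemming (the paper does not name its tooling), could look as follows.

```python
# Preprocessing sketch; PySastrawi is an assumption, not named by the paper.
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stopword_remover = StopWordRemoverFactory().create_stop_word_remover()

def preprocess(text):
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # data cleaning: drop symbols, numbers, punctuation
    text = text.lower()                       # case folding
    text = stopword_remover.remove(text)      # stopword removal (Indonesian)
    text = stemmer.stem(text)                 # stemming: reduce words to root form
    return text.split()                       # tokenization

print(preprocess("Ga masalah sih tiap orang punya tujuan!"))
```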

2.4 Feature Extraction

The Feature Extraction stage is a data weighting process using the TF-IDF method. TF-IDF (Term Frequency–Inverse Document Frequency) is a word-weighting calculation applied after tokenization, stopword removal, and stemming [11]. This method represents words as vectors [11]. Weighting gives each word a value that becomes the input to the classification process [12]. This method is used for classification purposes because it produces accurate results. The TF-IDF formula can be seen in equation (1) [5].

W_dt = TF_dt * IDF_t (1)

where W_dt is the weight of term t in document d, TF_dt is the frequency of term t in document d, and IDF_t is the inverse document frequency of term t, i.e., the ratio of the total number of documents to the number of documents containing the term (commonly log-scaled).
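As an illustration of this weighting, scikit-learn's TfidfVectorizer computes the same TF * IDF product (with smoothing and normalization by default, details the paper does not specify).

```python
# TF-IDF weighting sketch on a toy corpus; rows are documents,
# columns are the term weights W_dt from equation (1).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["ga masalah tiap orang punya tuju", "peduli sama temennya"]  # toy corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```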

2.5 Random Forest Classification

Random Forest is a supervised learning method used for classification and regression. It creates different training subsets by sampling the data with replacement, and its output is based on voting [13].

Random Forest combines the results of several estimators built from a collection of Decision Trees; the more estimators used, the better the performance. Random Forest can manage noisy data because the model is not only trained on different subsets of the data but also uses different features to make decisions [14]. The algorithm requires labeled data to facilitate processing, and it can handle relatively large amounts of data.
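A minimal sketch of such a classifier, using the first parameter set reported later in Table 3 and synthetic stand-in data for the TF-IDF features and Big Five labels, might look as follows.

```python
# Random Forest sketch; the synthetic data stand in for the study's
# TF-IDF matrix and Big Five labels (an illustration, not the exact pipeline).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=274, n_features=100, n_informative=20,
                           n_classes=5, random_state=19)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10,
                                                    random_state=19)

rf = RandomForestClassifier(n_estimators=500,     # number of tree estimators
                            min_samples_split=5,  # min samples to split a node
                            criterion="entropy",  # split-quality measure
                            random_state=19)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # accuracy on the 10% test split
```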

2.6 Multinomial Naïve Bayes Classification

Naive Bayes is a supervised learning method based on probability and statistics. Multinomial Naive Bayes is a variant developed from the Naive Bayes algorithm to process text or documents by counting word frequencies [6]. It estimates a set of probabilities by counting the occurrences of data in the dataset [14]. Although naive, the algorithm has advantages in simplicity of calculation, high precision, and speed of processing. It requires labeled data to facilitate processing.
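A comparable sketch for Multinomial Naive Bayes on TF-IDF features follows; note that scikit-learn's MultinomialNB exposes only the smoothing parameter alpha (the random_state reported later in Table 7 presumably applies elsewhere in the pipeline).

```python
# Multinomial Naive Bayes sketch on toy TF-IDF features; the tweets and
# labels are invented placeholders, not study data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["ga masalah tiap orang", "peduli sama temen", "suka kondisi baru"]
labels = ["Agreeableness", "Agreeableness", "Openness"]

X = TfidfVectorizer().fit_transform(docs)  # non-negative features, as MNB requires
mnb = MultinomialNB(alpha=5.0)             # alpha from the Grid Search parameter set
mnb.fit(X, labels)
print(mnb.predict(X))
```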

2.7 Grid Search

Grid Search selects a combination of models and hyperparameters by testing and validating each combination [15].

Grid search uses cross-validation to perform validation automatically rather than manually. Its weakness is that tuning takes longer as hyperparameters are added, because the number of parameter combinations grows exponentially [16]; grid search follows a brute-force principle in processing these parameters.
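A sketch of how such a search could be run with scikit-learn's GridSearchCV; the grid values mirror the two parameter sets reported later in Table 3, and the data are synthetic stand-ins.

```python
# Grid Search sketch: every combination in param_grid is cross-validated.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=274, n_features=100, n_informative=20,
                           n_classes=5, random_state=19)

param_grid = {
    "n_estimators": [300, 500],
    "min_samples_split": [5],
    "criterion": ["gini", "entropy"],
    "max_features": [20, "sqrt"],   # "sqrt" stands in for the unreported default
}
search = GridSearchCV(RandomForestClassifier(random_state=19), param_grid, cv=5)
search.fit(X, y)              # brute force: every combination is validated
print(search.best_params_)
```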

2.8 SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is a method for balancing imbalanced data through oversampling. SMOTE works by finding the k nearest neighbors of each sample in the minority class [17], [18], then creating synthetic data between the minority samples and their k nearest neighbors at the desired duplication percentage [17]. In this way, SMOTE can avoid the problem of overfitting.
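A minimal sketch, assuming the imbalanced-learn package; k_neighbors=5 is the library default, not a value reported in the paper, and the imbalanced toy data are synthetic.

```python
# SMOTE sketch: minority classes are synthesized up to the majority count.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=274, n_features=20, n_informative=10,
                           n_classes=5, weights=[0.5, 0.2, 0.15, 0.1, 0.05],
                           random_state=19)
print("before:", Counter(y))

smote = SMOTE(k_neighbors=5, random_state=19)  # k nearest neighbors per minority sample
X_res, y_res = smote.fit_resample(X, y)
print("after:", Counter(y_res))
```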

2.9 Confusion Matrix

The confusion matrix is a medium for measuring performance and classification results with a two-dimensional matrix. The matrix is divided into two labels, positive and negative, with four combinations: True Positive, True Negative, False Positive, and False Negative [19], as illustrated in Table 1.

Table 1. Confusion Matrix

Actual Class    Predicted Class
                Positive    Negative
Positive        TP          FN
Negative        FP          TN

Classification performance can be measured by calculating accuracy, precision, recall, and F1-Score. Accuracy is the ratio of correct predictions; precision is the ratio of correct positive predictions to all predicted positives; recall is the ratio of correct positive predictions to all actual positives; and F1-Score is a weighted average of precision and recall [20]. These metrics assess the performance of the model, using the following equations.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (2)

Precision = TP / (TP + FP) (3)

Recall = TP / (TP + FN) (4)

F1-Score = 2 * (Precision * Recall) / (Precision + Recall) (5)
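These four equations transcribe directly into code; the worked example below uses the Openness counts implied by the Random Forest confusion matrix reported later in Table 5 (TP=13, FP=8, FN=2, TN=5 out of 28 test samples).

```python
# Direct transcription of equations (2)-(5); for the multi-class tables in
# Section 3 these metrics are computed per label (one-vs-rest).
def scores(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Openness label in Table 5: precision ~ 0.62, recall ~ 0.87.
print(scores(tp=13, tn=5, fp=8, fn=2))
```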

3. RESULTS AND DISCUSSION

This stage explains the results of Big Five personality detection using Random Forest and Multinomial Naïve Bayes. The data used consist of 274 users with 10,000 tweets per user, equipped with a personality label computed using the Big Five rules. The distribution of label data can be seen in Figure 2, which shows that the data are imbalanced. Testing then consists of nine split comparisons of test and training data, two parameter tests, and tests on imbalanced versus SMOTE-balanced data.

Figure 2. Distribution of data labels

The data are prepared before the classification process to obtain data suitable for classification, passing through the Case Folding, Stopword Removal, Stemming, and Tokenization processes. The results of data preparation can be seen in Table 2.

Table 2. Preprocessing Result

Process           Result
Actual Data       AttaMufid, 2021-06-21 11:20:30,1406935049753137160,Ga masalah sih tiap orang punya tujuan dan pencapaiannya masing2. Tapi mo appreciate aja temen2 yang sama2 di kondisi genting kek gini tapi tetap peduli sama temennya.
Case Folding      attamufid ga masalah sih tiap orang punya tujuan dan pencapaiannya masing tapi mo appreciate aja temen yang sama di kondisi genting kek gini tapi tetap peduli sama temennya
Stopword Removal  attamufid ga masalah sih tiap orang punya tujuan dan pencapaiannya masing tapi mo aja temen yang sama di kondisi genting kek gini tapi tetap peduli sama temennya
Stemming          attamufid ga masalah sih tiap orang punya tuju dan capai masing tapi mo aja temen yang sama di kondisi genting kek gin tapi tetap peduli sama temennya
Tokenization      ‘attamufid’, ‘ga’, ‘masalah’, ‘sih’, ‘tiap’, ‘orang’, ‘punya’, ‘tuju’, ‘dan’, ‘capai’, ‘masing’, ‘tapi’, ‘mo’, ‘appreciate’, ‘aja’, ‘temen’, ‘yang’, ‘sama’, ‘di’, ‘kondisi’, ‘genting’, ‘kek’, ‘gin’, ‘tapi’, ‘tetap’, ‘peduli’, ‘sama’, ‘temennya’

The next step is feature extraction, the data weighting process using the TF-IDF method. As an example of TF-IDF weighting, the word "masalah" has a weight of 0.0101, "paham" a weight of 0.014, and "peduli" a weight of 0.008.

3.1 Test Analysis on Random Forest

The test was carried out with nine combinations of test-to-training data ratios, using two parameter configurations based on Grid Search. This test uses the Random Forest model with imbalanced data. The parameters of the Random Forest model are in Table 3.

Table 3. Random Forest Parameters

First Parameter:  n_estimators = 500, min_samples_split = 5, criterion = 'entropy', random_state = 19
Second Parameter: n_estimators = 300, min_samples_split = 5, criterion = 'gini', max_features = 20, random_state = 19

The difference between the two parameter sets lies in the max_features parameter, which determines the number of features considered in order to produce the best performance. The results of the first test on Random Forest are in Table 4.

Table 4. Accuracy Results of Random Forest Testing

Test Data : Training Data    First Parameter Accuracy    Second Parameter Accuracy
10 : 90                      60.71%                      57.41%
20 : 80                      56.36%                      49.09%
30 : 70                      49.39%                      48.2%
40 : 60                      48.18%                      47.27%
50 : 50                      44.52%                      44.52%
60 : 40                      47.27%                      43.64%
70 : 30                      42.71%                      35.42%
80 : 20                      36.81%                      29.54%
90 : 10                      33.20%                      28.74%

The first test on Random Forest produces the highest accuracy, 60.71%, at the 10:90 comparison of test and training data using the first parameter set. With the second parameter set, the highest accuracy is 57.41%, also at the 10:90 split. The highest accuracy for the Random Forest model is therefore 60.71%. After the accuracy is obtained, the model is evaluated using a confusion matrix to represent the predictions of Random Forest; this representation takes only the best accuracy in this test. Table 5 visualizes the confusion matrix at the best Random Forest accuracy.

Table 5. Confusion Matrix of Random Forest

Actual Class        Predicted Class
                    Openness  Conscientiousness  Extraversion  Agreeableness  Neuroticism  Total
Openness            13        0                  0             2              0            15
Conscientiousness   1         0                  0             0              0            1
Extraversion        0         0                  0             0              0            0
Agreeableness       2         0                  0             4              0            6
Neuroticism         5         0                  0             1              0            6
Total               21        0                  0             7              0            28

Based on Table 5, the Random Forest model correctly predicted the Openness personality for 13 samples and the Agreeableness personality for 4 samples. After the confusion matrix calculation, the evaluation continues with the Precision, Recall, and F1-Score values for the best accuracy, obtained using equations (3), (4), and (5), so that the performance of the model can be evaluated further, as shown in Table 6.

Table 6. Precision, Recall, and F1-Score of Random Forest

Label               Precision    Recall    F1-Score
Openness            0.62         0.87      0.72
Conscientiousness   0            0         0
Extraversion        0            0         0
Agreeableness       0.57         0.67      0.62
Neuroticism         0            0         0

Based on Table 6, Random Forest can detect the Openness and Agreeableness labels: the Openness label obtains a precision of 0.62, a recall of 0.87, and an F1-score of 0.72, while the Agreeableness label obtains a precision of 0.57, a recall of 0.67, and an F1-score of 0.62. The Conscientiousness, Extraversion, and Neuroticism personalities produce a precision of 0 in the Random Forest model; none of these three personalities is predicted correctly. This is caused by the unequal distribution of the data shown in Figure 2. The evaluation results show that the Random Forest model is very good at predicting the Openness and Agreeableness personalities.

The characteristics of the data strongly influence classification. Based on the Random Forest confusion matrix, there are Neuroticism samples detected as Openness; one is the data of user @msyitams. This is caused by the distribution of Openness data exceeding that of Neuroticism data, so the model learns mostly about Openness.

3.2 Test Analysis on Multinomial Naïve Bayes

This test is carried out with nine combinations of test and training data and two parameter configurations: the default parameters of the model and parameters based on Grid Search. This test uses the Multinomial Naïve Bayes model with imbalanced data. The parameters of the Multinomial Naïve Bayes model are listed in Table 7.

Table 7. Multinomial Naïve Bayes Parameters

First Parameter:  alpha = 1.0, random_state = 42
Second Parameter: alpha = 5.0, random_state = 19

The difference between the two parameter sets lies in the values used: the first is the default parameter of Multinomial Naive Bayes, and the second is based on Grid Search. The results of the first test on Multinomial Naive Bayes are in Table 8.

Table 8. Accuracy Results of Multinomial Naïve Bayes Testing

Test Data : Training Data    First Parameter Accuracy    Second Parameter Accuracy
10 : 90                      21.43%                      60.71%
20 : 80                      29.1%                       49.09%
30 : 70                      35.37%                      46.34%
40 : 60                      40.37%                      46.78%
50 : 50                      44.12%                      46.32%
60 : 40                      29.27%                      36.58%
70 : 30                      39.8%                       42.41%
80 : 20                      39.4%                       40.82%
90 : 10                      28.16%                      40.41%

Testing on Multinomial Naïve Bayes produces the highest accuracy, 60.71%, at the 10:90 comparison of test and training data using the second parameter set. With the first parameter set, the highest accuracy is 44.12%, obtained at the 50:50 split. The highest accuracy for the Multinomial Naïve Bayes model is therefore 60.71%. After the accuracy is obtained, the model is evaluated using a confusion matrix to represent the predictions of Multinomial Naïve Bayes; this representation takes only the best accuracy in this test. Table 9 visualizes the confusion matrix at the best accuracy of the Multinomial Naive Bayes model.

Table 9. Confusion Matrix of Multinomial Naïve Bayes

Actual Class        Predicted Class
                    Openness  Conscientiousness  Extraversion  Agreeableness  Neuroticism  Total
Openness            17        0                  0             0              0            17
Conscientiousness   1         0                  0             0              0            1
Extraversion        0         0                  0             0              0            0
Agreeableness       3         0                  0             0              0            3
Neuroticism         7         0                  0             0              0            7
Total               28        0                  0             0              0            28

Based on Table 9, the Multinomial Naive Bayes model correctly predicted the Openness personality for 17 samples. Model performance is assessed from Precision, Recall, and F1-Score, obtained using equations (3), (4), and (5), so that the performance of the model can be evaluated further, as shown in Table 10.

Table 10. Precision, Recall, and F1-Score of Multinomial Naïve Bayes

Label               Precision    Recall    F1-Score
Openness            0.61         1         0.76
Conscientiousness   0            0         0
Extraversion        0            0         0
Agreeableness       0            0         0
Neuroticism         0            0         0

(7)

Based on Table 10, Multinomial Naive Bayes can only detect the Openness label, with a precision of 0.61, a recall of 1, and an F1-score of 0.76. The Agreeableness, Conscientiousness, Extraversion, and Neuroticism personalities produce a precision of 0, indicating that none is predicted correctly. This is caused by the unequal distribution of the data shown in Figure 2. The recall calculation shows that the Multinomial Naive Bayes model detects all Openness samples, so Multinomial Naïve Bayes is very good at detecting the Openness label.

The characteristics of the data influence how the Multinomial Naive Bayes model learns to classify. Based on Table 9, there are Agreeableness samples detected as Openness; one is the data of user @adinnda_ekhaa. This is caused by the distribution of Openness data exceeding that of Agreeableness data, so the model learns mostly about Openness.

3.3 Analysis of SMOTE Implementation Testing

Based on the test analyses of Random Forest and Multinomial Naïve Bayes, both algorithms reach the same highest accuracy value. The data used in those tests are imbalanced, so this test focuses on balancing the data using the SMOTE method. The SMOTE implementation is influenced by the division of test and training data, so SMOTE is applied to the three divisions with the best accuracy: 10:90, 20:80, and 30:70. The distribution of label data after SMOTE is shown in Table 11.

Table 11. SMOTE Label Data Distribution

Test Data : Training Data    Openness    Conscientiousness    Extraversion    Agreeableness    Neuroticism
10 : 90                      93          93                   93              93               93
20 : 80                      83          83                   83              83               83
30 : 70                      72          72                   72              72               72

SMOTE testing is carried out on both algorithms using the parameter sets that produced the best accuracy: the first parameter set for Random Forest and the second for Multinomial Naive Bayes. The results of the SMOTE test are shown in Table 12.

Table 12. SMOTE Test Accuracy Results

Test Data : Training Data    Random Forest (First Parameter)    Multinomial Naïve Bayes (Second Parameter)
10 : 90                      57.14%                             25%
20 : 80                      47.27%                             21.18%
30 : 70                      46.34%                             13.41%

SMOTE testing decreased the resulting accuracy. The largest decrease occurred in Multinomial Naive Bayes, at 35.7%, while Random Forest decreased by 3.57%. The decrease in Multinomial Naive Bayes indicates that it is not suited to data balanced with SMOTE, so the best result in the SMOTE balancing test was obtained with Random Forest. Evaluation using the confusion matrix in the SMOTE test is applied only to the test with the best accuracy, namely the Random Forest model with the first parameter set at the 10:90 split of test and training data, in order to visualize the predictions obtained. The confusion matrix of the SMOTE test is shown in Table 13.

Table 13. Confusion Matrix of the SMOTE Test (Random Forest)

Actual Class        Predicted Class
                    Openness  Conscientiousness  Extraversion  Agreeableness  Neuroticism  Total
Openness            15        0                  0             2              0            17
Conscientiousness   1         0                  0             0              0            1
Extraversion        0         0                  0             0              0            0
Agreeableness       3         0                  0             0              0            3
Neuroticism         7         0                  0             0              0            7
Total               26        0                  0             2              0            28

Based on Table 13, the Random Forest model correctly predicted the Openness personality for 15 samples. The evaluation continues by calculating the Precision, Recall, and F1-Score values based on equations (3), (4), and (5); the results can be seen in Table 14.

Table 14. Precision, Recall, and F1-Score of the SMOTE Test

Label               Precision    Recall    F1-Score
Openness            0.58         0.88      0.70
Conscientiousness   0            0         0
Extraversion        0            0         0
Agreeableness       0            0         0
Neuroticism         0            0         0

Based on Table 14, the Agreeableness, Conscientiousness, Extraversion, and Neuroticism personalities were not successfully predicted by the model, since their precision, recall, and F1-Score values are 0. This is caused by the unequal distribution of the training and test data, as the test data distribution in Table 13 shows. The evaluation results indicate that the Random Forest model is superior to the Multinomial Naive Bayes model.

4. CONCLUSION

This research conducts a comparative analysis of a system built with Random Forest and Multinomial Naive Bayes to detect the Big Five personalities of Twitter users. The Big Five has five dimensions: Openness to Experience, Extraversion, Neuroticism, Agreeableness, and Conscientiousness. The testing consists of three parts: a comparison of test and training data splits, a comparison of parameters, and a comparison using data balanced with SMOTE. Random Forest testing produces its highest accuracy, 60.71%, at the 10:90 comparison of test and training data. These results are evaluated using a confusion matrix and precision calculation: the Openness label obtains a precision of 0.62 and the Agreeableness label a precision of 0.57. Therefore, the Random Forest model is suitable for predicting the Openness and Agreeableness labels.

The Multinomial Naïve Bayes test also produces its highest accuracy at 60.71%. Its evaluation yields, for the Openness label, a precision of 0.61, a recall of 1, and an F1-score of 0.76; since the recall for Openness is 1, the Multinomial Naïve Bayes model detects all Openness samples. In the SMOTE test, both models experience a decrease in accuracy, with the accuracy of Random Forest remaining superior to that of Multinomial Naïve Bayes, so Random Forest is a reasonably stable model for detecting the Big Five personalities. Based on the accuracy and precision obtained from each test, the best model in this comparative analysis of Big Five personality detection is Random Forest.

REFERENCES

[1] Sari, "Pengaruh Beban Kerja, Pengalaman, Tipe Kepribadian, dan Kompetensi Auditor terhadap Skeptisme Profesional," Uii.ac.id, 2018.
[2] T. Simanullang, "Pengaruh Tipe Kepribadian the Big Five Model Personality terhadap Kinerja Aparatur Sipil Negara (Kajian Studi Literatur Manajemen Keuangan)," vol. 2, no. 2, 2021, doi: 10.38035/jmpis.v2i2.
[3] "Digital 2022: Indonesia," DataReportal – Global Digital Insights, Feb. 15, 2022.
[4] M. Fikry et al., "Klasifikasi Kepribadian Big Five Pengguna Twitter dengan Metode Naïve Bayes," 2018.
[5] R. P. Pratama and W. Maharani, "Predicting Big Five Personality Traits Based on Twitter User Using Random Forest Method," 2021 International Conference on Data Science and Its Applications (ICoDSA), Oct. 2021.
[6] N. Y. Hutama, K. M. Lhaksmana, and I. Kurniawan, "Text Analysis of Applicants for Personality Classification Using Multinomial Naïve Bayes and Decision Tree," Jurnal Infotel, vol. 12, no. 3, pp. 72–81, Aug. 2020, doi: 10.20895/infotel.v12i3.505.
[7] S. V. Therik and E. B. Setiawan, "Deteksi Kepribadian Big Five Pengguna Twitter dengan Metode C4.5," eProceedings of Engineering, vol. 8, 2021.
[8] Y. Aditama, I. Nanda, Bety, and W. Sari, Techno Nusa Mandiri: Journal of Computing and Information Technology, vol. 17, no. 1, 2020. [Online]. Available: www.amikom.ac.id
[9] S. A. Utami, N. Grasiaswaty, and S. Z. Akmal, "Hubungan Tipe Kepribadian Berdasarkan Big Five Theory Personality dengan Kebimbangan Karier pada Siswa SMA," 2018.
[10] K. Kargın, "NLP: Tokenization, Stemming, Lemmatization and Part of Speech Tagging," Medium, Feb. 27, 2021.
[11] M. A. Rofiqi, Abd. C. Fauzan, A. P. Agustin, and A. A. Saputra, "Implementasi Term-Frequency Inverse Document Frequency (TF-IDF) untuk Mencari Relevansi Dokumen Berdasarkan Query," ILKOMNIKA: Journal of Computer Science and Applied Informatics, vol. 1, no. 2, pp. 58–64, Dec. 2019, doi: 10.28926/ilkomnika.v1i2.18.
[12] R. Kosasih and A. Alberto, "Analisis Sentimen Produk Permainan Menggunakan Metode TF-IDF dan Algoritma K-Nearest Neighbor," vol. 6, no. 1, 2021, doi: 10.30743/infotekjar.v6i1.3893.
[13] C. Sindermann, R. Mõttus, D. Rozgonjuk, and C. Montag, "Predicting current voting intentions by Big Five personality domains, facets, and nuances – A random forest analysis approach in a German sample," Personality Science, vol. 2, Sep. 2021, doi: 10.5964/ps.6017.
[14] R. M. Awangga and N. H. Khonsa', "Analisis Performa Algoritma Random Forest dan Naive Bayes Multinomial pada Dataset Ulasan Obat dan Ulasan Film," InComTech: Jurnal Telekomunikasi dan Komputer, vol. 12, no. 1, p. 60, Apr. 2022, doi: 10.22441/incomtech.v12i1.14770.
[15] A. Toha, P. Purwono, and W. Gata, "Model Prediksi Kualitas Udara dengan Support Vector Machines dengan Optimasi Hyperparameter GridSearch CV," Buletin Ilmiah Sarjana Teknik Elektro, vol. 4, no. 1, pp. 12–21, May 2022, doi: 10.12928/biste.v4i1.6079.
[16] W. Nugraha and A. Sasongko, "Hyperparameter Tuning on Classification Algorithm with Grid Search," Sistemasi, vol. 11, no. 2, p. 391, May 2022, doi: 10.32520/stmsi.v11i2.1750.
[17] R. Siringoringo, "Klasifikasi Data Tidak Seimbang Menggunakan Algoritma SMOTE dan k-Nearest Neighbor," 2018.
[18] E. Sutoyo and M. Asri Fadlurrahman, "Penerapan SMOTE untuk Mengatasi Imbalance Class dalam Klasifikasi Television Advertisement Performance Rating Menggunakan Artificial Neural Network," JEPIN (Jurnal Edukasi dan Penelitian Informatika).
[19] A. Sabrani, "Klasifikasi Artikel Online tentang Gempa di Indonesia Menggunakan Multinomial Naïve Bayes," Publikasi Tugas Akhir S-1 PSTI FT-UNRAM, 2020.
[20] R. Arthana, "Mengenal Accuracy, Precision, Recall dan Specificity serta yang diprioritaskan dalam Machine Learning," Medium, Apr. 05, 2019.
