Ni Made Dwipadini Puspitarini, Copyright © 2023, MIB, Page 267

Big Five Personality Detection Based on Social Media

Using Pre-Trained IndoBERT Model and Gaussian Naive Bayes

Ni Made Dwipadini Puspitarini*, Yuliant Sibaroni, Sri Suryani Prasetiyowati School of Computing, Informatics, Telkom University, Bandung, Indonesia

Email: 1[email protected], 2[email protected], 3[email protected] Correspondence Author Email: [email protected]

Abstract−A person's personality offers a thorough understanding of them and plays a significant role in how well they will perform at work in the future. It is no wonder that it has attracted the interest of researchers in developing personality detection systems. Although much research on personality detection through social media has been conducted, the task remains challenging, especially with conventional machine learning, which is still insufficient to make a personality detection system perform well. The purpose of this research is to detect Big Five personalities based on Indonesian tweets and to increase performance by combining machine learning with deep learning, namely Gaussian Naive Bayes and the IndoBERT model. The proposed combined model sums the log probability vectors of the two models.

A total of 3,342 tweets gathered from 111 Twitter accounts were used as the dataset. This research also applied min-max normalization to rescale the data. The results show that, over the entire dataset, the combined model achieves a higher accuracy score than Gaussian Naive Bayes by 5.42% and IndoBERT by almost 2%, which indicates that the combined model is better than either model alone.

Keywords: Big Five Personality; Combined Model; Gaussian Naive Bayes; IndoBERT; Log Probability Value; Personality Detection.

1. INTRODUCTION

The role of social media has developed in this era: initially only for networking, it is now also used as a place for self-presentation [1]. Currently, social media is no longer an unfamiliar place for most people to pour out their hearts and minds, as well as to voice their opinions. Not surprisingly, social media has become a treasure trove of research material for researchers. One example is tweets or retweets shared by users on their social media accounts, which researchers can use as data to analyze and predict personality traits [2].

Personality is a characteristic that reflects how individuals react to their environment. Each individual is unique, which causes them to have personalities different from each other. This characteristic tends to be hard to change, yet not impossible, since individuals continually interact with the environment around them [3]. Many frameworks can be used to predict a person's personality, for example, the Big Five Personality Traits, the Myers-Briggs Type Indicator (MBTI), StrengthsFinder, and the DISC Personality assessment [4]. The Big Five has long been used to predict how someone will perform in the future, especially by companies recruiting new employees [5]. Predicting someone's performance by looking at the personality expressed through their social media account is said to provide a better picture than holding a personality test [5].

Several previous studies have successfully detected personalities based on social media data. A previous study [6] used the Naive Bayes Classifier to predict Big Five personalities based on Twitter and achieved an accuracy of 42.71%. The dataset in that study was dominated by the Agreeableness label, with a total of 167 accounts, while the least frequent label was Extraversion, with only 14 accounts. The authors stated that the low accuracy was caused by the imbalanced data. A similar study [7] detected the Big Five personalities via Facebook using the Naive Bayes method. That study was conducted with two test scenarios: the first tested whether changing the form of a word during preprocessing would affect accuracy, and the second implemented a prior probability value. The highest accuracy achieved was 59.9% on the first test and 60.2% on the second. Another study [8] detected the Big Five personalities via Twitter using Naive Bayes. The dataset consisted of 1,500 tweets obtained from 15 accounts, with labeling done directly by a psychologist. Initially, 95 accounts were used; however, to overcome data imbalance, only three were taken for each Big Five personality type, reducing the total to 15 accounts. The accuracy obtained in that study was outstanding: 86.66%. A further study [9] predicted Big Five personalities by comparing several machine learning approaches: Multinomial Naive Bayes, AdaBoost, and Linear Discriminant Analysis (LDA). The results showed that Multinomial Naive Bayes performed better than AdaBoost and LDA, achieving the highest accuracy score of 73.43% for the Openness trait. A comparative study [10] evaluated Naive Bayes and Support Vector Machine (SVM) for predicting Big Five personality traits; Naive Bayes outperformed SVM with an accuracy score of 60%.

Although several previous studies have been successful, personality detection is still a challenging problem in cognitive computation, and a conventional approach is most likely inadequate to obtain promising results [11]. Based on the studies described above, Naive Bayes actually performs better at detecting Big Five personality traits than other machine learning methods. However, most of the achieved accuracy scores are still insufficient for predicting the five labels of the Big Five personality. Thus, it is necessary to find ways to optimize machine learning performance for personality detection. Another study on personality detection [12] predicted Big Five personality traits by combining machine learning and deep learning, namely SVM and IndoBERT. The idea was to use BERT as a semantic approach and SVM as a classifier. The dataset consisted of 511,617 tweets from 295 Twitter accounts. As the baseline model, SVM achieved an accuracy score of 57.97%. After combining SVM with BERT, SMOTE, and LIWC, the accuracy score increased to 80.07%, showing that combining SVM with BERT can improve performance. In previous studies, BERT was still more widely used for classification cases such as sentiment analysis, and it is still rare to find studies that apply BERT to personality detection [12]. A similar approach combining machine learning and deep learning was also taken by [13] to detect emotion from tweets using BERTweet and SVM. The idea of that research was to sum the log probability values from each model. The dataset had five labels: joy, anger, fear, sadness, and neutral. The results after combining the two models to predict five labels were outstanding, with accuracy scores of 84% for SVM, 89% for BERTweet, and 91% for the proposed combined model. A previous study [14] combined BERT with a Bayesian Network to classify governance texts. As with the previous results, the combination performed better, with an accuracy score of 94.59%, 3.23% higher than BERT alone and 18.06% higher than the Bayesian Network alone.

Based on the explanation above, combining machine learning with deep learning helps a model improve its performance. Therefore, this research aims to combine the Gaussian Naive Bayes and IndoBERT models to detect Big Five personalities based on the social media platform Twitter. According to previous studies, this proposed combined model has the potential to produce good performance. The experiments detect Big Five personalities from tweets in Indonesian; all tweets in other languages are removed because IndoBERT is a BERT base model trained on Indonesian-language documents.

2. RESEARCH METHODOLOGY

2.1 System Design

The system design for this research is illustrated in Figure 1. The personality detection system begins by crawling data to collect the dataset. The collected data is filtered and selected before continuing to the next steps, labeling and preprocessing. After preprocessing, the data is split into training and test sets using K-Fold Cross Validation. The training data enters the classification process using IndoBERT and Gaussian Naive Bayes. The output of this classification is the log probability value of each method, which is summed to form the combined model. The process then continues by predicting on the test data using the combined model results. Last, the model is evaluated to determine its performance.

Figure 1. System Flowchart

2.2 Data Crawling

Data crawling is a process or step of extraction with the aim of collecting data to be used as a dataset for analysis.

This research collects data from the social media platform Twitter using the Twitter API. The attributes taken in this process are the tweet text and the username of each Twitter account. The total data successfully collected in this research is 90,260 tweets from 111 Twitter accounts. A total of 22,208 Indonesian tweets were picked from the 90,260, and these Indonesian tweets were filtered again to form the final dataset.

2.3 Labeling

The Big Five divides personality into five traits: Openness (O), Conscientiousness (C), Extraversion (E), Agreeableness (A), and Neuroticism (N) [12]. Openness is a personality trait that is open to new things, tends to be curious, and has a high level of imagination. Conscientiousness describes a hard-working and goal-oriented individual. Extraversion describes a personality that is cheerful, happy, and very sociable. Agreeableness describes a personality that prefers to avoid problems that might occur and is humble and generous. Last, Neuroticism describes a personality that easily becomes worried, sad, and depressed.

The labeling process was not done on all collected data, only on selected data. On average, a total of 30 tweets from each Twitter account were picked as the dataset to be labeled. The labeling was done manually based on the characteristics of each Big Five personality listed in previous research [15]. The distribution of labeled tweets for each Big Five personality is shown in Figure 2.

Figure 2. Personality Traits Distribution

2.4 Preprocessing Data

Preprocessing is an essential step in personality prediction [16]. Preprocessing prepares the data and converts it into a usable format. In this research, the data preprocessing steps consist of case folding, data cleaning, stopword removal, and stemming. Case folding converts capital letters to lowercase. Data cleaning removes symbols, URLs, and numbers. Stopword removal removes unnecessary or meaningless words in a sentence using the Sastrawi library. Last, stemming converts words into their base forms.
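The steps above can be sketched as a small pipeline. This is a minimal illustration, not the paper's exact implementation: the paper uses the Sastrawi library for stopword removal and stemming, while here a tiny illustrative stopword set and an identity (no-op) stemming step keep the example self-contained.

```python
import re

# Illustrative subset of Indonesian stopwords; the paper uses the Sastrawi list.
STOPWORDS = {"deh", "dia", "dan", "sih", "yg", "yang"}

def preprocess(tweet: str) -> str:
    text = tweet.lower()                                # case folding
    text = re.sub(r"http\S+|@\w+|[^a-z\s]", " ", text)  # cleaning: URLs, mentions, symbols, numbers
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopword removal
    # Stemming would map each token to its root form (Sastrawi in the paper);
    # here tokens are kept as-is to avoid the external dependency.
    return " ".join(tokens)

print(preprocess("@Askrlfess gatau deh, dia sibuk skripsian dan jarang chat🥲"))
```

With Sastrawi's stemmer plugged in, "skripsian" would additionally reduce to "skripsi", matching the example in Table 1.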

The preprocessing result in this dataset is shown in Table 1.

Table 1. Preprocessed Tweets

No.  Tweet                                                               Preprocessed Tweet
1    @HeyBudie: Bukannya checkout barang malah checkout emosi.           bukan checkout barang bahkan checkout emosi
2    @Askrlfess gatau deh, dia sibuk skripsian dan jarang chat🥲          gatau deh sibuk skripsi jarang chat
3    Saya sih merasa dia yang emang paling ngehargain dibanding yg lain  sih rasa emang paling ngehargain banding yang lain

2.5 K-Fold Cross Validation

K-Fold Cross Validation is a testing technique that measures the performance of a system by arbitrarily partitioning the data and grouping it according to the value of k [17]. Its use is therefore tied to splitting the data into test and training sets. The divided dataset is processed k times, where each experiment uses one partition as the test data and the rest as training data. Thus, every fold appears in the training set k-1 times, while every fold appears in the test set only once [18]. Most previous studies choose ten as the value of k for sentiment analysis, personality detection, and similar tasks; therefore, k is set to ten in this research.
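The splitting scheme can be illustrated without any library; the sketch below mimics what K-Fold Cross Validation does with k partitions (in practice a library implementation such as scikit-learn's KFold would be used):

```python
def kfold_split(n_samples: int, k: int = 10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Distribute samples over k folds as evenly as possible.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                # this fold is the test set once
        train = indices[:start] + indices[start + size:]  # the other k-1 folds train
        yield train, test
        start += size

# With 100 samples and k = 10, each fold holds 10 test and 90 training samples,
# and each sample appears in a test set exactly once.
folds = list(kfold_split(100, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```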

2.6 BERT

BERT stands for Bidirectional Encoder Representations from Transformers. BERT works bidirectionally, meaning that BERT learns information by reading a series of words from both sides, from left to right and right to left.

BERT uses Transformers to study the contextual relationship in a series of words. Transformers have an architecture consisting of encoder and decoder. The function of the encoder is to change the input in the form of a sentence into an output representation. Meanwhile, the decoder works to produce a prediction. An illustration of how BERT works is shown in Figure 3.

BERT is pre-trained on two unsupervised tasks: MLM (Masked Language Model) and NSP (Next Sentence Prediction) [19]. BERT was trained on 800 million words from the English-language BooksCorpus dataset. Since this research uses Indonesian-language tweets, IndoBERT is applied instead. IndoBERT has an architectural design like BERT but is pre-trained in two phases on Indo4B, a corpus of four billion words and around 250 million sentences in Indonesian. The first pre-training phase uses a maximum sequence length of 128, and the second phase uses a maximum sequence length of 512. The larger size in the second phase allows IndoBERT to learn more than in the first phase.

Figure 3. BERT Architecture [19]

2.7 Naive Bayes

Naive Bayes is a classification method based on the Bayes theorem, influenced by probability values, in which all attributes are assumed to be independent [20]. The concept of this classification method is to predict future probabilities based on attributes observed in the past. Some advantages of Naive Bayes are that it is relatively simple, fast, and efficient, and it can still work well on limited training data [20]. Mathematically, the Naive Bayes rule is given in equation (1). The Naive Bayes variant proposed in this research is Gaussian Naive Bayes, which assumes that the attribute values within each label follow a Gaussian distribution. This formulation is used when the attributes have continuous values; the Gaussian likelihood is given in equation (2).

To boost Gaussian Naive Bayes performance, this research also implements TF-IDF as feature extraction, since feature extraction is an essential process [21]. TF-IDF is a term weighting technique that represents how important a word is in a data set. TF-IDF is the product of two quantities, Term Frequency (TF) and Inverse Document Frequency (IDF). TF counts how many times a word appears in a document, while IDF takes the log of the ratio of the total number of documents to the number of documents containing the searched word [22]. The TF-IDF equation is given in equation (3).

P(C|X) = (P(X|C) × P(C)) / P(X)    (1)

P(xᵢ | y) = (1 / √(2πσᵧ²)) exp(−(xᵢ − μᵧ)² / (2σᵧ²))    (2)

TF-IDF(t, d, D) = TF(t, d) × log(N / df_t)    (3)
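As a hedged sketch of how equations (2) and (3) come together in practice, the snippet below fits a Gaussian Naive Bayes classifier on TF-IDF features with scikit-learn. The toy tweets and labels are invented for illustration, and GaussianNB requires a dense feature matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy preprocessed tweets with illustrative Big Five labels.
tweets = ["senang sekali ketemu teman baru", "kerja keras kejar target",
          "cemas dan sedih terus", "penasaran coba hal baru"]
labels = ["E", "C", "N", "O"]

# TF-IDF feature extraction (equation 3); densify for GaussianNB.
X = TfidfVectorizer().fit_transform(tweets).toarray()

# Gaussian likelihood per feature and label (equation 2).
clf = GaussianNB(var_smoothing=1.0)  # the value the paper reports from grid search
clf.fit(X, labels)

# Log-probability vectors: the quantity later summed in the combined model.
log_probs = clf.predict_log_proba(X)
print(log_probs.shape)
```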

2.8 Gaussian Naive Bayes combined with IndoBERT

The main idea of this combination approach is to extract the log probability values of each model from the classification process, then sum the IndoBERT log probability values with the Gaussian Naive Bayes log probability values. However, the scales of the log probabilities of IndoBERT and Gaussian Naive Bayes differ. Therefore, data normalization is applied before summing the log probabilities so that the values fall in the same range. The normalization method used in this research is min-max normalization, which rescales data into a range of [0, 1] or [-1, 1] [23]. The log probability values are summed after normalization. The flowchart for this combined approach is illustrated in Figure 4.
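A minimal sketch of this combination step, with made-up log probability vectors standing in for real model outputs: each row is min-max rescaled to [0, 1], the two matrices are summed, and the arg-max column gives the predicted class.

```python
import numpy as np

labels = ["O", "C", "E", "A", "N"]

# Illustrative log-probability vectors for one tweet (not real model outputs).
gnb_logp = np.array([[-4.1, -0.9, -2.5, -3.0, -1.6]])   # Gaussian Naive Bayes
bert_logp = np.array([[-2.2, -1.8, -0.4, -2.9, -1.3]])  # IndoBERT

def min_max(x):
    # Rescale each row into [0, 1] so the two models share a common range.
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    return (x - lo) / (hi - lo)

combined = min_max(gnb_logp) + min_max(bert_logp)    # sum of rescaled log probabilities
pred = [labels[i] for i in combined.argmax(axis=1)]  # highest value -> predicted class
print(pred)  # → ['E']
```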

Figure 4. Combined Model Flowchart

2.9 Evaluation

Evaluation is the step that analyzes whether the system built has good predictive results. In this research, performance is measured through accuracy, precision, and recall. These metrics are calculated from the confusion matrix, which contains the values of true positive (TP), false positive (FP), true negative (TN), and false negative (FN). These four values are shown in Table 2.

Table 2. Confusion Matrix

                          Predicted Label
                          Positive    Negative
Actual Label  Positive    TP          FN
              Negative    FP          TN

a. Accuracy

Accuracy describes how accurately the system can correctly predict and classify personality. It is the ratio of correct predictions to the entire data. The formula for accuracy is shown in equation (4).

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (4)

b. Precision

Precision is a value that describes the ratio of true positive predictions to all data that is positive. The equation formula for precision is shown in equation (5).

Precision = TP / (TP + FP)    (5)

c. Recall

Recall describes the ratio of true positive predictions to all data that is actually positive. The formula for recall is shown in equation (6).

Recall = TP / (TP + FN)    (6)
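Equations (4)-(6) can be checked with a few lines; the confusion-matrix counts below are illustrative only.

```python
# Illustrative confusion-matrix counts for one label.
tp, fp, tn, fn = 70, 10, 80, 40

accuracy = (tp + tn) / (tp + fp + tn + fn)  # equation (4)
precision = tp / (tp + fp)                  # equation (5)
recall = tp / (tp + fn)                     # equation (6)

print(round(accuracy, 2), round(precision, 2), round(recall, 2))  # → 0.75 0.88 0.64
```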

3. RESULT AND DISCUSSION

This research consists of three main processes to evaluate the proposed combined model. First, we look at the performance of Gaussian Naive Bayes as conventional machine learning for Big Five personality detection. Second, we check the performance of IndoBERT as deep learning for the same task. These two processes are conducted first so that we can find out whether the last process provides an uplift over their performance results. As the third process, we combine both models by summing the log probability value of each model. The total dataset size for this research is 3,342 tweets collected from 111 Twitter accounts. The same K-Fold cross-validation setting, k = 10, was used to train all models.

Table 3. Gaussian Naive Bayes Performance

k   Accuracy (%)   Precision (%)           Recall (%)
                   O    C   E   A   N      O    C   E   A   N
1   60             100  59  61  78  46     24   57  87  60  72
2   55             100  61  46  76  47     8    67  81  46  77
3   54             75   40  47  91  47     10   79  93  55  78
4   63             93   66  69  68  43     28   72  90  44  85
5   59             79   63  60  77  36     35   50  83  57  79
6   57             83   60  51  63  51     18   63  87  51  61
7   52             90   48  56  62  39     15   71  85  41  76
8   62             57   67  60  85  54     14   61  82  47  84
9   56             100  60  32  74  56     21   71  74  57  83
10  57             83   56  57  72  44     23   67  76  50  76

Table 4. Precision and Recall for Each Label in GNB

                O   C   E   A   N
Precision (%)   86  57  54  74  46
Recall (%)      19  65  84  51  77

The result of the first process is shown in Table 3. For training the Gaussian Naive Bayes model, a variable smoothing parameter was included to optimize performance; a value of 1.0 was selected via grid search. Furthermore, this research also implemented TF-IDF as the feature extraction method for Gaussian Naive Bayes. Overall, the Gaussian Naive Bayes accuracy score after training ten times is 57.51%. This score means the model still has trouble predicting three of the five labels in the Big Five personality. From the detailed results in Table 4, we discovered that Extraversion is the easiest label to predict, while Openness is the hardest. As mentioned, the model has trouble predicting three of the five labels: Openness, Conscientiousness, and Agreeableness.

Meanwhile, the model predicts Extraversion and Neuroticism well. In addition, the highest accuracy achieved by Gaussian Naive Bayes is 63%, at k = 4; the main contributors to the accuracy in this fold are Conscientiousness, Extraversion, and Neuroticism. The lowest score, 52%, occurs at k = 7, because there the model performed worst at predicting the Openness personality.
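The var_smoothing selection could be reproduced along the lines below; the candidate grid and the random toy data are assumptions for illustration, not the paper's actual search space.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Toy data standing in for the TF-IDF feature matrix and labels.
rng = np.random.RandomState(0)
X = rng.rand(50, 8)
y = rng.randint(0, 2, size=50)

# Hypothetical candidate grid; the paper only reports that 1.0 was selected.
grid = {"var_smoothing": [1e-9, 1e-3, 1e-1, 1.0]}
search = GridSearchCV(GaussianNB(), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```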

Table 5. IndoBERT Performance

k   Accuracy (%)   Precision (%)          Recall (%)
                   O   C   E   A   N      O   C   E   A   N
1   59             50  46  71  68  56     53  52  73  55  62
2   51             41  51  59  62  41     39  64  54  58  40
3   62             51  83  63  65  58     73  41  57  82  57
4   66             54  56  78  59  81     64  49  87  57  63
5   64             65  61  69  67  57     51  64  69  72  57
6   57             54  55  63  68  44     44  60  63  55  67
7   66             58  58  78  59  80     76  49  84  51  67
8   66             75  53  77  61  67     54  65  70  59  79
9   62             60  49  70  68  67     62  57  50  61  80
10  61             60  70  62  60  55     52  51  79  62  66

Table 6. Precision and Recall for Each Label in IndoBERT

                O   C   E   A   N
Precision (%)   56  57  70  64  59
Recall (%)      58  55  70  61  63

For the second process, we used a batch size of 8 and 3 epochs to train the IndoBERT model. Table 5 shows the IndoBERT performance for Big Five personality detection. After training ten times using K-Fold cross-validation, the accuracy score over the entire dataset was 61.24%, meaning that, in this research, IndoBERT predicts Big Five personality better than Gaussian Naive Bayes. From Table 6, Extraversion is still the easiest label to predict, while Conscientiousness is the hardest. IndoBERT's highest accuracy score is 66%, at k = 4, 7, and 8; the lowest, 51%, is at k = 2.

According to previous studies, it has been proven that combining machine learning with deep learning can optimize the model's performance. Therefore, in the third process, we combine Gaussian Naive Bayes with IndoBERT based on the previously obtained result stated in Table 3 and Table 5. The main idea of this proposed combined model is summing the Gaussian Naive Bayes log probability vector with IndoBERT’s in every iteration to produce the results of the combined model. To equalize the data range, we implemented data normalization using a min-max scaler before adding the log probability value. Afterward, the highest probability value in the same row will be designated as the predicted personality class. The accuracy score between Gaussian Naive Bayes, IndoBERT, and combined model on every iteration is shown in Table 7.

Table 7. Comparison of Accuracy Score on Each Iteration

k   Accuracy (%)
    Gaussian NB   IndoBERT   Combined Model
1   60            59         63
2   55            51         54
3   54            62         63
4   63            66         68
5   59            64         66
6   57            57         59
7   52            66         67
8   62            66         68
9   56            62         62
10  57            61         61

The results of the third process in Table 7 show that combining Gaussian Naive Bayes and IndoBERT by summing the log probability values provides an uplift in accuracy score in almost every iteration. Only 1 out of 10 folds experienced a decrease, at k = 2. Based on the overall result in Table 8, the proposed combined model increases the overall accuracy score for the entire dataset. Overall, the accuracy score of the combined model is 62.93%, 5.42% higher than Gaussian Naive Bayes and almost 2% higher than IndoBERT.

Table 8. Comparison of Overall Accuracy Score

Model            Accuracy (%)
Gaussian NB      57.51
IndoBERT         61.24
Combined Model   62.93

From the combined model's results above, we found that some personalities resemble each other. Conscientiousness is similar to Agreeableness, which makes the model tend to confuse them: both express positive emotions, persistence, enthusiasm, hard work, trust, and generosity. Openness and Neuroticism are also similar, since Openness expresses disagreement, anger, and negative emotions, so its tweets tend to be gloomy and less cheerful. Meanwhile, Extraversion tweets are the most expressive, pouring out joy, happiness, and laughter, which makes this personality significantly different from the other four. This discovery aligns with the results in Tables 4 and 6, where Extraversion is the easiest personality to predict.

Therefore, a further experiment is needed to find out whether the combined model still provides an uplift in accuracy when the resembling personalities are merged. According to the explanation above, we merge Conscientiousness with Agreeableness and Openness with Neuroticism, while Extraversion is not merged with any personality. The accuracy scores for this experiment for Gaussian Naive Bayes, IndoBERT, and the combined model in every iteration are shown in Table 9.
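The merging experiment amounts to remapping the five labels to three groups before scoring; a small sketch with hypothetical predictions:

```python
# Map the five traits to three groups: C with A, O with N, E alone.
merge = {"C": "C+A", "A": "C+A", "O": "O+N", "N": "O+N", "E": "E"}

# Hypothetical true and predicted labels for six tweets.
y_true = ["C", "A", "O", "E", "N", "C"]
y_pred = ["A", "C", "N", "E", "O", "E"]

merged_true = [merge[y] for y in y_true]
merged_pred = [merge[y] for y in y_pred]
accuracy = sum(t == p for t, p in zip(merged_true, merged_pred)) / len(y_true)
print(accuracy)  # confusions within a merged group no longer count as errors
```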

Table 9. Comparison of the Experiment's Accuracy Score

k   Accuracy (%)
    Gaussian NB   IndoBERT   Combined Model
1   55            71         75
2   49            64         65
3   63            69         73
4   63            75         76
5   65            73         78
6   59            66         67
7   56            73         72
8   59            74         75
9   56            78         77
10  59            71         75

Table 9 shows that the combined model still provides an uplift in accuracy in almost every iteration when the resembling personalities are merged. The highest uplift in this experiment occurs at k = 5, with an accuracy score of 78%, 5% better than IndoBERT and 13% better than Gaussian Naive Bayes. Based on the overall result in Table 10, the proposed combined model still increases the overall accuracy score for the entire dataset when the resembling personalities are merged. Overall, the accuracy score of the combined model in this experiment is 73.08%, almost 15% higher than Gaussian Naive Bayes and almost 2% higher than IndoBERT.

Table 10. Comparison of the Overall Experiment's Accuracy Score

Model            Accuracy (%)
Gaussian NB      58.45
IndoBERT         71.24
Combined Model   73.08

4. CONCLUSION

The purpose of this research is to combine machine learning with deep learning, namely Gaussian Naive Bayes and IndoBERT, to detect Big Five personalities based on Indonesian tweets from Twitter users. According to the results and discussion, with a total of 3,342 tweets collected from 111 users, the proposed combined model produces an accuracy uplift over the entire dataset and offers better performance than using the Gaussian Naive Bayes or IndoBERT method alone. Combining both models by summing the log probability vectors, with a min-max scaler applied because the probability ranges of the models differ, is the solution that increases model performance. The combined model shows better accuracy than Gaussian Naive Bayes and IndoBERT in almost every iteration, and its overall accuracy score is better for both the five-label and three-label settings, indicating that the combined model successfully improves the personality detection system. For future research, a larger dataset can be used to train the model so that it predicts the Big Five personalities more accurately, and other machine learning and deep learning algorithms can be combined to create an even better personality detection system.

REFERENCES

[1] N. Aiyuda and N. A. Syakarofath, "Presentasi diri di sosial media (Instagram dan Facebook)," PSYCHOPOLYTAN (Jurnal Psikologi), vol. 2, 2019. [Online]. Available: http://jurnal.univrab.ac.id/index.php/psi/article/view/915
[2] M. M. Tadesse, H. Lin, B. Xu, and L. Yang, "Personality Predictions Based on User Behavior on the Facebook Social Media Platform," IEEE Access, vol. 6, pp. 61959–61969, 2018, doi: 10.1109/ACCESS.2018.2876502.
[3] T. Tony and Y. Taufik, "Pengaruh Kepribadian dan Pengalaman Kerja terhadap Kompetensi Kerja Karyawan PT. Era Musika Indah Medan," Lensa Ilmiah: Jurnal Manajemen dan Sumberdaya, vol. 1, no. 1, pp. 88–93, Aug. 2022, doi: 10.54371/jms.v1i1.187.
[4] P. S. Dandannavar, S. R. Mangalwede, and P. M. Kulkarni, "Social Media Text - A Source for Personality Prediction," in 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), Dec. 2018, pp. 62–65, doi: 10.1109/CTEMS.2018.8769304.
[5] M. Villeda and R. McCamey, "Use of Social Networking Sites for Recruiting and Selecting in the Hiring Process," International Business Research, vol. 12, no. 3, p. 66, Feb. 2019, doi: 10.5539/ibr.v12n3p66.
[6] M. Ichsanudin, A. S. Y. Irawan, and A. Solehudin, "Prediksi Kepribadian Berdasarkan Media Sosial Twitter Menggunakan Metode Naive Bayes Classifier," Jurnal Sains Komputer & Informatika (J-SAKTI), vol. 5, no. 2, pp. 988–996, 2021, doi: 10.30645/j-sakti.v5i2.394.
[7] Y. B. N. D. Artissa, I. Asror, and S. A. Faraby, "Personality Classification based on Facebook status text using Multinomial Naive Bayes method," J. Phys. Conf. Ser., vol. 1192, no. 1, p. 012003, Mar. 2019, doi: 10.1088/1742-6596/1192/1/012003.
[8] Yusra, M. Fikry, R. Syarfianto, R. Mai Candra, and E. Budianita, "Klasifikasi Kepribadian Big Five Pengguna Twitter dengan Metode Naive Bayes," Seminar Nasional Teknologi Informasi, Komunikasi dan Industri (SNTIKI-10), Nov. 2018.
[9] A. V. Kunte and S. Panicker, "Using textual data for Personality Prediction: A Machine Learning Approach," in 2019 4th International Conference on Information Systems and Computer Networks (ISCON), Nov. 2019, pp. 529–533, doi: 10.1109/ISCON47742.2019.9036220.
[10] B. Singh and S. Singhal, "Automated Personality Classification Using Data Mining Techniques," SSRN Electronic Journal, 2020, doi: 10.2139/ssrn.3602540.
[11] H. Ahmad, M. U. Asghar, M. Z. Asghar, A. Khan, and A. H. Mosavi, "A Hybrid Deep Learning Technique for Personality Trait Classification From Text," IEEE Access, vol. 9, pp. 146214–146232, 2021, doi: 10.1109/ACCESS.2021.3121791.
[12] G. D. Salsabila and E. B. Setiawan, "Semantic Approach for Big Five Personality Prediction on Twitter," Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), vol. 5, no. 4, pp. 680–687, Aug. 2021, doi: 10.29207/resti.v5i4.3197.
[13] I.-A. Albu and S. Spînu, "Emotion Detection From Tweets Using a BERT and SVM Ensemble Model," U.P.B. Sci. Bull., Series C, vol. 84, no. 1, 2022. [Online]. Available: http://arxiv.org/abs/2208.04547
[14] S. Liu, H. Tao, and S. Feng, "Text Classification Research Based on Bert Model and Bayesian Network," in 2019 Chinese Automation Congress (CAC), Nov. 2019, pp. 5842–5846, doi: 10.1109/CAC48633.2019.8996183.
[15] H. Ning, S. Dhelim, and N. Aung, "PersoNet: Friend Recommendation System Based on Big-Five Personality Traits and Hybrid Filtering," IEEE Trans. Comput. Soc. Syst., vol. 6, no. 3, pp. 394–402, Jun. 2019, doi: 10.1109/TCSS.2019.2903857.
[16] W. Maharani and V. Effendy, "Big five personality prediction based in Indonesian tweets using machine learning methods," International Journal of Electrical and Computer Engineering (IJECE), vol. 12, no. 2, p. 1973, Apr. 2022, doi: 10.11591/ijece.v12i2.pp1973-1981.
[17] R. R. R. Arisandi, B. Warsito, and A. R. Hakim, "Aplikasi Naive Bayes Classifier (NBC) pada Klasifikasi Status Gizi Balita Stunting dengan Pengujian K-Fold Cross Validation," Jurnal Gaussian, vol. 11, no. 1, pp. 130–139, May 2022, doi: 10.14710/j.gauss.v11i1.33991.
[18] S. Prusty, S. Patnaik, and S. K. Dash, "SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer," Frontiers in Nanotechnology, vol. 4, Aug. 2022, doi: 10.3389/fnano.2022.972421.
[19] F. K. Khattak, S. Jeblee, C. Pou-Prom, M. Abdalla, C. Meaney, and F. Rudzicz, "A survey of word embeddings for clinical text," J. Biomed. Inform., vol. 100S, p. 100057, 2019, doi: 10.1016/j.yjbinx.2019.100057.
[20] R. Y. Rumagit and A. S. Girsang, "Predicting personality traits of facebook users using text mining," J. Theor. Appl. Inf. Technol., vol. 96, no. 20, pp. 6877–6888, 2018.
[21] M. B. Ressan and R. F. Hassan, "Naive-Bayes family for sentiment analysis during COVID-19 pandemic and classification tweets," Indonesian Journal of Electrical Engineering and Computer Science, vol. 28, no. 1, p. 375, Oct. 2022, doi: 10.11591/ijeecs.v28.i1.pp375-383.
[22] O. I. Gifari, Muh. Adha, F. Freddy, and F. F. S. Durrand, "Analisis Sentimen Review Film Menggunakan TF-IDF dan Support Vector Machine," Journal of Information Technology, vol. 2, no. 1, pp. 36–40, Mar. 2022, doi: 10.46229/jifotech.v2i1.330.
[23] D. Borkin, A. Némethová, G. Michaľčonok, and K. Maiorov, "Impact of Data Normalization on Classification Model Accuracy," Research Papers Faculty of Materials Science and Technology Slovak University of Technology, vol. 27, no. 45, pp. 79–84, Sep. 2019, doi: 10.2478/rput-2019-0029.
