
Single-Label and Multi-Label Text Classification using ANN and Comparison with Naïve Bayes and SVM

M Mahfi Nurandi Karsana, Kemas Muslim L, Widi Astuti*
Faculty of Informatics, Telkom University, Bandung, Indonesia

Email: 1[email protected], 2[email protected], 3,*[email protected]

Corresponding Author's Email: [email protected]

Abstract−Machine learning has become useful in daily life thanks to improvements in machine learning techniques. Text classification is an important part of machine learning, and many methods are already used for it, such as Artificial Neural Networks (ANN), Naïve Bayes, SVM, and Decision Trees. ANN is a branch of machine learning that approximates the function of natural neural networks and has been used extensively for classification. The ANN architecture used in this research is relatively simple compared to the cutting edge of ANN development and research, which makes it suitable for showing the potential that ANN has compared to other classification methods. The performance of ANN, Naïve Bayes, and SVM is measured using f1-macro on multiple single-label and multi-label datasets. This research found that in single-label classification ANN has a comparable f1-macro of 0.79 against 0.82 for SVM, while in multi-label classification ANN has the best f1-macro of 0.48 against 0.44 for SVM.

Keywords: ANN; F1-Macro; Naive Bayes; Text Classification; SVM

1. INTRODUCTION

Text classification is the process of assigning documents to different categories [1]. It has many applications, including topic classification, sentiment analysis [2], spam detection, and information filtering [3]. In machine learning, the text classification process can be divided into four general stages: feature extraction, dimension reduction, classifier selection, and evaluation [4]. In the feature extraction stage, text data is converted into a form that a computer can process. Many methods have been developed for the classification stage, such as Naïve Bayes, Artificial Neural Networks (ANN), Decision Trees, and Support Vector Machines (SVM).

ANNs mimic the behavior of the living brain to solve problems [5]. The ANN used in this study is relatively simple compared to today's latest ANN technology; by using a simple ANN, we expect to show the potential of this method in the field of machine learning. An ANN mainly consists of three different layers: input, hidden, and output [6].

Naive Bayes is a classification method that assumes the variables are independent; this assumption is the basis of the Naive Bayes classification process [9][10]. Naive Bayes was developed as supervised learning based on the theory of Thomas Bayes [7][8]. For example, let class c in a class set C have a feature t, where t is a subset of the feature set T. The probability of class c given feature t then follows Bayes' theory in equation 1 [11].

Pr(c|t) = Pr(c) Pr(t|c) / Pr(t) (1)

Pr(c) is the prior probability of class c, Pr(t|c) is the probability of feature t in class c, and Pr(t) is the probability of feature t. In this study, the multinomial naive Bayes implementation from the scikit-learn library is used. This simple learning model has been found to perform well in single-label classification.
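As a rough illustration of this setup, a minimal sketch of multinomial naive Bayes classification with scikit-learn might look as follows; the toy documents and labels are invented for illustration and are not from the datasets used in this study.

```python
# Minimal sketch: multinomial naive Bayes text classification with scikit-learn.
# The toy documents and labels below are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap flights to bali", "meeting agenda attached", "win a free prize now"]
labels = ["travel", "work", "spam"]

# Convert raw text into the term-count features from which Pr(t|c) is estimated.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit the multinomial naive Bayes model; alpha is Laplace smoothing.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

# Predict the most probable class for an unseen document.
print(clf.predict(vectorizer.transform(["free bali prize"])))
```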

Besides ANN and Naïve Bayes, this study also uses SVM. SVM is a simple supervised machine learning technique that can be used for classification and regression [12]. It is based on Vapnik's research on support-vector networks. The SVM classification process uses a hyperplane to separate data into two parts [13][10]. Because this model can only carry out binary classification [14], classification with more than two classes uses the One vs. One (OVO) or One vs. Rest (OVR) method.

In the SVM classification process, several factors affect classification performance, such as dimension reduction, the kernel, and the preprocessing carried out [15]. Kernel selection is essential to get optimal model performance and to classify nonlinear data [7]; a good kernel is one that produces good performance without high complexity. In this study, the SVM implementation from the scikit-learn library is used.
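A minimal sketch of how such an SVM classifier might be set up with scikit-learn is shown below; the kernel choice, toy data, and the explicit One vs. Rest wrapper are illustrative assumptions, not the exact configuration used in this study.

```python
# Minimal sketch: SVM text classification with scikit-learn, assuming TF-IDF
# features. SVC handles multi-class input with a one-vs-one scheme internally;
# OneVsRestClassifier is shown as the explicit one-vs-rest alternative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

docs = ["stock markets fell", "the team won the final", "new phone released"]
labels = ["business", "sport", "tech"]  # illustrative placeholder data

X = TfidfVectorizer().fit_transform(docs)

# A linear kernel keeps complexity low, as recommended above; an RBF kernel
# could be swapped in for nonlinear decision boundaries.
ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0))
ovr.fit(X, labels)
```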

There have been many studies on multi-label and multi-class text classification, and many have tried combining techniques to achieve better classification performance. In previous research by Kamran Kowsari in 2019, a survey was conducted on techniques for text classification [4]. This survey explained that general text classification can be broken down into four steps: feature extraction, dimension reduction, classifier selection, and evaluation [4]. Research by T. B. Shahi in 2018 using Naïve Bayes, SVM, and Neural Networks for classifying news from Nepal found that SVM has an f1-score of 74%, Naïve Bayes has an f1-score of 67.2%, and a Neural Network with a Multi-Layer Perceptron architecture has an f1-score of 72.2% [16]. Although English and Nepali differ substantially as languages, the text classification process is not much different.

ANN is an essential technique for classification, pattern recognition, clustering, and prediction tasks on various subjects [6]. The strengths of the ANN machine learning model are its ease of use and its accuracy with multiple inputs. Generally, a neural network consists of three different layers, the input, hidden, and output layers, with the hidden layer being the most significant part. However, designing an ANN requires high expertise. Previous research by Yuliana et al. in 2019 classified Indonesian text documents comprising ten document classes using Neural Networks [17]. That research performed multi-class classification and used accuracy and error rate as the performance metrics.

Previous research by Haryanto et al. in 2018 found an accuracy of 81.08% on text classification with the BBC dataset using the SVM technique. That study also observed an increase in accuracy of around 15% by performing dimension reduction using the chi-square technique [15]. Research by Kim et al. in 2018 on Multinomial Naïve Bayes in text classification found an f1-macro of 0.8745 on the 20 newsgroups dataset [11]. Several problems arise in multi-label text classification, one of which is the uncertainty in the label prediction process [18]. Due to the number of variables that must be predicted, multi-label classification is more complicated than single-label classification. It was also explained that the neural network model is prone to overfitting.

In this study, the research question focuses on measuring the performance of ANN compared to the Multinomial Naïve Bayes and SVM methods. ANN was chosen as a classification method because it has a wide field of use, ranging from classification and regression to data generation. The data used are 20 newsgroups, BBC Full-Text Classification, and the Elsevier OA CC-BY Corpus, with the first two being single-label datasets and the Elsevier OA CC-BY Corpus being a multi-label dataset. The performance of the built models is measured using the f1-score, with k-fold cross-validation to validate the results. Specifically, the f1-macro variant of the F1-Score metric proposed by Rijsbergen is used to compare model performance. The machine learning models used in this study are ANN, Naïve Bayes, and SVM. The research aims to assess the effectiveness of ANN with TF-IDF as a text classification technique. The paper is organized as follows: Abstract, Introduction, Research Methodology, Results and Discussion, and Conclusion.

2. RESEARCH METHODOLOGY

The system built is an ANN model, which will be compared with the Naive Bayes and SVM models. Each model is trained on the same dataset, and performance is then measured using the f1-macro metric. The modeling in the system is described in the system flow chart in Figure 1.

Figure 1. System Flow

The experiment in this study is composed of several steps: preprocessing, feature extraction, training, and evaluation. In preprocessing, the dataset is cleaned through several processes, such as case folding, stopword removal, tokenizing, and stemming. The next step is feature extraction, which extracts features from the previously cleaned dataset. The training step trains the algorithm on the training dataset, a subset of the previously mentioned dataset. The last step evaluates the machine learning algorithm using f1-macro as a metric to measure generalized performance.

2.1. Dataset

The datasets used in this study are two single-label datasets and a multi-label dataset. The first single-label dataset comes from previous research [19] under the name 20 newsgroups; it consists of 18821 rows of data in 20 classes. The second single-label dataset is the BBC dataset from previous research [20], consisting of 2225 rows of data in 5 classes. In both, the text attributes are tightly bound to the class categories. As additional data, the multi-label Elsevier OA CC-BY Corpus is also used [21]; it consists of 40000 rows of data and 27 separate classes, and each row can have more than one label. This multi-label dataset is used to find out how effective the built machine learning models are on a similar but more complex classification, and whether models trained on it have performance comparable to models using the single-label datasets.
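As a small illustration of what multi-label targets look like in practice, the sketch below encodes invented label sets (not the actual Elsevier classes) as a binary indicator matrix with scikit-learn.

```python
# Minimal sketch: encoding multi-label targets as a binary indicator matrix.
# The label sets below are invented examples, not from the Elsevier corpus.
from sklearn.preprocessing import MultiLabelBinarizer

y_raw = [("medicine", "biology"), ("physics",), ("biology", "chemistry")]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_raw)  # shape (n_samples, n_classes), entries 0/1

print(mlb.classes_)  # ['biology' 'chemistry' 'medicine' 'physics']
print(Y)
```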

In the modeling process, the dataset is divided into five folds for the cross-validation process. The data above is divided into three distinct parts: test, validation, and training data.

2.2. Preprocessing

The data used is unstructured text data. At this stage, the data will be cleaned. This process is carried out to produce better performance [22]. The expected result of this process is structured data. Several processes will be carried out at this stage in the form of case folding, stopword removal, tokenizing, and stemming.

a. Case Folding

In this step, the text data is converted into lowercase, because this does not change the meaning of the words. An example can be seen in Table 1.

Table 1. Example of Case Folding

1. Input:  The boys thought it was quite remarkable that the oxen never turned north throughout the days
   Output: the boys thought it was quite remarkable that the oxen never turned north throughout the days
2. Input:  The fir tree has a comforting smell as it sways in the wind
   Output: the fir tree has a comforting smell as it sways in the wind
3. Input:  My house is ablaze because I forgot to turn off the oven
   Output: my house is ablaze because i forgot to turn off the oven
4. Input:  Grandma loves soft bread and tea, though she hates eating meat given her age
   Output: grandma loves soft bread and tea, though she hates eating meat given her age
5. Input:  The three-legged goat running across the irradiated field
   Output: the three-legged goat running across the irradiated field

b. Stopword Removal

This step removes stopwords from the input sentence. Examples of stopwords are from, and, too, and is. An example of this process can be seen in Table 2.

Table 2. Example of Stopword Removal

1. Input:  the boys thought it was quite remarkable that the oxen never turned north throughout the days
   Output: boys thought quite remarkable oxen never turned north throughout days
2. Input:  the fir tree has a comforting smell as it sways in the wind
   Output: fir tree comforting smell sways wind
3. Input:  my house is ablaze because i forgot to turn off the oven
   Output: house ablaze forgot turn oven
4. Input:  grandma loves soft bread and tea, though she hates eating meat given her age
   Output: grandma loves soft bread tea, though hates eating meat given age
5. Input:  the three-legged goat running across the irradiated field
   Output: three-legged goat running across irradiated field

c. Tokenizing

This step separates the sentence input into a collection of independent tokens [26]. An example can be seen in Table 3.

Table 3. Example of Tokenizing

1. Input:  boys thought quite remarkable oxen never turned north throughout days
   Output: 'boys', 'thought', 'quite', 'remarkable', 'oxen', 'never', 'turned', 'north', 'throughout', 'days'
2. Input:  fir tree comforting smell sways wind
   Output: 'fir', 'tree', 'comforting', 'smell', 'sways', 'wind'
3. Input:  house ablaze forgot turn oven
   Output: 'house', 'ablaze', 'forgot', 'turn', 'oven'
4. Input:  grandma loves soft bread tea, though hates eating meat given age
   Output: 'grandma', 'loves', 'soft', 'bread', 'tea', 'though', 'hates', 'eating', 'meat', 'given', 'age'
5. Input:  three-legged goat running across irradiated field
   Output: 'three', 'legged', 'goat', 'running', 'across', 'irradiated', 'field'

d. Stemming

This step reduces each word to a uniform stem form. An example can be seen in Table 4.

Table 4. Example of Stemming

1. Input:  'boys', 'thought', 'quite', 'remarkable', 'oxen', 'never', 'turned', 'north', 'throughout', 'days'
   Output: 'boy', 'thought', 'quit', 'remark', 'oxen', 'never', 'turn', 'north', 'throughout', 'day'
2. Input:  'fir', 'tree', 'comforting', 'smell', 'sways', 'wind'
   Output: 'fir', 'tree', 'comfort', 'smell', 'sway', 'wind'
3. Input:  'house', 'ablaze', 'forgot', 'turn', 'oven'
   Output: 'hous', 'ablaz', 'forgot', 'turn', 'oven'
4. Input:  'grandma', 'loves', 'soft', 'bread', 'tea', 'though', 'hates', 'eating', 'meat', 'given', 'age'
   Output: 'grandma', 'love', 'soft', 'bread', 'tea', 'though', 'hate', 'eat', 'meat', 'given', 'age'
5. Input:  'three', 'legged', 'goat', 'running', 'across', 'irradiated', 'field'
   Output: 'three', 'leg', 'goat', 'run', 'across', 'irradi', 'field'

2.3. Feature Extraction

The method used for feature extraction in this experiment is Term Frequency Inverse Document Frequency (TF-IDF), one of the schemes used in the feature extraction process [3]. TF-IDF measures a word's occurrence in a specific document against that word's occurrence in all existing documents, producing the relevance of the term across all documents used. The weakness of this approach is that the resulting feature size is comparable to the number of unique terms in the corpus.
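A minimal sketch of TF-IDF extraction with the scikit-learn implementation used later in the experiment; the two toy documents are taken from the preprocessing examples above.

```python
# Minimal sketch: TF-IDF feature extraction with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the fir tree has a comforting smell",
    "the boys thought it was quite remarkable",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x unique terms

# The feature dimension equals the vocabulary size, the weakness noted above.
print(len(vectorizer.get_feature_names_out()), "features")
```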

2.4. Model Performance Measurement

The evaluation carried out to measure the performance and accuracy of the system uses the f1-score, which balances recall and precision.

Table 5. Confusion Matrix

               Predicted True   Predicted False
Actual True    TP               FN
Actual False   FP               TN

Information :

True Positive (TP) : Predictions and actual data are positive.

False Negative (FN) : Predictions are negative, while real data is positive.

False Positive (FP) : Predictions are positive, while real data is negative.

True Negative (TN) : Predictions and real data are negative.
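As a small illustration, the four cells of Table 5 can be obtained directly with scikit-learn; the true and predicted labels below are invented for the example.

```python
# Minimal sketch: building a binary confusion matrix with scikit-learn.
# The true/predicted labels below are invented for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0]  # 1 = actual true, 0 = actual false
y_pred = [1, 0, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]; ravel() flattens it.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=2 FN=1 FP=1 TN=2
```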

a. Accuracy

Accuracy is the ratio of predictions that match the actual data to all predictions made, as in equation 2:

accuracy = (TP + TN) / (TP + FN + FP + TN) (2)

b. Precision

Precision is the ratio of correct positive predictions to all positive predictions made, as in equation 3:

precision = TP / (TP + FP) (3)

c. Recall

Recall is the ratio of true positives (TP) to the sum of true positives and false negatives (FN), i.e. the fraction of actual positives correctly identified, as in equation 4:

recall = TP / (TP + FN) (4)

d. F1-Score

The f1-score is the harmonic mean of precision and recall [23], as in equation 5:

f1 score = 2 ∙ precision ∙ recall / (precision + recall) (5)

In multiclass classification, the f1-score is averaged over classes. One method to obtain the average f1-score is f1-macro, as in equation 6.

f1 macro = Σ f1 score / number of classes (6)
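A minimal sketch of equations 5 and 6 computed with scikit-learn, using invented labels; f1_score with average="macro" performs exactly this unweighted per-class averaging.

```python
# Minimal sketch: per-class f1 scores (equation 5) averaged into f1-macro
# (equation 6) with scikit-learn; the labels are invented for illustration.
from sklearn.metrics import f1_score

y_true = ["sport", "tech", "sport", "business", "tech", "business"]
y_pred = ["sport", "tech", "tech", "business", "tech", "sport"]

per_class = f1_score(y_true, y_pred, average=None)  # one f1 per class
macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean

print(per_class, macro)
```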

3. RESULT AND DISCUSSION

3.1 Experiment Scenario

The artificial neural network model that is built has a simple architecture consisting of two types of layers, namely Dense and BatchNormalization. The neural network model is trained on the same training datasets used for the Naive Bayes and SVM models. The experiment has several steps: first the dataset is cleaned, then features are extracted using TF-IDF, and then the dataset is split using k-fold cross-validation to generalize the data used in training and testing the machine learning models. The performance of each machine learning model is measured using f1-macro as the metric.
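The paper does not print the exact layer sizes, so the sketch below only illustrates the kind of architecture described, Dense plus BatchNormalization layers in Keras; the dimensions and hyperparameters are assumptions, not values reported in this study.

```python
# Minimal sketch of the kind of architecture described: Dense and
# BatchNormalization layers only. Layer sizes and the input dimension
# are assumptions, not values reported in the paper.
from tensorflow import keras
from tensorflow.keras import layers

num_features = 20000   # assumed TF-IDF vocabulary size
num_classes = 20       # e.g. the 20 newsgroups classes

model = keras.Sequential([
    keras.Input(shape=(num_features,)),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    # softmax for single-label; sigmoid (with binary_crossentropy) for multi-label
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```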

The first step in this experiment is to clean the dataset through the previously stated processes, in this order: case folding, stopword removal, tokenizing, and stemming. The data is cleaned to remove noise and useless words from the dataset [24]. The first cleaning process is case folding, in which the data is converted to lowercase, removing uppercase characters, which does not change the meaning of the words. Then stopwords such as the, from, too, and is are removed in the stopword removal step; this process simply matches a list of stopwords against the data, and when a match is found, the word is removed from the dataset. In tokenizing, the data is split into separate words. The NLTK library is used for the steps above.

The last step in cleaning is stemming, used to reduce words in the dataset to their word stem [24]. A stemmer without a lexicon is prone to stemming errors, since it does not understand the context of a word [24]: homonyms, words that are spelled the same but have different meanings, are reduced to the same word stem. Stemming in this experiment uses the Porter stemmer, a suffix-removal algorithm [24] based on the observation that suffixes in the English language are composed of other, simpler suffixes. The Porter stemmer, created in 1980 by Martin Porter, has become the standard approach to stemming [24]. The implementation used in this experiment is from the NLTK library, the same library used for cleaning the dataset.
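Putting the four cleaning steps together with NLTK as described, a minimal sketch might look as follows; for simplicity it tokenizes before matching stopwords, and the example sentence is taken from Table 1.

```python
# Minimal sketch of the cleaning pipeline described above: case folding,
# stopword removal, tokenizing, and Porter stemming with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    tokens = word_tokenize(text.lower())                 # case folding + tokenizing
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation tokens
    tokens = [t for t in tokens if t not in stop_words]  # stopword removal
    return [stemmer.stem(t) for t in tokens]             # stemming

print(clean("The fir tree has a comforting smell as it sways in the wind"))
# -> ['fir', 'tree', 'comfort', 'smell', 'sway', 'wind'], as in Table 4
```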

The next step is to extract the features present in the text dataset. This step converts each word into a numerical weight using Term Frequency Inverse Document Frequency (TF-IDF). TF-IDF measures a word's occurrence in a document against that word's occurrence in all existing documents to produce the relevance of the term to all documents used [3]. TF-IDF consists of two parts: term frequency, the frequency of a term in a document, and inverse document frequency, the inverse of the frequency of the term across all documents. In this experiment, the TF-IDF implementation from the scikit-learn library, written in the Python programming language, is used.

After the data is cleaned and its features extracted, the dataset is split into three parts: train, validation, and test. To ensure that noise does not distort the experiment results, k-fold cross-validation is used; it determines the robustness of the machine learning model and its classification success on new or novel data [25]. This is achieved by splitting the dataset at different points so that the train, validation, and test parts of each fold contain different data; each fold therefore has different data to train on and different unseen data each time. By shuffling the subsets of the dataset using k-fold cross-validation and averaging the model test results, model performance is averaged over all seen and unseen subsets of the dataset. This process gives a better view of model performance and removes randomness between different machine learning algorithm runs. F1-macro is used as the metric for measuring the machine learning models in this experiment. The cross-validation results are first averaged to get an average confusion matrix; the f1 score for each class is then calculated independently using equation 5, and the class f1 scores are averaged over the number of classes to get the f1-macro for the whole dataset, as in equation 6. Precision and recall indirectly show how many false positives and false negatives are in the classification result.
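A minimal sketch of the cross-validation loop described above, using scikit-learn; for simplicity it averages per-fold f1-macro scores rather than averaging confusion matrices first, and LinearSVC stands in for any of the three models.

```python
# Minimal sketch: 5-fold cross-validation with f1-macro, mirroring the
# procedure described above. X is assumed to be the TF-IDF feature matrix
# and y the label array; LinearSVC is a stand-in classifier.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

def cross_validated_f1_macro(X, y, n_splits=5):
    y = np.asarray(y)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = LinearSVC()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], pred, average="macro"))
    return np.mean(scores)  # performance averaged over the five folds
```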

3.2 Test Results Analysis

The results of the experiment are in Table 6, which shows precision, recall, and f1-macro. Naive Bayes and SVM use the same datasets as the ANN algorithm: 20 newsgroups and BBC for single-label classification and the Elsevier OA CC-By Corpus for multi-label classification. While f1-macro shows the overall performance of a machine learning model, precision and recall, based on equations 3 and 4, focus on true positives while accounting for false positives and false negatives respectively. Depending on the task, precision or recall might be the more important metric, since false positives or false negatives can be critical in a specific task. However, a generalized performance metric is needed here to analyze the differences between ANN, Naïve Bayes, and SVM.

Table 6. Experiment Result

              20newsgroup                   BBC                           Elsevier OA CC-By Corpus
              Precision  Recall  F1-Macro   Precision  Recall  F1-Macro   Precision  Recall  F1-Macro
ANN           0.79       0.78    0.79       0.96       0.96    0.96       0.55       0.42    0.48
Naïve Bayes   0.80       0.78    0.78       0.97       0.97    0.97       0.49       0.29    0.34
SVM           0.83       0.82    0.82       0.97       0.97    0.97       0.33       0.72    0.44

The test results in Table 6 show that in single-label classification SVM, Naïve Bayes, and ANN do not differ significantly on any metric. In particular, precision and recall are close to each other for all machine learning algorithms, indicating that the rates of false positives and false negatives are similar. For single-label classification, the SVM algorithm performs somewhat better than the ANN algorithm on the overall metric.

Considering the difficulty of multi-label classification, as stated in the introduction, the results were expected to be lower than for single-label classification. However, the performance drop in multi-label classification is significant even compared to a comparably sized single-label dataset. The multi-label dataset used is the Elsevier OA CC-by corpus, with around 40 thousand rows of data, while the largest single-label dataset in this experiment, 20newsgroup, consists of around 18 thousand rows. The experiment shows some performance drop when training on BBC, the smaller single-label dataset used. Even allowing for the difficulty of multi-label classification, the multi-label results are significantly worse than the single-label results. The SVM model in multi-label classification has a significantly better recall than precision and performs much worse than the same SVM algorithm in single-label classification, while the Naïve Bayes model in multi-label classification is better at precision. The ANN model has the smallest difference between precision and recall compared to the Naïve Bayes and SVM models. Overall, ANN has the best performance on multi-label classification.

4. CONCLUSION

Based on the experiments using k-fold cross-validation, text classification with ANN produces classification performance comparable to Naive Bayes and SVM on single-label data and better performance on multi-label data. In single-label classification on the 20 newsgroups dataset, similar f1-macro results are observed for ANN, Naive Bayes, and SVM, with SVM having the best f1-macro score of 0.82 and ANN the second best with 0.79. The smaller BBC dataset, a single-label dataset with a significantly smaller data size than 20newsgroups at 2225 rows over five classes, shows a considerable rise in f1-macro: Naive Bayes and SVM reach a similar score of 0.97, with ANN just below at 0.96. In multi-label text classification, given the difficulty of the task, f1-macro shows a significant drop for all machine learning algorithms tested. Using the same k-fold cross-validation method as in single-label classification, ANN shows a better result with an f1-macro of 0.48, compared to 0.44 for SVM and 0.34 for Naive Bayes. On multi-label classification, further observations on SVM's tendencies can be made from its precision and recall scores: SVM has a very high recall of 0.72 against a low precision of 0.33, showing that SVM produces few false negatives but many false positives in multi-label text classification. Naive Bayes and ANN show more balanced scores, with differences between precision and recall of 0.20 and 0.13 respectively.

REFERENCES

[1] M. M. Mirończuk and J. Protasiewicz, “A recent overview of the state-of-the-art elements of text classification,” Expert Syst Appl, vol. 106, pp. 36–54, 2018.

[2] J. Zheng and L. Zheng, “A Hybrid Bidirectional Recurrent Convolutional Neural Network Attention-Based Model for Text Classification,” IEEE Access, vol. 7, pp. 106673–106685, 2019, doi: 10.1109/ACCESS.2019.2932619.

[3] R. Dzisevič and D. Šešok, “Text classification using different feature extraction approaches,” in 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), 2019, pp. 1–4.

[4] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, “Text classification algorithms: A survey,” Information, vol. 10, no. 4, p. 150, 2019.

[5] A. Elnagar, R. Al-Debsi, and O. Einea, “Arabic text classification using deep learning models,” Inf Process Manag, vol. 57, no. 1, p. 102121, 2020.

[6] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, N. A. Mohamed, and H. Arshad, “State-of-the-art in artificial neural network applications: A survey,” Heliyon, vol. 4, no. 11, p. e00938, 2018.

[7] Q. Li et al., “A survey on text classification: From shallow to deep learning,” arXiv preprint arXiv:2008.00364, 2020.

[8] M. A. Ahmed, R. A. Hasan, A. H. Ali, and M. A. Mohammed, “The classification of the modern arabic poetry using machine learning,” TELKOMNIKA (Telecommunication Computing Electronics and Control), vol. 17, no. 5, pp. 2667–2674, 2019.

[9] J. Kolluri and S. Razia, “Text classification using Naïve Bayes classifier,” Mater Today Proc, 2020.

[10] A. I. Kadhim, “Survey on supervised machine learning techniques for automatic text classification,” Artif Intell Rev, vol. 52, no. 1, pp. 273–292, 2019.

[11] H. Kim, J. Kim, J. Kim, and P. Lim, “Towards perfect text classification with Wikipedia-based semantic Naive Bayes learning,” Neurocomputing, vol. 315, pp. 128–134, 2018.

[12] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, “Text classification algorithms: A survey,” Information, vol. 10, no. 4, p. 150, 2019.

[13] X. Luo, “Efficient english text classification using selected machine learning techniques,” Alexandria Engineering Journal, vol. 60, no. 3, pp. 3401–3409, 2021.

[14] A. Géron, Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc., 2017.

[15] A. W. Haryanto, E. K. Mawardi, and others, “Influence of word normalization and chi-squared feature selection on support vector machine (svm) text classification,” in 2018 International Seminar on Application for Technology of Information and Communication, 2018, pp. 229–233.

[16] T. B. Shahi and A. K. Pant, “Nepali news classification using Naïve Bayes, support vector machines and neural networks,” in 2018 International Conference on Communication Information and Computing Technology (ICCICT), 2018, pp. 1–5.

[17] D. Yuliana and C. Supriyanto, “Klasifikasi Teks Pengaduan Masyarakat Dengan Menggunakan Algoritma Neural Network,” vol. 5, no. 3, pp. 92–116, 2019, doi: 10.29165/komtekinfo.v5i2.

[18] W. Chen, B. Zhang, and M. Lu, “Uncertainty quantification for multilabel text classification,” Wiley Interdiscip Rev Data Min Knowl Discov, vol. 10, no. 6, p. e1384, 2020.

[19] A. M. de J. C. Cachopo and others, “Improving methods for single-label text categorization,” Instituto Superior Técnico, Portugal, 2007.

[20] D. Greene and P. Cunningham, “Practical solutions to the problem of diagonal dominance in kernel document clustering,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 377–384.

[21] D. Kershaw and R. Koeling, “Elsevier OA CC-By Corpus,” CoRR, vol. abs/2008.00774, 2020. [Online]. Available: https://arxiv.org/abs/2008.00774

[22] G. Singh, B. Kumar, L. Gaur, and A. Tyagi, “Comparison between multinomial and Bernoulli naïve Bayes for text classification,” in 2019 International Conference on Automation, Computational and Technology Management (ICACTM), 2019, pp. 593–596.

[23] D. Chicco and G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics, vol. 21, pp. 1–13, 2020.

[24] M. E. Polus and T. Abbas, “Development for performance of Porter Stemmer algorithm,” Eastern-European Journal of Enterprise Technologies, vol. 1, no. 2, p. 109, 2021.

[25] B. G. Marcot and A. M. Hanea, “What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?,” Comput Stat, vol. 36, no. 3, pp. 2009–2031, 2021.
