Handling Imbalance Dataset on Hoax Indonesian Political News Classification using IndoBERT and Random Sampling
Muhammad Ammar Fathin*, Yuliant Sibaroni, Sri Suryani Prasetyowati School of Computing, Informatics, Telkom University, Bandung, Indonesia
Email: 1,*[email protected], 2[email protected], 3[email protected] Correspondence Author Email: [email protected]
Abstract−The rapid adoption of the internet in Indonesia, with over 200 million active users as of January 2022, has dramatically transformed information dissemination, particularly through social media and online platforms. These platforms, while democratizing information sharing, have also become hotbeds for the spread of misinformation and hoaxes, significantly impacting the political landscape, as seen in the Jakarta gubernatorial election from late 2016 to April 2017. Research by the Indonesian Telematics Society (MASTEL) revealed a high prevalence of hoax content, predominantly socio-political, underscoring the critical need to address this challenge. This research addresses the detection of hoaxes in Indonesian political news, focusing on the classification of news as factual or hoax in the presence of class imbalance. The dataset exhibits a significant class imbalance, with 6,947 articles labeled as hoaxes and 20,945 as non-hoaxes. Using the IndoBERT model, a variant of the BERT framework pre-trained on the Indonesian language, the study assesses its effectiveness in distinguishing factual from hoax news. This involves fine-tuning IndoBERT for the text classification task and exploring the impact of resampling techniques, namely Random Over Sampling (ROS) and Random Under Sampling (RUS), to address the class imbalance. The findings show that IndoBERT achieves consistently high accuracy across the resampling methods, with the best result obtained with RUS: 98.2% accuracy, 97.8% recall, 97.5% F1-score, and 97.2% precision. This is particularly relevant for tasks like misinformation detection, where data imbalance is common. The success of IndoBERT, a language-specific BERT model, in text classification for the Indonesian language contributes to the understanding of BERT-based models in diverse linguistic contexts.
Keywords: Hoax Detection; IndoBERT; Imbalanced Data; Political News; BERT
1. INTRODUCTION
The internet's influence on daily life has become undeniable, especially in Indonesia, where its rapid adoption has resulted in significant shifts in information dissemination [1]. Indonesia has one of the highest numbers of internet users in the world, with over 200 million active users as of January 2022 [2]. The number of internet users in Indonesia has increased significantly in recent years, due to rapid development of technology, digital infrastructure, and services [3]. The internet has evolved into an incredibly swift and convenient medium for accessing information. In the digital era, the proliferation of social media and online platforms has significantly transformed the landscape of information dissemination, creating a double-edged sword. On one hand, these platforms have democratized information sharing, empowering individuals to voice their opinions and access a wealth of knowledge. On the other hand, they have also become fertile grounds for the spread of misinformation and hoaxes, particularly in politically charged environments. The Indonesian political scene has not been immune to this phenomenon, especially during critical periods such as elections [4], where the rapid spread of hoaxes can have tangible impacts on democratic processes and public opinion.
In Indonesia, the impact of hoaxes has been particularly notable in the political domain. The Jakarta gubernatorial election from late 2016 to April 2017 exemplifies this, with hoaxes being strategically used to undermine candidates. Research by the Indonesian Telematics Society (MASTEL) ten days prior to the first election round found that 44.3% of respondents received hoax content daily, and 17.2% multiple times a day.
Significantly, 91.8% of the hoax content involved socio-political issues, and 88.6% contained racial or ethnic provocations [4]. These findings underscore the urgency to combat the spread of such deceptive information.
Numerous techniques have been explored for identifying hoax news, including advanced machine learning techniques such as Recurrent Neural Networks (RNNs) [5], Deep Learning [6], and increasingly popular methods like transformers [7]. A significant advancement within transformer technology, particularly for the Indonesian language, is the development of IndoBERT (Indonesia Bidirectional Encoder Representations from Transformers). This pre-trained model is specifically tailored to understand and process the Indonesian language. Unlike earlier approaches such as RNNs and generic Deep Learning methods, IndoBERT builds on the Transformer architecture, which processes text bidirectionally, considering all words simultaneously [8]. This capability is essential for accurately classifying news as factual or hoax by capturing the nuances and context within Indonesian articles.
Recent studies have highlighted the effectiveness of using BERT-based models for various classification tasks in the Indonesian context. Fakhruzzaman et al. (2021) demonstrated the successful application of M-BERT for clickbait headline detection in Indonesian news sites, achieving an accuracy of 0.914, an F1-score of 0.91, and a precision of 0.916 [9]. Similarly, Faisal and Mahendra (2022) utilized IndoBERT for COVID-19 misinformation detection in Indonesian tweets, proposing a two-stage classifier model that outperformed other machine learning models [10]. Sinapoy et al. [11] compared the deep learning models LSTM and IndoBERT for detecting hoaxes on Twitter. The experimental results, based on 10-fold cross-validation, showed that the IndoBERT model achieved an average accuracy of 92.07%, while the LSTM model achieved an average accuracy of 87.54%, indicating that IndoBERT is the more effective model for hoax detection.
However, these studies applied deep learning to balanced data, whereas in real-world scenarios, datasets used for text classification tend to be imbalanced [12]. An imbalanced dataset is one in which the number of observations for one label is substantially higher than for the other labels. This can lead to problems in classification such as overfitting, where the model learns the patterns of the majority class too well and neglects the minority class. As a result, the model may not generalize well to new, unseen data and may be biased towards predicting the majority class, leading to a high rate of false negatives for the minority class [13]. Therefore, addressing imbalanced datasets is crucial for achieving more accurate predictions and avoiding overfitting in deep learning models. Text classification on imbalanced data has been addressed in several studies [14]–[17]. These studies employed resampling techniques such as SMOTE, random oversampling, and random undersampling to handle dataset imbalance, together with various classification algorithms and feature representations. Their findings indicate that resampling techniques are effective in improving the performance and accuracy of classification models on imbalanced data.
This study concentrates on the classification problem, aiming to categorize news articles as factual or hoax. Specifically, it focuses on detecting hoax news in Indonesian political news using IndoBERT, a model designed to process sequential input data such as natural language [11]. Such models are widely used in text classification and have shown good performance, especially in hoax detection. The dataset used in this study is notably imbalanced, containing 6,947 articles labeled as hoaxes and 20,945 labeled as non-hoaxes. This disparity can bias the model's performance towards the majority class. To address the imbalance, the study applies random sampling techniques (random oversampling and random undersampling), which yield a more balanced dataset, a prerequisite for building a reliable and unbiased classifier. The paper details experiments conducted on hoax classification models and evaluates the outcomes to identify the most effective strategy for addressing class imbalance.
2. RESEARCH METHODOLOGY
2.1 System Design
Figure 1 illustrates the system design. The process starts from an initial dataset with an imbalanced class distribution. The data first undergoes a series of preprocessing steps to ensure it is clean and structured appropriately for subsequent analysis. The dataset is then divided into training, validation, and test subsets. The training data is subjected to resampling techniques to mitigate the initial imbalance and improve the model's ability to generalize, and is then used to train the IndoBERT model, a variant of the BERT architecture pre-trained on Indonesian text. The validation data is used to tune the model during training, while the test data provides an unbiased evaluation of the model's performance. Finally, the trained IndoBERT model is evaluated to quantify its predictive capability.
Figure 1. Research Flow
2.2 Dataset
We utilized the “Indonesian Fact and Hoax Political News” dataset (accessed on 18 September 2023), which is publicly available on Kaggle. This dataset comprises a curated collection of political news articles in Bahasa Indonesia, sourced from reputable news outlets such as CNN Indonesia, Kompas, and Tempo, as well as fact-checks from Turnbackhoax. It is designed to facilitate research on misinformation, specifically the development of models that can distinguish between factual reporting and fabricated stories within the Indonesian political news landscape. The dataset includes metadata such as the news source, publication date, and labels indicating whether the article is factual or a hoax, providing a valuable resource for computational linguistics and fake news detection algorithms.
Table 1. Example of Indonesian political news dataset
Number | Text | Label
1 | Edy Soal Pilgub Sumut : Kalau yang Maju Abal - abal , Terpaksa Saya Maju Gubernur Sumatera Utara Edy Rahmayadi membuka kemungkinan untuk kembali maju di Pilkada 2024 mendatang | 0 FACT (validated by Kompas)
2 | PKB Bakal Daftarkan Menaker Ida Fauziyah Jadi Caleg DPR di Pemilu 2024 Partai Kebangkitan Bangsa ( PKB ) bakal mengusung Menteri Ketenagakerjaan Ida Fauziyah sebagai calon anggota legislatif ( Caleg ) di Pemilu 2024 mendatang | 0 FACT (validated by Kompas)
3 | Nenek lampir pemimpin partai banteng bercula satu lagi main slot bersama anaknya kang matiin mic Kapan lu tobat nek Semoga om ganjar keluar dari partainya nenek lampir partai banteng bercula satu # Cuma orang goblok yang mau jadi babunya nenek lampir partai banteng bercula satu | 1 HOAX (validated by Turnbackhoax)
4 | Beredar sebuah unggahan di Facebook dengan narasi yang menyebut bahwa Ketua Umum PDI Perjuangan, Megawati Soekarnoputri, dipanggil Bawaslu. Dalam narasi pada thumbnail video yang disematkan, menyebut karena panggilan tersebut buntut dari ucapannya yang mengakibatkan Bawaslu mem-blacklist PDIP dari peserta Pilpres sehingga tidak bisa usung Capres. | 1 HOAX (validated by Turnbackhoax)
In the “Indonesian Political News” dataset, data labeling uses a binary scheme in which ‘1’ indicates hoax content and ‘0’ indicates factual content. This binary labeling simplifies the classification process for machine learning algorithms, which can then be trained to identify patterns associated with each category. A label of ‘1’ is assigned to articles that have been verified by Turnbackhoax as containing misinformation, misleading narratives, or complete fabrications. In contrast, a label of ‘0’ confirms that the article adheres to factual reporting, corroborated by credible news sources such as CNN Indonesia, Kompas, and Tempo.
2.2 Preprocessing
Before text classification, several preprocessing steps are typically carried out to prepare the text data for analysis.
These steps can include:
a. Case Folding / Lower Casing.
This process involves converting all letters in a text to lowercase. It is done to ensure uniformity in the text analysis, as it treats uppercase and lowercase letters as the same, preventing case-related discrepancies.
b. Remove numbers, certain symbols (#@$%&), URLs (www.google.com), and character repetitions from sentences.
This step aims to clean the text by eliminating numerical digits, specific symbols, URLs, and repetitive characters. The removal of such elements helps focus the analysis on the meaningful content of the text.
c. Remove punctuation.
Punctuation removal involves getting rid of punctuation marks such as commas, periods, and exclamation points from the text. This is often done to simplify the text and facilitate subsequent processing or analysis.
d. Remove duplicate news articles.
This involves identifying and removing duplicate articles or news items from the dataset. Duplicate removal ensures that the analysis is based on unique content, avoiding redundancy in the information.
e. Remove whitespace.
This step strips stray whitespace characters such as tabs, line breaks, and leading or trailing spaces from the text. This is done to standardize the text format and make it more manageable for further processing.
f. Remove excess spaces.
This step focuses on eliminating extra spaces between words or sentences. It contributes to text normalization and ensures a consistent and clean appearance of the text data.
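The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function names `clean_text` and `preprocess`, the exact regular expressions, and the choice to collapse runs of three or more identical characters into one are all assumptions made for the example.

```python
import re


def clean_text(text: str) -> str:
    """Apply steps a, b, c, e and f to a single article (illustrative rules)."""
    text = text.lower()                                 # a. case folding
    text = re.sub(r"(https?://|www\.)\S+", " ", text)   # b. URLs
    text = re.sub(r"\d+", " ", text)                    # b. numbers
    text = re.sub(r"(.)\1{2,}", r"\1", text)            # b. runs of 3+ identical chars -> 1 (assumed rule)
    text = re.sub(r"[^\w\s]", " ", text)                # b/c. symbols and punctuation
    text = text.replace("_", " ")                       # underscore counts as \w, so drop it explicitly
    text = re.sub(r"\s+", " ", text).strip()            # e/f. whitespace and excess spaces
    return text


def preprocess(articles):
    """Clean every article and drop duplicates (step d), keeping first occurrences."""
    seen = set()
    cleaned = []
    for article in articles:
        text = clean_text(article)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Duplicates are detected after cleaning here, so two articles that differ only in case or punctuation count as the same item; whether the original study deduplicated before or after cleaning is not stated.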
In text classification tasks, BERT (Bidirectional Encoder Representations from Transformers) does not require the traditional preprocessing steps of stopword removal and stemming due to its inherent design and working mechanism. Unlike traditional machine learning models that rely on preprocessed input to reduce dimensionality and noise, BERT learns contextual relationships between words in a sentence by considering the full context of words in both directions (bidirectionally). The model is pretrained on a large corpus of text, enabling it to understand nuanced meanings and relationships between words, effectively making the removal of stopwords and stemming redundant. Stopwords, often considered noise in simpler models, contribute to the semantic meaning in BERT's context, providing valuable context clues. Similarly, stemming, which reduces words to their root forms, is unnecessary as BERT can understand different word forms and their nuanced meanings. Therefore, retaining the full text, including stopwords and word forms, allows BERT to leverage its deep contextual learning, leading to a more nuanced and accurate understanding of the text, which is critical for complex tasks like classification.
2.3 Dataset Splitting
Dataset splitting is a technique to evaluate a model's performance by dividing the data into training, validation, and testing sets. This study uses a 70% train, 20% validation, and 10% test split. The training data, which comprises 70% of the dataset, is utilized to build the model by learning from the features and labels—it's the core of the model's learning process. The validation data, accounting for 20%, is employed periodically during the model's training phase to tune hyperparameters and help prevent the model from overfitting, ensuring it can generalize well to new data. Overfitting occurs when a model learns the training data too well, capturing noise along with the underlying patterns, which can negatively impact its performance on unseen data. The validation set helps to mitigate this by providing a separate data pool that can be used to validate the model's decisions during training.
Lastly, the testing data, which is 10% of the dataset, is used after the model has been trained and validated. This final evaluation phase is critical because it provides an unbiased assessment of the model's performance, simulating how it would perform in a real-world scenario where it encounters data it has not been exposed to before. This separation of data into distinct sets enables a thorough evaluation of the model's predictive power and robustness.
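A 70/20/10 split of this kind can be sketched as follows. The paper does not say whether the split was stratified by label or which random seed was used, so the stratification, the `seed` value, and the function name `stratified_split` are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.2, 0.1), seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_train = round(len(idxs) * fractions[0])
        n_val = round(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]      # remainder goes to the test set
    return train, val, test
```

Splitting each class separately keeps the hoax-to-fact ratio roughly the same in all three subsets, which matters when the classes are as unevenly sized as they are here.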
Accompanying this explanation, Table 2 presents numerical data on how the dataset is distributed across different phases of model training and validation. Specifically, it compares the number of data points before and after applying Random Under Sampling (RUS) and Random Over Sampling (ROS), which are techniques used to balance the class distribution in the dataset. The table provides a clear breakdown: the "Source Dataset" row shows the initial distribution, the "Dataset after RUS" row indicates the reduced dataset size to counteract class imbalance, and the "Dataset after ROS" row reflects the increased size of the underrepresented class to achieve balance.
Table 2. Table of Data Splitting
Dataset | Data train (1) | Data train (0) | Data validation (1) | Data validation (0) | Data test (1) | Data test (0)
Source Dataset | 5197 | 16756 | 1750 | 4189 | 1300 | 4189
Dataset after RUS | 5197 | 5197 | 1040 | 1040 | 1300 | 4189
Dataset after ROS | 16756 | 16756 | 3352 | 3352 | 1300 | 4189
2.4 BERT (Bidirectional Encoder Representations from Transformers)
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based language representation model for natural language processing developed by Google. It is designed to capture the context and nuances of words in text. It stands out for its bidirectional approach, which lets the model consider the context on both sides of a token rather than on one side only [18]. BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on left and right context in all layers. With only slight modifications, such as an additional output layer, the pre-trained model can be adapted to a variety of tasks. BERT is conceptually simple and empirically powerful: it obtained state-of-the-art results on eleven natural language processing tasks, including a GLUE score of 80.5%, MultiNLI accuracy of 86.7%, a SQuAD v1.1 test F1 of 93.2, and a SQuAD v2.0 test F1 of 83.1 [18]. The BERT architecture is depicted in Figure 2.
Figure 2. BERT Architecture [18]
During the pre-training phase, BERT employs two unsupervised tasks, outlined in Figure 2. The first task, Masked Language Model (MLM), has the model predict masked words from their surrounding context. In [18], 15% of the WordPiece tokens in each sequence are selected at random, and the model is trained to predict the original tokens at these positions. A drawback is the mismatch this creates between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, only 80% of the selected tokens are replaced with [MASK], 10% are replaced with random tokens, and the remaining 10% are left unchanged [18].
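The masking rule can be illustrated with a short Python sketch. It is a simplification under stated assumptions: each position is selected independently with probability 0.15 (the original implementation samples a fixed share of positions per sequence), and the function name `mask_tokens` and the toy vocabulary argument are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, targets); targets[i] holds the original token at
    selected positions and None everywhere else."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                           # position not selected for prediction
        targets[i] = token                     # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets
```

The prediction loss is computed only at positions where `targets` is not None, so the unchanged and randomly replaced tokens still contribute to training even though they do not show the [MASK] symbol.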
The second task is Next Sentence Prediction (NSP), where the model takes a pair of sentences as input and learns to predict whether the second sentence follows the first in the actual document. In the training process [18], 50% of inputs are pairs in which the second sentence is the actual next sentence from the original document, and in the other 50% the second sentence is a random sentence from the corpus, which is assumed to be unrelated to the first [18]. The BERT input representation is illustrated in Figure 3.
Figure 3. BERT Input Representation [18]
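The sentence-pair sampling used for NSP can be sketched as below. This is an illustrative simplification, not the reference implementation: the function name `make_nsp_pairs` is hypothetical, negative examples are drawn from a different document, and the corpus is assumed to contain at least two non-empty documents.

```python
import random


def make_nsp_pairs(documents, seed=0):
    """documents: list of documents, each a list of sentences.
    Returns a list of (sentence_a, sentence_b, is_next) triples."""
    rng = random.Random(seed)
    pairs = []
    for d, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the true next sentence.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a random sentence from another document.
                other = rng.choice([k for k in range(len(documents)) if k != d])
                pairs.append((doc[i], rng.choice(documents[other]), False))
    return pairs
```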
In short, in the pre-training phase, BERT is trained on a large corpus of unlabeled text using the masked language modeling and next sentence prediction tasks. This process allows BERT to learn a general language representation, which can be fine-tuned for specific downstream tasks. Subsequently, in the fine-tuning phase, the BERT model is initialized with the pre-trained parameters and further adjusted using labeled data from the downstream task. Each downstream task has a separate fine-tuned model, although they are all initialized with the same pre-trained parameters [18].
2.5 IndoBERT (Indonesia Bidirectional Encoder Representations from Transformers)
IndoBERT is a variant of BERT, specifically pre-trained on Indonesian language data to adapt the BERT model for a better understanding and processing of the Indonesian language, with its unique syntactical and contextual nuances. This adaptation makes IndoBERT highly effective for NLP tasks involving Indonesian text such as sentiment analysis, text classification, and named entity recognition. In a related study, the authors implemented a fine-tuning technique using the IndoBERT-base-p1 model, a BERT-base architecture variant. This method involved using a pre-trained model which requires minimal additional training to be optimally adapted to new tasks. IndoBERT-base-p1 was trained on 4 billion words, covering around 250 million formal and colloquial Indonesian sentences [19]. The research utilized the Transformers library provided by HuggingFace, which offers thousands of pre-trained models for a wide range of tasks including classification, information extraction, question answering, summarization, translation, and text generation in over 100 languages, supported by leading deep learning libraries PyTorch and TensorFlow.
To achieve contextual representation of our input texts, we fine-tuned a pre-trained BERT model from the Transformers library. We adapted the fine-tuned BERT architecture, as described in [18] to address the classification task within our hoax detection system. The design of our adapted fine-tuned architecture is illustrated in Figure 4.
Figure 4. BERT Architecture for Hoax Classification [20]
In our approach, token embeddings were used to capture the significance of each token, while segment embeddings differentiated between the title and body of the article. Additionally, position embeddings were employed to denote the placement of tokens within our input sequences. These embeddings were then combined and input into BERT's Transformer layer. We utilized the leading [CLS] token from the context as a representation for the entire sequence of tokens. Subsequently, a classification layer was integrated to ascertain if an article is a hoax.
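The input construction and the classification head described above can be sketched in plain Python. This is a toy illustration with made-up two-dimensional embeddings; the Transformer layers between the two steps are omitted, and the names `build_input_embeddings` and `classify_from_cls` are hypothetical.

```python
import math


def build_input_embeddings(token_ids, segment_ids, tok_emb, seg_emb, pos_emb):
    """Sum token, segment and position embeddings for every input position."""
    embedded = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + s + p for t, s, p in zip(tok_emb[tok], seg_emb[seg], pos_emb[pos])]
        embedded.append(vec)
    return embedded


def classify_from_cls(cls_vector, weights, bias):
    """Linear layer followed by softmax; returns [P(fact), P(hoax)]."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example (all values are made up for illustration).
tok_emb = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}   # id 0 plays the role of [CLS]
seg_emb = {0: [0.0, 0.1], 1: [0.1, 0.0]}                  # 0 = title, 1 = body
pos_emb = [[0.01, 0.01], [0.02, 0.02], [0.03, 0.03]]
inputs = build_input_embeddings([0, 1, 2], [0, 0, 1], tok_emb, seg_emb, pos_emb)

# A real model would pass `inputs` through the Transformer layers here; position 0
# is reused directly as a stand-in for the encoded [CLS] vector.
probs = classify_from_cls(inputs[0], weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```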
2.6 Resampling Method
Previous studies have utilized various resampling techniques to address imbalanced datasets [21], [22], including:
a. Random Oversampling
Random oversampling is a technique in machine learning that addresses class imbalance by duplicating members of the minority class in the training data. This method helps to equalize the class distribution, aiding in better model training and performance [23].
b. Random Undersampling
Random undersampling is a technique used to address the issue of data imbalance. This method operates by randomly removing members of the majority class from the training dataset [23]. The goal is to produce a more balanced class distribution for machine learning models, helping them to perform better on minority class data.
2.7 Evaluation Model
The model is evaluated using the following measures:
a. Confusion Matrix
The performance of the IndoBERT model is assessed using a confusion matrix to gauge its efficacy in hoax prediction on the test dataset. The confusion matrix displays the number of correct and incorrect predictions for each class. Table 3 provides a visual representation of this evaluation metric. The table is divided into four quadrants, each representing a different outcome of the predictions:
1. TP (True Positive) = Correct identification of a positive case.
2. TN (True Negative) = Correct identification of a negative case.
3. FP (False Positive) = Incorrectly labeling a negative case as positive.
4. FN (False Negative) = Failing to identify a positive case
Table 3. Table of Confusion Matrix
Classification | Positive Prediction | Negative Prediction
Actual Positive | TP | FN
Actual Negative | FP | TN
b. Accuracy
Accuracy is the proportion of correct predictions among all predictions, computed as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{1}$$
c. F1 – Score
The F1-score is the harmonic mean of precision and recall, computed as follows:
$$F1 = \frac{2 \times (\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}} \tag{2}$$
d. Precision
Precision is the proportion of positive predictions that are correct, computed as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$
e. Recall
Recall is the proportion of actual positive cases that the classifier detects correctly, computed as follows:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$
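The four measures can be computed directly from the confusion-matrix counts, as in the following sketch. The function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for the example, not something specified in the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (recall * precision) / (recall + precision)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only (not results from this study).
example = classification_metrics(tp=90, tn=880, fp=10, fn=20)
```

With these illustrative counts the accuracy is 0.97 while the recall is only about 0.82, which shows why accuracy alone can be misleading when the positive class is the minority.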
3. RESULT AND DISCUSSION
3.1 Data Construction
A dataset comprising 27,747 news articles is considered in two conditions: imbalanced and balanced. In the imbalanced state, it consists of 20,945 articles labeled as non-hoax and 6,947 labeled as hoax. To address this imbalance, two resampling methods were employed: Random Over Sampling (ROS) and Random Under Sampling (RUS). With the RUS method, the dataset was balanced by reducing its size to a total of 8,314 articles, with each label (hoax and non-hoax) represented equally by 4,157 articles. In contrast, with ROS, the dataset was augmented to a total of 26,808 articles, with 13,404 articles per label, thereby achieving a balanced dataset. This approach manages the disparity in the original dataset, ensuring a more equitable representation of both labels for subsequent analyses. The label distribution of the dataset is shown in Table 4.
Table 4. Table of distribution label of dataset
Dataset | Hoax label (1): Data train | Hoax label (1): Data validation | Hoax label (1): Data test | Fact label (0): Data train | Fact label (0): Data validation | Fact label (0): Data test
Imbalanced Dataset | 5197 | 1750 | 1300 | 16756 | 4189 | 4189
Dataset after RUS | 5197 | 1040 | 1300 | 5197 | 1040 | 4189
Dataset after ROS | 16756 | 3352 | 1300 | 16756 | 3352 | 4189
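The two resampling methods can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function names and the `seed` parameter are assumptions, and a dedicated library could equally be used in practice.

```python
import random
from collections import Counter


def random_undersample(texts, labels, seed=42):
    """RUS: randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    n_min = min(Counter(labels).values())
    keep = []
    for label in set(labels):
        idxs = [i for i, y in enumerate(labels) if y == label]
        keep += rng.sample(idxs, n_min)
    rng.shuffle(keep)
    return [texts[i] for i in keep], [labels[i] for i in keep]


def random_oversample(texts, labels, seed=42):
    """ROS: randomly duplicate samples until every class matches the largest class."""
    rng = random.Random(seed)
    n_max = max(Counter(labels).values())
    keep = []
    for label in set(labels):
        idxs = [i for i, y in enumerate(labels) if y == label]
        keep += idxs + rng.choices(idxs, k=n_max - len(idxs))
    rng.shuffle(keep)
    return [texts[i] for i in keep], [labels[i] for i in keep]
```

Applied to a training split with 5,197 hoax and 16,756 fact articles, these functions would produce 5,197 per class (RUS) and 16,756 per class (ROS), matching the training columns of Table 4. The test split is left untouched so that evaluation reflects the original class distribution.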
3.2 Text Classification
In our research, we utilized IndoBERT, a model pre-trained for Indonesian language processing, for the text classification task, adapting the base model to our requirements. Table 5 lists the hyperparameters used for fine-tuning: a learning rate of 3e-6, which controls the step size of each optimization update; a batch size of 32, the number of samples processed before the model's parameters are updated; and 16 workers, which typically denotes the number of parallel data-loading processes. The table also specifies the model variant, “indobert-base-p1”, identifying the exact architecture and configuration used. Documenting these hyperparameters supports the reproducibility of the research findings.
Table 5. Hyperparameters in Fine-Tuned IndoBERT
Hyperparameter | Value
Learning rate | 3e-6
Batch size | 32
Num Worker | 16
Model | “indobenchmark/indobert-base-p1”
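The hyperparameters in Table 5 could be wired into a fine-tuning setup roughly as follows. This is a sketch under several assumptions: the paper does not state the optimizer (AdamW is assumed here), `train_dataset` is a hypothetical tokenized PyTorch dataset, `num_labels=2` follows from the binary hoax/fact labeling, and the names `FineTuneConfig` and `build_training_objects` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values taken from Table 5; num_labels follows from the binary labels."""
    model_name: str = "indobenchmark/indobert-base-p1"
    learning_rate: float = 3e-6
    batch_size: int = 32
    num_workers: int = 16
    num_labels: int = 2          # hoax (1) vs. fact (0)


def build_training_objects(config, train_dataset):
    """Create tokenizer, model, optimizer and data loader.

    The heavy imports are local so the configuration can be inspected
    without PyTorch or Transformers installed.
    """
    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        config.model_name, num_labels=config.num_labels
    )
    # Assumed optimizer: the paper reports only the learning rate.
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
    loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True,
        num_workers=config.num_workers,
    )
    return tokenizer, model, optimizer, loader
```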
The objective of this study is to evaluate the capability of the IndoBERT model in identifying false news in the Indonesian language. The model displayed high accuracy in the initial phase of classification. However, further analysis revealed that the model was overfitting due to the imbalanced dataset. To address this, we applied oversampling and undersampling techniques to balance the data. As a result, the model's validation accuracy improved, and the overfitting was substantially mitigated, as demonstrated in Figure 5.
Figure 5. Training Result (panels: Imbalanced Dataset, Dataset after RUS, Dataset after ROS)
On the imbalanced dataset, the model already achieves a precision of 97.5%, a recall of 96.0%, and an F1 score of 96.7%, with an overall accuracy of 97.6%. Applying RUS and ROS balances the dataset, which in turn improves the model's validation accuracy. After RUS, recall rises to 97.8% and the F1 score to 97.5%, with accuracy reaching 98.2%, while precision is 97.2%. Similarly, after ROS, recall rises to 97.5% and precision is 97.1%, yielding an F1 score of 97.3% and an accuracy of 98.0%. In both cases precision decreases slightly relative to the imbalanced baseline, while recall, F1 score, and accuracy improve. The results are shown in Table 6.
Table 6. Result of IndoBERT
Dataset | Precision | Recall | F1 Score | Accuracy
Imbalanced Dataset | 97.5% | 96.0% | 96.7% | 97.6%
Dataset after RUS | 97.2% | 97.8% | 97.5% | 98.2%
Dataset after ROS | 97.1% | 97.5% | 97.3% | 98.0%
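As a quick consistency check, the F1 scores in Table 6 can be recomputed from the reported precision and recall using Equation (2). The subscripts and the rounding to three decimal places are ours, introduced only for this worked example.

```latex
\begin{align*}
F1_{\text{imbalanced}} &= \frac{2 \times (0.960 \times 0.975)}{0.960 + 0.975} = \frac{1.872}{1.935} \approx 0.967 \\
F1_{\text{RUS}} &= \frac{2 \times (0.978 \times 0.972)}{0.978 + 0.972} = \frac{1.901232}{1.950} \approx 0.975 \\
F1_{\text{ROS}} &= \frac{2 \times (0.975 \times 0.971)}{0.975 + 0.971} = \frac{1.89345}{1.946} \approx 0.973
\end{align*}
```

All three values agree with the F1 Score column of Table 6.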
Figure 6 presents three confusion matrices corresponding to the datasets before and after the application of RUS and ROS. These matrices provide a visual representation of the model's classifications. For the imbalanced dataset, we observe a difference between the number of true positives and true negatives, which is expected given the dataset's skewed nature. Post-RUS and ROS, the confusion matrices exhibit a more balanced classification, with an increase in true negatives and a reduction in false negatives, showcasing the effectiveness of the balancing techniques in improving the model's performance.
Figure 6. Confusion Matrix (panels: Imbalanced Dataset, Dataset after RUS, Dataset after ROS)
Comparatively, the model's performance after the application of RUS and ROS exhibits a significant reduction in overfitting, which is evident from the higher accuracy and more balanced confusion matrix results.
The enhanced performance corroborates the hypothesis that a balanced dataset is critical for the generalization capabilities of deep learning models, particularly for classification tasks involving imbalanced classes.
4. CONCLUSION
This study addressed the challenge of hoax detection in Indonesian political news under class imbalance, a condition commonly encountered in real-world datasets. Our approach relied on IndoBERT, a BERT-based model pre-trained on the Indonesian language. The primary objective was to evaluate the efficacy of IndoBERT in classifying news articles as hoax or non-hoax under varying dataset compositions. The findings show that IndoBERT maintains consistently high accuracy across the resampling methods, Random Over Sampling (ROS) and Random Under Sampling (RUS). The highest accuracy, 98.2%, was observed on the dataset resampled with RUS, together with a recall of 97.8%, an F1-score of 97.5%, and a precision of 97.2%. These metrics suggest that RUS was particularly effective in enhancing the IndoBERT model's performance, making it a promising approach for dealing with imbalanced datasets in text classification tasks in the Indonesian language. This is particularly relevant for tasks like misinformation detection, where data imbalance is common. The success of IndoBERT, a language-specific BERT model, in text classification for the Indonesian language contributes to the understanding of BERT-based models in diverse linguistic contexts. This research also suggests the need for further investigation into IndoBERT's performance with more varied datasets and resampling strategies, and its comparison with other pre-trained models, to fully grasp its potential and optimize its application in real-world scenarios. In conclusion, the study reaffirms the suitability of the IndoBERT model for hoax detection in Indonesian political news, even in the face of dataset imbalance. It offers a promising direction for future research in NLP, particularly in the development and refinement of language-specific models for accurate and efficient text classification.
REFERENCES
[1] M. A. Rahmat, Indrabayu, and I. S. Areni, “Hoax Web Detection For News in Bahasa Using Support Vector Machine,”
2019 International Conference on Information and Communications Technology (ICOIACT), 2019, doi:
10.1109/ICOIACT46704.2019.8938425.
[2] Hanadian Nurhayati Wolff, “Internet usage in Indonesia - statistics & facts.” Accessed: Nov. 11, 2023. [Online].
Available: https://www.statista.com/topics/2431/internet-usage-in-indonesia/
[3] SIMON KEMP, “DIGITAL 2020: INDONESIA.” Accessed: Nov. 11, 2023. [Online]. Available:
https://datareportal.com/reports/digital-2020-indonesia
[4] P. Utami, “Hoax in Modern Politics: The Meaning of Hoax in Indonesian Politics and Democracy,” Jurnal Ilmu Sosial dan Ilmu Politik, vol. 22, no. 2, p. 85, Jan. 2019, doi: 10.22146/jsp.34614.
[5] J. A. Nasir, O. S. Khan, and I. Varlamis, “Fake news detection: A hybrid CNN-RNN based deep learning approach,”
International Journal of Information Management Data Insights, vol. 1, no. 1, Apr. 2021, doi:
10.1016/j.jjimei.2020.100007.
[6] A. Wani, I. Joshi, S. Khandve, V. Wagh, and R. Joshi, “Evaluating Deep Learning Approaches for Covid19 Fake News Detection”, doi: 10.48550/arXiv.2101.04012.
[7] R. K. Kaliyar, A. Goswami, and P. Narang, “FakeBERT: Fake news detection in social media with a BERT-based deep learning approach,” Multimed Tools Appl, vol. 80, no. 8, pp. 11765–11788, Mar. 2021, doi: 10.1007/s11042-020-10183- 2.
[8] F. Koto, A. Rahimi, J. H. Lau, and T. Baldwin, “IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP,” Nov. 2020, doi: 10.48550/arXiv.2011.00677.
[9] M. N. Fakhruzzaman, S. Z. Jannah, R. A. Ningrum, and I. Fahmiyah, “Clickbait Headline Detection in Indonesian News Sites using Multilingual Bidirectional Encoder Representations from Transformers (M-BERT),” Feb. 2021, [Online].
Available: http://arxiv.org/abs/2102.01497
[10] D. R. Faisal and R. Mahendra, “Two-Stage Classifier for COVID-19 Misinformation Detection Using BERT: a Study on Indonesian Tweets,” Jun. 2022, doi: 10.48550/arXiv.2102.01497.
[11] Muhammad Ikram Kaer Sinapoy, Yuliant Sibaroni, and Sri Suryani Prasetyowati, “Comparison of LSTM and IndoBERT Method in Identifying Hoax on Twitter,” Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), vol. 7, no. 3, pp.
657–662, Jun. 2023, doi: 10.29207/resti.v7i3.4830.
[12] S. Al-Azani and E. S. M. El-Alfy, “Imbalanced Sentiment Polarity Detection Using Emoji-Based Features and Bagging Ensemble,” in 1st International Conference on Computer Applications and Information Security, ICCAIS 2018, Institute of Electrical and Electronics Engineers Inc., Aug. 2018. doi: 10.1109/CAIS.2018.8441956.
[13] H. A. Najada and X. Zhu, “iSRD: Spam review detection with imbalanced data distributions,” Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014), 2014, doi:
10.1109/IRI.2014.7051938.
[14] S. Al–Azani and E. M. El–Alfy, “Imbalanced Sentiment Polarity Detection Using Emoji-Based Features and Bagging Ensemble,” 2018 1st International Conference on Computer Applications & Information Security (ICCAIS), pp. 1–5, 2018, doi: 10.1109/CAIS.2018.8441956.
[15] H. A. Najada and X. Zhu, “iSRD: Spam review detection with imbalanced data distributions,” Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014), 2014.
[16] Fransiscus and A. S. Girsang, “Sentiment Analysis of COVID-19 Public Activity Restriction (PPKM) Impact using BERT Method,” International Journal of Engineering Trends and Technology, vol. 70, no. 12, pp. 281–288, Dec. 2022, doi:
10.14445/22315381/IJETT-V70I12P226.
[17] W. Satriaji and R. Kusumaningrum, “Effect of Synthetic Minority Oversampling Technique (SMOTE), Feature Representation, and Classification Algorithm on Imbalanced Sentiment Analysis,” 2018 2nd International Conference on Informatics and Computational Sciences (ICICoS), Semarang, Indonesia, 2018, doi: 10.1109/ICICOS.2018.8621648.
[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Oct. 2018, doi: 10.18653/v1/N19-1423.
[19] B. Wilie et al., “IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding,” Sep.
2020, doi: 10.48550/arXiv.2009.05387.
[20] L. H. Suadaa, I. Santoso, and A. T. B. Panjaitan, “Transfer Learning of Pre-trained Transformers for Covid-19 Hoax Detection in Indonesian Language,” IJCCS (Indonesian Journal of Computing and Cybernetics Systems), vol. 15, no. 3, p. 317, Jul. 2021, doi: 10.22146/ijccs.66205.
[21] Y. Muliono, F. L. Gaol, B. Soewito, and H. L. H. S. Warnars, “Hoax Classification in Imbalanced Datasets Based on Indonesian News Title using RoBERTa,” in 2022 3rd International Conference on Artificial Intelligence and Data Sciences: Championing Innovations in Artificial Intelligence and Data Sciences for Sustainable Future, AiDAS 2022 - Proceedings, Institute of Electrical and Electronics Engineers Inc., 2022, pp. 264–268. doi:
10.1109/AiDAS56890.2022.9918747.
[22] A. D. Sanya and L. H. Suadaa, “Handling Imbalanced Dataset on Hate Speech Detection in Indonesian Online News Comments,” 2022 10th International Conference on Information and Communication Technology (ICoICT), pp. 380–385, 2022, doi: 10.1109/ICoICT55009.2022.9914883.
[23] W. Obaid and A. Nassif Bou, “The Effects of Resampling on Classifying Imbalanced Datasets,” 2022 Advances in Science and Engineering Technology International Conferences (ASET), 2022, doi: 10.1109/ASET53988.2022.9735021.