Rheza Ramadhan Putra, Copyright © 2023, MIB, Page 360

Part-of-Speech Tagging Implementation on Telkom University News using Bidirectional LSTM Method

Rheza Ramadhan Putra*, Donni Richasdy, Aditya Firman Ihsan School of Computing, Informatics Study Program, Telkom University, Bandung, Indonesia Email: 1,*[email protected], 2[email protected],

3[email protected]

Correspondence Author Email: [email protected]

Abstract: News is a tool used to disseminate information through various media, one of which is the internet. News articles often contain words that are not recognized in the dictionary, such as slang, as well as foreign words that do not exist in the corpus. The question is how a POS tagging model built on such a corpus can handle word class labeling in Indonesian news. This research examines the results of POS tagging on a manually selected collection of news about Telkom University. Using a bidirectional LSTM model, three test scenarios were conducted to improve the performance of the built model: applying the best padding length for the corpus, comparing the performance of a model built on a modified corpus against one built on the original corpus, and determining the dimension of the Word2vec vectors. The selected model from each corpus was then applied to the manually labeled news. The best test scenario was obtained by modifying the corpus: removing double words in the word class "X" and changing those "X" words that are more likely to be foreign words to the word class "FW". On the news about Telkom University, the bidirectional LSTM model built on the modified corpus achieved the best performance, with an accuracy of 92.74%, precision of 92.85%, recall of 92.74%, and F1-score of 92.48%.

Keywords: POS Tagging; Bidirectional LSTM; Indonesian; News

1. INTRODUCTION

Daily activities are inseparable from the two-way communication that has occurred in society from the past to the present. Of the various kinds of languages in the world, one of the most popular languages is Indonesian.

Indonesian itself is the official language of the country of Indonesia. Derived from Malay, one of the regional languages of the archipelago, it developed into an intermediary language, a 'lingua franca', between communities [1]. Indonesian is one of the most widely spoken languages in the world, with around 250 million speakers [2]. The use of Indonesian is spread through many media, in both formal and informal writing; one of these is the news media. Along with the development of the times, slang and foreign words are increasingly used in many mass media. News is also one of the important means of spreading new information in the world.

In that sense, the news itself means information that can be presented through print, broadcast, internet, or even word of mouth, so the news is one of the important things for the wider community [3]. News that is developing at this time does not always have a formal language, but also informal. So many news articles are created and shared online every day, making it difficult for users to find the news they are interested in [4]. So in news writing, foreign languages and slang words are always a consideration for adding stories to the news to make it look more interesting. The choice of words used is also a consideration so that the article is well organized in terms of language. Along with the times, various kinds of languages can be processed using the help of computers for many purposes such as question-and-answer applications.

Even though Indonesian is one of the most widely used languages, it still has limited resources for Natural Language Processing (NLP) needs [5]. NLP itself means processing text into a form a machine can work with. Part-of-speech (POS) tagging is one of the basic techniques in NLP because it is used in most NLP applications, such as sentiment analysis, question-answering tools, and word sense disambiguation [6]. POS tagging is the process of labeling word classes [7]; it works by automatically labeling the word classes in a sentence [8]. Commonly used approaches to POS tagging include rule-based, probabilistic, and transformation-based approaches. A rule-based POS tagger assigns a word class label to a word based on manually created linguistic rules. A probabilistic approach finds the most probable word class based on the context of the sentence [9]. The transformation-based approach is a combination of the rule-based and probabilistic approaches. In addition, POS tagging can be approached with a neural network. Bidirectional Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that combines contextual information from two LSTMs reading the input in opposite directions, and it has been proven to be an effective model for sequential labeling [10].

Lots of research on POS tagging has been carried out in various languages, including Indonesian. Handrata et al. [11] conducted research on POS tagging using the bidirectional LSTM method in Indonesian with a corpus from the University of Indonesia, which they used as training data and test data. Other research on POS tagging for Indonesian was carried out using various methods: Cahyani et al. [7] did POS tagging in Indonesian using the Hidden Markov Model method, Pisceldo et al. [9] evaluated two probabilistic models, Maximum Entropy and CRF (Conditional Random Fields), and Rashel et al. [12] built an Indonesian POS tagger using a rule-based approach. Local languages have also been covered, including Javanese [13] and Balinese [14].

Based on these considerations, this research determines the performance of a POS tagging model built using the bidirectional LSTM model on Indonesian news texts, making several modifications to the Indonesian corpus to adapt it to the structure of current news. Because writing conventions have been adjusted over time, words with writing errors are now rare, so the research focuses on modifications to the word classes "X" and "FW" as the main goal. For the implementation test, the selected texts are news texts related to Telkom University. One reason for choosing Telkom University's news is that it is one of the universities that is active in reporting, for example announcements related to scholarships.

2. RESEARCH METHODOLOGY

2.1 System Design

In the research conducted, an overview of the model design process for POS tagging using the bidirectional LSTM method is shown in Figure 1. The initial step of the research is to determine the news dataset to be used and then select three news texts to be tested into the built model. The three news texts are processed through the same process as the corpus through tokenization and pad sequences. For the corpus, the dataset is separated into training data and test data. Meanwhile, news text is only used as test data to determine model performance. The test was carried out by applying the bidirectional LSTM model which was built based on the original and the modified Indonesian corpus. The preprocessing stage is carried out for all corpus models that will be tested. The final stage of the research is to compare the results of the performance evaluation of the model built on the modified and original corpus.

Figure 1. POS tagging system design process 2.2 Indonesian Corpus Dataset

Building a POS tagging model with bidirectional LSTM certainly requires a corpus dataset. The corpus can be defined as a collection of authentic machine-readable texts that are sampled to represent a particular natural language or variety of languages [15]. The corpus dataset used in this research is the Indonesian language dataset.

The Indonesian corpus dataset used as training and test data for the built model came from the University of Indonesia [11]. The corpus gathers data from the Pan Asia Networking Localization network (PANL10n) and uses a total of 23 word classes. Table 1 lists all word classes with examples from the corpus dataset.

Table 1. Word Class on Corpus

Word Class Description Example

CC Coordinating conjunction dan, tetapi, atau

CD Cardinal number dua, sepertiga, 15

OD Ordinal number pertama, ke-4

DT Determiner para, sang, si


FW Foreign word climat change

IN Preposition di, dalam, oleh

JJ Adjective bersih, hitam, jauh

MD Modal and auxiliary verb boleh, harus, sudah

NEG Negation tidak, perlu, jangan

NN Noun bawah, sekarang, rupiah

NNP Proper noun Laut Jawa, Indonesia

NND Classifier, partitive, and measurement noun orang, ton, lembar

PR Demonstrative pronoun ini, itu, sini

PRP Personal pronoun saya, kami, mereka

RB Adverb sangat, hanya, justru

RP Particle pun, -lah, -kah

SC Subordinating conjunction jika, sebab, bahwa

SYM Symbol IDR, +, %

UH Interjection mari, hai, ayo

VB Verbs merancang, mengatur

WH Question apa, kenapa, bagaimana

X Unknown statemen

Z Punctuation “…”, ?, .

In the corpus dataset used, the sentences contained in the corpus come from news snippets. Then the group of sentences was labeled word class manually by Rashel et al. [12]. The Indonesian corpus consists of 10.000 sentences with a total of 256.683 tokens. Some examples of words and their classes in the corpus are shown in Table 2.

Table 2. Example Sentences on Corpus

Word  Class

Kera NN

untuk SC

amankan VB

pesta olahraga NN

Namun CC

, Z

juru bicara NNP

Wingnut NNP

mengatakan VB

mereka PRP

ingin RB

“ Z

bersiap-siap VB

” Z

. Z

2.3 Modified Corpus Dataset

In the research conducted, we did not only test the original corpus dataset; we also tested a modified corpus dataset, based on several considerations. The modifications were made to determine their effect relative to the original corpus. Two things were considered: the first is to remove double words in sentences in the word class "X", which mark a writing error; the second is to change several word-class-"X" words with writing errors that tend to be foreign words, so that they are changed to the word class "FW". The comparison before and after the modification is shown in Table 3.

Table 3. Example Before and After Modification

Before      After
atau CC     atau CC
setara NN   setara NN
dengan IN   dengan IN
dengan X    (removed)
5,6 CD      5,6 CD
bulan NN    bulan NN
kata VB     kata VB
-nya PRP    -nya PRP
dalam X     (removed)
dalam IN    dalam IN
Pidato NN   Pidato NN

Table 3 shows the differences before and after the deletion of double words, some of which were labeled as writing errors. One reason for this deletion is that writing errors such as double words are very rarely encountered in today's news. Another reason is to increase the accuracy of the built model so that it does not mislabel words that are supposed to be genuine reduplicated words. Changing several words from class "X" to "FW" is done to improve performance on foreign words in the corpus.
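As a sketch, the two modifications above could be implemented along the following lines (the sentence representation and helper function are hypothetical illustrations, not the authors' actual code):

```python
# Hypothetical sketch of the two corpus modifications: a sentence is a list
# of (word, tag) pairs, as in the corpus examples above.

def modify_sentence(sentence):
    """Drop an 'X' token that duplicates a neighbouring word (a writing
    error), then relabel any remaining 'X' tokens as foreign words ('FW')."""
    cleaned = []
    for i, (word, tag) in enumerate(sentence):
        prev_word = sentence[i - 1][0] if i > 0 else None
        next_word = sentence[i + 1][0] if i + 1 < len(sentence) else None
        if tag == "X" and word in (prev_word, next_word):
            continue  # double word: remove the erroneous copy
        if tag == "X":
            tag = "FW"  # remaining unknowns are assumed to be foreign words
        cleaned.append((word, tag))
    return cleaned

before = [("dengan", "IN"), ("dengan", "X"), ("5,6", "CD"), ("bulan", "NN")]
print(modify_sentence(before))
# [('dengan', 'IN'), ('5,6', 'CD'), ('bulan', 'NN')]
```

The first rule reproduces the "dengan IN / dengan X" case from Table 3; the second reproduces the "X" to "FW" relabeling.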

2.4 News Dataset

In the research conducted, one of the testing stages was implementing POS tagging on original news collected from several news portals. The collected news is Indonesian-language news related to Telkom University. Of the many news stories, three were selected for the POS tagging implementation test. Before implementation, the selected news was manually labeled with word classes based on the tagset of the corpus.

Shown in Table 4 are some examples of news collected.

Table 4. Example of News Dataset about Telkom University

Title: Daftar THE Young University Rankings 2021, Ada Kampus dari Indonesia Lho
Category: detikEdu
Author: Trisna Wulandari
Date: Kamis, 24 Jun 2021 13:31 WIB
Article: Times Higher Education (THE) merilis daftar THE Young University Rankings 2021 …
Scrape Time: 14/07/2021 18:58

Title: Persiapan PTM, 2.300 Mahasiswa Telkom University Divaksinasi
Category: detikEdu
Author: Muhammad Iqbal
Date: Senin, 21 Jun 2021 13:42 WIB
Article: Sebanyak 2.300 mahasiswa Telkom University ikut serta disuntik vaksin …
Scrape Time: 14/07/2021 18:59

2.5 Data Preprocessing

Data preprocessing is an important step to make it easier for the computer to read data for processing. Appropriate preprocessing stages can improve the performance of the built model. Several stages are devoted to building the right model. The research conducted to build a bidirectional LSTM POS tagging model involved several preprocessing stages, including tokenization, pad sequences, and word embedding.

2.5.1 Tokenization

Tokenization is the process of breaking sentences into individual words. Tokenization here also converts words into integer numbers, as illustrated by the results in Table 5. This is done because the bidirectional LSTM POS tagging model cannot accept words directly; they must be converted into integer form. Tokenization is carried out on the words to be processed as training data or test data, and also on the word classes. The word-to-integer and class-to-integer mappings must be stored in a JSON file, which is then used by the POS tagging model.

Table 5. Example Before and After Tokenization

Word (before): ['Monyet', 'besar', 'akan', 'dikerahkan', 'di', 'dua', 'stadion', 'untuk', 'mengusir', 'serbuan', 'monyet', 'kecil']
Word (after): [1847, 81, 17, 3862, 7, 105, 5230, 11, 2294, 4455, 1847, 242]
Class (before): ['NN', 'JJ', 'MD', 'VB', 'IN', 'CD', 'NN', 'SC', 'VB', 'NN', 'NN', 'JJ']
Class (after): [23, 2, 5, 17, 22, 3, 23, 13, 17, 23, 23, 2]
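The word-to-integer mapping in Table 5 can be sketched as follows (a minimal stand-in for a tokenizer such as Keras's Tokenizer; the indices are illustrative, not the ones in the table):

```python
# Minimal word-to-integer tokenizer sketch. Index 0 is reserved for padding;
# words are lowercased, so 'Monyet' and 'monyet' share one index, as in the
# corpus example above.

def build_vocab(sentences):
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            key = word.lower()
            if key not in vocab:
                vocab[key] = len(vocab) + 1  # 0 is reserved for padding
    return vocab

def encode(sentence, vocab):
    return [vocab[word.lower()] for word in sentence]

words = ['Monyet', 'besar', 'akan', 'mengusir', 'monyet', 'kecil']
vocab = build_vocab([words])
print(encode(words, vocab))  # [1, 2, 3, 4, 1, 5]
```

Note that 'Monyet' and 'monyet' map to the same integer, mirroring the shared index 1847 in Table 5.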

2.5.2 Pad Sequences

Pad sequences, or padding, is the step that fixes the length of a sentence sequence. Sentences to be processed have different lengths, so this step pads short sentences or truncates long ones to a fixed size for each data item, because the bidirectional LSTM POS tagging model requires a fixed sentence length. Empty positions are filled with zeros, which can be placed on the left or the right side according to one of the pad-sequences parameters. Table 6 shows an example of pad sequences with padding length 30, placing the zeros on the left side.

(5)

Table 6. Example Before and After Pad Sequences

Word (before): [1847, 81, 17, 3862, 7, 105, 5230, 11, 2294, 4455, 1847, 242]
Word (after): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1847 81 17 3862 7 105 5228 11 2294 4454 1847 242]
Class (before): [23, 2, 5, 17, 22, 3, 23, 13, 17, 23, 23, 2]
Class (after): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 2 5 17 22 3 23 13 17 23 23 2]
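The padding step can be sketched in plain Python (mirroring the 'pre' padding default of Keras's pad_sequences, with zeros on the left):

```python
def pad_sequence(seq, maxlen, padding="pre"):
    """Zero-pad a token sequence to a fixed length, or truncate it.
    'pre' puts the zeros on the left, as in Table 6."""
    if len(seq) >= maxlen:
        return seq[:maxlen]
    zeros = [0] * (maxlen - len(seq))
    return zeros + seq if padding == "pre" else seq + zeros

classes = [23, 2, 5, 17, 22, 3, 23, 13, 17, 23, 23, 2]
padded = pad_sequence(classes, maxlen=30)
print(padded)  # 18 leading zeros followed by the original 12 tokens
```

A zero index never collides with a real word because index 0 is reserved for padding during tokenization.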

2.5.3 Word Embedding

Word embedding is a way of representing words as vector values in a semantically meaningful space. Word embeddings are trained in an unsupervised way, usually on large amounts of data, and are capable of capturing semantic information. Examples of word embedding models are Word2vec and GloVe [16]. Word embedding converts words into vector values, as shown in Table 7; the vector values are calculated so that they reflect the semantic similarity between words. This research uses Word2vec, developed by Mikolov et al. [17], as the word embedding. After training, the word embedding model can detect synonymous words.
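The semantic-similarity idea can be illustrated with cosine similarity over toy vectors (the 3-dimensional values below are made-up stand-ins, not real Word2vec output, which in this research has 100 to 300 dimensions):

```python
import math

def cosine(u, v):
    # cosine similarity: dot product over the product of vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: 'serta' is treated as a near-synonym of 'dan' ("and"),
# so its vector is placed close to dan's; 'pada' ("at/on") is farther away.
dan   = [0.046, 0.018, -0.020]
serta = [0.050, 0.015, -0.018]
pada  = [0.015, -0.040, -0.061]

print(cosine(dan, serta) > cosine(dan, pada))  # True
```

A trained embedding produces this geometry automatically: synonymous words end up with high cosine similarity.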

Table 7. Example of Word Embedding

Before  After
dan     [ 0.04584909 0.01794344 -0.01987719 -0.22097957 -0.20313422 -0.0075354 …
pada    [ 1.51993707e-02 -4.02968749e-02 -6.06044643e-02 -2.09476963e-01 …
dengan  [-8.79684463e-03 6.46051625e-03 -2.43408039e-01 -1.82369828e-01 …

2.6 POS Tagger Bidirectional LSTM

Bidirectional Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that can combine contextual information and has been proven to be an effective model for labeling [10]. A bidirectional LSTM works through two LSTMs running in opposite directions [18]. In the study of Handrata et al. [11], it is explained that the LSTM architecture explicitly divides the state vector into two parts, one half treated as memory cells and the other as working memory. Memory cells are formed to maintain a memory: for each input, gates decide how much of the new input should be written into the cell's memory and how much of the cell's memory content should be erased.

Figure 2. Bidirectional LSTM as POS tagging

Based on the study of Anbukkarasi et al. [19], the process of using a bidirectional LSTM for POS tagging is as shown in Figure 2. Here wi is a one-hot representation of the current word, a binary vector with dimension |V|, where V is the vocabulary. To reduce the size of |V|, all incoming words are lowercased. Capitalization information is kept in a three-dimensional binary vector f(wi), which indicates whether wi is all lowercase, all uppercase, or capitalized. The input vector Ii is calculated as

Ii = W1 wi + W2 f(wi) (1)

In Equation (1), W1 and W2 are weight matrices connecting the two layers. W1 wi is also known as the word embedding of wi, which is a vector value. In practice, W1 is a lookup table, and W1 wi is obtained by looking up the word embedding of wi stored in the table. The output layer is a softmax layer whose dimension is the number of tag types; the output is the tag probability distribution of the word wi.
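Equation (1) can be sketched with toy dimensions (the sizes below are illustrative, not the paper's; the capitalization feature follows the three cases described above):

```python
import numpy as np

# Toy-dimension sketch of Equation (1): I_i = W1 w_i + W2 f(w_i).
rng = np.random.default_rng(0)
V, d = 5, 4                          # toy vocabulary and embedding sizes
W1 = rng.standard_normal((d, V))     # embedding matrix, used as a lookup table
W2 = rng.standard_normal((d, 3))     # weights for the capitalization feature

def cap_feature(word):
    """3-d binary vector: all lowercase, all uppercase, or capitalized."""
    return np.array([word.islower(), word.isupper(), word.istitle()], float)

w = np.zeros(V)
w[2] = 1.0                           # one-hot w_i for word index 2
I = W1 @ w + W2 @ cap_feature("Telkom")   # Equation (1)

# Multiplying W1 by a one-hot vector is just a column lookup:
assert np.allclose(W1 @ w, W1[:, 2])
print(I.shape)  # (4,)
```

The assertion shows why W1 can be implemented as a lookup table rather than a full matrix multiplication.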

2.7 Model Evaluation

Model evaluation in this POS tagging research with a bidirectional LSTM uses a confusion matrix. In machine learning classification, the confusion matrix is widely used to calculate model performance [20]. There are four important components in the confusion matrix: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). These four components are used to calculate the accuracy, precision, recall, and F1-score of the model in this research.
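The four components yield the reported metrics as follows (binary form shown for clarity; the paper's multi-class scores are weighted averages of the same per-class formulas):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy counts, not results from the paper.
acc, prec, rec, f1 = metrics(tp=90, tn=50, fp=10, fn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

F1 is the harmonic mean of precision and recall, which is why it can sit slightly below both when they differ.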

3. RESULT AND DISCUSSION

The research evaluation phase uses 70% training data and 30% test data with 10 epochs as the main parameter.

The research is divided into four main parts to evaluate the system that has been built. The first scenario is a performance test of the model built with the original corpus, testing the pad-sequences parameter with padding lengths from 30 to 100. The second scenario does the same for the model built on the modified corpus. From these scenarios, the best-performing model from each corpus is selected and tested in a third scenario with Word2vec vector sizes of 100, 200, and 300 to determine which performs better. The final stage implements the models on several selected Telkom University news articles that were manually labeled based on the corpus. The model chosen for implementation is the best-performing one from the scenarios for the original and modified corpus models.
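The 70/30 evaluation split described above can be sketched as follows (toy stand-in data and an illustrative seed; the actual corpus has 10.000 sentences):

```python
import random

random.seed(42)                   # illustrative seed, not from the paper
sentences = list(range(100))      # stand-ins for tokenized sentences
random.shuffle(sentences)

split = int(0.7 * len(sentences))            # 70% training data
train, test = sentences[:split], sentences[split:]
print(len(train), len(test))  # 70 30
```

Shuffling before splitting keeps both partitions representative of the whole corpus.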

3.1 Scenario 1 Effect of Pad Sequences Parameters on POS Tagging Model

The first test scenario of the research is to determine the best padding parameter dimensions on the bidirectional LSTM POS tagging model built based on the original corpus. The test scenario is to determine the effect of one of the parameters that determine the maximum length of a sentence in building a bidirectional LSTM POS tagging model on performance. The best model results will then be used for the next step of the test scenario. The selected model is then compared with the padded model in the modified corpus. The parameters tested for padding range from 30 to 100 as shown in Table 8.

Table 8. Pad Sequence Parameter Results in the POS Tagging Model

Padding Accuracy Precision Recall F1-score

30 96.87 96.87 96.87 96.86

40 96.86 96.86 96.86 96.85

50 96.93 96.93 96.93 96.91

60 96.93 96.94 96.93 96.93

70 96.94 96.93 96.94 96.92

80 96.92 96.91 96.92 96.90

90 96.94 96.94 96.94 96.93

100 96.97 96.97 96.97 96.96

Based on the first-scenario results for the bidirectional LSTM POS tagging model built on the original corpus, the performance increase between padding parameters 30 and 100 is not very significant. However, considering that the longest sentence in the corpus has 82 words, the model chosen for the next scenario uses a padding length of 100. The padding parameter of 100 was chosen so that no training data is wasted and to handle the Telkom University news dataset, which may contain sentences longer than 90 words. In addition, the padding parameter of 100 performs better than the other padding lengths tested, with a model accuracy of 96.97%, precision of 96.97%, recall of 96.97%, and F1-score of 96.96%.

3.2 Scenario 2 Comparison of Modified Corpus to POS Tagging Model Performance

Based on the results of the previous scenario, the original corpus used to determine the padding parameter dimension is compared in the second test scenario against a model built on the modified corpus, determining the best padding parameter dimension in the same way as before. The modifications made to the corpus are removing double words in the word class "X" and changing some word-class-"X" words that should be "FW". Table 9 compares the models built on the original and modified corpora.

Table 9. Comparison of Padding Parameters of Original and Modified Corpus Models

Corpus Padding Accuracy Precision Recall F1-score

Original

30 96.87 96.87 96.87 96.86

40 96.86 96.86 96.86 96.85

50 96.93 96.93 96.93 96.91

60 96.93 96.94 96.93 96.93


70 96.94 96.93 96.94 96.92

80 96.92 96.91 96.92 96.90

90 96.94 96.94 96.94 96.93

100 96.97 96.97 96.97 96.96

Modification

30 96.78 96.79 96.78 96.77

40 96.90 96.89 96.90 96.89

50 96.90 96.90 96.91 96.90

60 96.93 96.94 96.94 96.93

70 96.96 96.96 96.96 96.95

80 96.96 96.96 96.96 96.95

90 96.95 96.95 96.95 96.94

100 96.99 97.00 96.99 96.98

The results of the second test scenario in Table 9 show that modifying the word class "X" in the original corpus affects performance and gives good results for the built bidirectional LSTM POS tagging model. Eliminating double words marked as writing errors in the word class "X" reduces model errors on sentences that do contain genuine reduplicated words. Some "X" words also carry meanings that are more likely foreign words, so changing them to the "FW" class increases the performance of the model. Based on the test results, one model was chosen from each corpus for the next stage. The best result of the second stage, an accuracy of 96.99%, precision of 97.00%, recall of 96.99%, and F1-score of 96.98%, belongs to the modified corpus model.

3.3 Scenario 3 Testing Word2vec Vector Dimensions on the POS tagging Model

The third scenario test is to determine the Word2vec vector dimension used as word embedding in the built POS tagging model. In the previous testing stage, the best model was selected for each of the original and modified corpora which were then tested in the third stage. Both models have the same padding parameter dimension, which is 100. Tests were carried out with Word2vec vector dimensions of 100, 200, and 300 for each model as shown in Table 10 to see the effect on the bidirectional LSTM tagging POS model.

Table 10. Testing on Vector Dimensions of Word2Vec

Corpus Vector Size Accuracy Precision Recall F1-score

Original

100 96.81 96.81 96.81 96.80

200 96.84 96.83 96.84 96.83

300 96.94 96.93 96.94 96.93

Modification

100 96.83 96.84 96.83 96.83

200 96.97 96.97 96.97 96.97

300 96.99 97.00 96.99 96.98

Table 10 shows that the POS tagging model built on the modified corpus again improves performance over the original corpus, and that the number of vector dimensions for the Word2vec word embedding affects the performance of the bidirectional LSTM POS tagger. Reducing the Word2vec vector dimension to 100 limits the word embedding, making it suboptimal at handling synonyms: the Word2vec model was built on Wikipedia data containing a total of 460.575 articles, and accommodating this amount of article data requires more than 100 vector dimensions. Therefore, the 300-dimension vector was chosen as the best, even though its performance difference from the 200-dimension vector is small. However, the 300-dimension Word2vec model has a drawback: its size is larger, 484 MB, compared to 324 MB for the 200-dimension model.

3.4 Comparison of the Performance Results of the Implementation of the POS Tagging Model in Telkom University News

In the third test scenario, the best POS tagging model was selected from each of the original and modified corpora based on the results of the analysis. The final stage of the research compares the performance of the selected models on news about Telkom University. Three articles were manually selected as samples and manually labeled with word classes based on the corpus. The three articles were combined into one file and tested on each selected model. With a total of 1.019 tokens, the performance results of the two models are presented in Table 11.

Table 11. Comparison of Model Performance

Corpus        Accuracy Precision Recall F1-score
Original      91.66    91.83     91.66  91.51
Modification  92.74    92.85     92.74  92.48

Table 12. Comparison of Corpus Model Implementation Results

Original Modification

Telkom NNP Telkom NNP

University NNP University NNP

juga RB juga RB

menjadi VB menjadi VB

Perguruan Tinggi NN Perguruan Tinggi NN

Swasta JJ Swasta JJ

( Z ( Z

PTS CD PTS NNP

) Z ) Z

pertama OD pertama OD

ini PR ini PR

menempati VB menempati VB

peringkat NN peringkat NN

ke-7 NNP ke-7 CD

setelah SC setelah SC

Universitas NN Universitas NN

Airlangga NNP Airlangga NNP

. Z . Z

Table 11 shows the difference in performance between the bidirectional LSTM POS tagging models built on the modified and original corpora: the model built on the modified corpus improves by about one percent. By modifying the word class "X", the model gives better performance results. However, the analysis shows that the model repeatedly made mistakes when tagging people's names, organization names, and place names as the word class "NNP" when they did not exist in the corpus, which reduces the performance of the built POS tagging model. Table 12 compares the implementation results of the models built with the original and modified corpora. One difference lies in the word "PTS", which stands for "Perguruan Tinggi Swasta": the original corpus model tags it as "CD", a cardinal number that should answer the question "how much?", whereas the modified corpus model correctly tags it as "NNP". Further down in Table 12, both models make the same error on the word "ke-7", which should be the word class "OD"; the modified corpus model tags it as "CD", which is close to "OD". The word "Universitas", which should be the word class "NNP", is tagged as "NN" by both models.

4. CONCLUSION

Four test scenarios were carried out: selecting the best pad-sequences parameter, comparing the pad-sequences results of the model built on the modified corpus, testing the word embedding vector dimensions, and comparing the performance of the models built on the original and modified corpora on Indonesian news related to Telkom University. From these tests it can be concluded that applying a pad-sequences length of 100 improves performance, though not very significantly; it was chosen considering that news sentences may exceed 90 words. Comparing the pad-sequences results of the models built on the original and modified corpora shows that the modified corpus model performs better. In the test of word embedding vector dimensions, the best result is obtained by the modified corpus model with 300 vector dimensions, reaching an accuracy of 96.99%; the larger number of vector dimensions is needed because the word embedding model is built on 460.575 articles. Finally, comparing the implementation of the modified and original corpus models on the selected news shows that the modified corpus model achieves the best performance: 92.74% accuracy, 92.85% precision, 92.74% recall, and 92.48% F1-score. Based on these results, modifying the original corpus to build a bidirectional LSTM POS tagging model can improve model performance: changing the word class "X" by removing double words labeled as writing errors and relabeling words that tend to be foreign words as "FW" yields better performance than the unmodified corpus model. One weakness of the built model is that it makes errors in detecting the word class "NNP" for words with no data in the corpus. This is a consideration for further research: adding sentence structures that contain many new "NNP" words as the main focus.


