Retno Diah Ayu Ningtias, Copyright © 2023, MIB, Page 536

People Entity Recognition for the English Quran Translation using BERT

Retno Diah Ayu Ningtias*, Moch. Arif Bijaksana

Informatics, School of Computing, Telkom University, Bandung, Indonesia
Email: 1,*[email protected], 2[email protected]

Correspondence Author Email: [email protected]

Abstract−The Quran is the holy book of Muslims all over the world. It has therefore been translated not only into Indonesian but also into many other languages, including English. The Quran comprises thousands of verses, each covering different topics and entities, and readers sometimes find its contents difficult to understand and study. To make this easier, information is extracted and various entities in the Quran, such as human entities, are identified. A prerequisite for extracting human entities is first extracting the information related to those entities themselves, because it supports the search process, particularly searching for the names of people in the Quran.

The extraction of human entities is commonly known as Named Entity Recognition (NER). With NER, important entities such as people's names, group names, and other entities in a sentence or verse of the Quran can be recognized automatically.

Research on the English translation of the Quran is currently not widespread. Therefore, in this research, we build an information extraction system model for human entities based on a pre-trained deep learning model called Bidirectional Encoder Representations from Transformers (BERT). The dataset consists of 19473 tokens and 720 entities taken from the website tanzil.net. The development of the model shows that BERT can be used for NER information extraction on the English translation of the Quran, obtaining an F1-score of 53%.

Keywords: Quran; Information Extraction; Named Entity Recognition; Extraction of Human Entities; BERT

1. INTRODUCTION

The Quran contains 6,200 verses, spread across 114 chapters in 30 sections [1]. The verses carry various meanings and interpretations. The Quran is believed to have been revealed directly by God through the angel Gabriel to the Prophet Muhammad and passed down to all humanity through the generations without alteration [2]. The Quran serves as a guide and a source of education for those who practice and study it. In education, the Quran is often used as research material to make it easier for new learners to understand and study its content in depth. One way to study the Quran is through its translations. The translations contain various entities, one of which is the human entity. One way to understand and identify these entities is through Information Extraction (IE).

Information Extraction (IE) is the process of extracting relevant information from a text. This process can include stages such as text cleaning, entity recognition, and pattern matching. IE extracts specific information related to a chosen topic from sentences or text, and it can be applied to structured, semi-structured, or unstructured text. IE can also be used to convert unstructured information in text into structured data, making it possible to fill relational databases and to process the data further in other systems [3]. This allows for more efficient and effective data processing and makes data easier to access and analyze [4].

Information Extraction relies on the recognition of named entities, also known as Named Entity Recognition (NER) [5]. NER is the first step toward information extraction. The relationship between the two is that NER is part of Information Extraction, where NER aims to extract named entities from the text as relevant information. In the phrase "Named Entities," the word "named" is used to limit the assignment to only entities [6].

In this way, entities are first recognized as belonging to one of several categories, such as location (LOC), person (PER), or organization (ORG) [7].

Named entity recognition is also used to evaluate a piece of text and identify the different entities in it; this applies not only to token categories but also to variable-length phrases. The model takes into account the beginning and end of each relevant phrase according to the classification category on which it is trained.

After the information category is recognized, information extraction extracts the information and creates a machine-readable document, which can then be processed by algorithms to extract meaning.

Research related to NER has been a major topic within research on Information Extraction (IE) [8]. There are various methods used to solve NER, such as Supervised Learning, Unsupervised Learning, Transfer Learning, Hybrid Systems, Rule-based Systems, and Transformers [9]. The choice of method depends on the availability of data, the complexity of the problem, and computational constraints. However, as technology advances and larger datasets become available, deep learning methods such as BERT and Transformers are becoming the preferred approach. A study on BERT for NER was conducted by Chantana Chantrapornchai and Aphisit Tunsakul for the tourism industry, using information about restaurants, hotels, shopping, and tourism. Their experiments on review text in the tourism domain show how to build a model that extracts desired entities such as name, location, or facilities, as well as relationship types. The accuracy of BERT on the named entity recognition test set reached up to 99% [10].

This is the basis on which the author builds a system for extracting human entities from the English translation of the Quran using the BERT method. BERT is a new language representation model designed to train


deep, bidirectional representations of unlabeled text. BERT is conceptually simple and empirically powerful.

BERT also has very competitive performance for English NER [11]. For human entity extraction, however, the focus is only on identifying human entities such as people's names or words representing descriptions of human entities, such as "people". Several previous studies have performed this task; one of them extracted human entities from the English Quran corpus using the Support Vector Machines (SVM) method, resulting in an accuracy rate of 75% [12].

The goal of building this system is to make the Quran easier to understand. The input of the human entity extraction system is the Quran with its English translation, and the output is the human entities extracted by the system. The system focuses only on retrieving the names of people entities in the Quran. For clarity, consider Surah Al-Baqarah verse 82: "And those who believe and do good deeds, they are the inhabitants of heaven. They will remain in it." In this translation, the system should recognize the words 'those who believe' as a human entity, so the output is 'those who believe'. The data we obtained is still raw, unstructured text, so preprocessing must be done first. At the end of the research, an evaluation is performed by calculating the F1-score, which serves as the reference for the performance of the NER system model that has been built.

2. RESEARCH METHODOLOGY

2.1 System Flow

In this research, the BERT model is used to extract human entities from the English translation of the Quran. The model is trained using the English Corpus Quran dataset for the training and evaluation processes. Before training, the dataset is pre-processed and divided into two sets, one for training the model and the other for evaluating it. After training, the model is tested by predicting labels on new data that was not seen during the training phase. The test results are then analyzed to determine the model's ability to extract human entities. Figure 1 shows the testing process from start to finish, built for Named Entity Recognition (NER) specifically for human entities and people's names in the English translation of the Quran using the BERT method.

Figure 1. System Design

2.2 Data Pre-processing

Before using the BERT model to classify the entities of a token, it is necessary to perform preprocessing on the data first [13]. The data is preprocessed before it enters the training and testing phase. Preprocessing is done using the following techniques:


1. Punctuation Removal

This is the process of removing punctuation marks found in a sentence, such as '.', '?', ';', etc.

2. Tokenization

Tokenization is the process of breaking a sentence down into smaller units called tokens, where a token here refers to a word.

3. POS-Tagging

POS tagging (part-of-speech tagging) is an NLP task that labels each word in a sentence with its function within the sentence structure of a specific language, i.e., its part of speech. Examples of NLP tasks that build on POS tagging include NER, machine translation, and constituent parsing [14]. Using the POS tagging feature can also improve the performance of the best-performing model [15].
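The three preprocessing steps above can be sketched in plain Python (a minimal illustration, not the authors' actual pipeline; the POS-tagging step is only indicated in a comment, since it would require an external tagger such as NLTK's `pos_tag`):

```python
import string

def preprocess(sentence):
    """Apply punctuation removal and tokenization to one verse."""
    # 1. Punctuation removal: drop '.', '?', ';', ',' and the rest of string.punctuation
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenization: split the cleaned sentence into word tokens
    tokens = cleaned.split()
    # 3. POS tagging would normally follow, e.g. nltk.pos_tag(tokens),
    #    yielding pairs such as ("Jesus", "NNP"); omitted here to keep
    #    the sketch dependency-free.
    return tokens

print(preprocess("And We gave Jesus, the son of Mary,"))
# → ['And', 'We', 'gave', 'Jesus', 'the', 'son', 'of', 'Mary']
```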

2.3 Data Split

After preprocessing, the researcher manually labels the tokens in the dataset, making it ready for use. To train and test the model, two separate, non-overlapping sets of data are required. The dataset is therefore divided into two parts: the training data, which is 80% of the total, and the test data, which is 20%.
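The split can be sketched as follows (a simple sequential 80/20 cut for illustration; in practice a shuffled split, e.g. scikit-learn's `train_test_split`, would be typical):

```python
def train_test_split(rows, train_ratio=0.8):
    """Split labeled rows into non-overlapping train (80%) and test (20%) sets."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

rows = list(range(100))  # stand-in for the labeled tokens of the dataset
train, test = train_test_split(rows)
print(len(train), len(test))  # → 80 20
```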

2.4 BIO Format

The BIO format, short for Beginning-Inside-Outside, is a tagging format used in NER (Named Entity Recognition). The prefix B- before a tag indicates that the tag is at the beginning of a chunk, the prefix I- indicates that the tag is inside a chunk, and the tag O indicates that the token is not part of an entity.
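A minimal sketch of assigning BIO labels to a tokenized verse (the helper function and the span indices are illustrative, not from the paper):

```python
def bio_tags(tokens, entity_spans, tag="PER"):
    """Assign BIO labels: B- at chunk start, I- inside the chunk, O outside."""
    labels = ["O"] * len(tokens)
    for start, end in entity_spans:          # end index is exclusive
        labels[start] = f"B-{tag}"
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"
    return labels

tokens = ["But", "they", "who", "believe", "and", "do", "righteous", "deeds"]
print(bio_tags(tokens, [(1, 4)]))
# → ['O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O']
```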

2.5 BERT

BERT is a state-of-the-art language model that learns deep, bidirectional representations of text by looking at the context before and after each word in all layers of the model. BERT is conceptually simple and empirically powerful, and it has very competitive performance for English NER [11]. BERT, which is designed to perform pre-training for language models, has been proven to effectively improve many natural language processing tasks; the result of pre-training allows BERT to be fine-tuned with just one additional output layer [11]. Fine-tuning approaches such as transformers introduce task-specific parameters and can be trained on specific tasks simply by fine-tuning all pre-trained parameters. The goal is the same: to use a language model to learn general language representations. The BERT model is composed of multiple layers of transformer encoders that process the input in both directions. The way BERT works is divided into two parts:

1. Pre-training. In this process, the BERT model is trained on unlabeled data. There are two training tasks for BERT on unlabeled data:

a. Masked LM: some input tokens are randomly masked, with the goal of predicting the vocabulary id of the obscured original token.

b. Next Sentence Prediction (NSP): a prediction of the next sentence. In this task, pre-training is done by selecting a sentence X and a sentence Y, where sentence Y has a 50% chance of being the sentence that actually follows sentence X (labeled "isNext") and a 50% chance of being a random sentence from the corpus (labeled "notNext").

2. Fine-tuning. The trained parameters are then fine-tuned for downstream tasks, such as text classification, question answering, etc. Figure 2 shows the use of the BERT model for NER.
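The Masked LM objective can be illustrated with a minimal simulation (plain Python; the real BERT recipe also replaces some selected tokens with random or unchanged tokens rather than always using [MASK], which is omitted here):

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def mask_for_mlm(tokens, rng):
    """Randomly mask ~15% of tokens; the model must predict the originals."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_PROB:
            masked.append(MASK)
            targets[i] = tok       # original token the model should recover
        else:
            masked.append(tok)
    return masked, targets

tokens = "the path of those upon whom you have bestowed favor".split()
masked, targets = mask_for_mlm(tokens, random.Random(0))
```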

Figure 2. BERT for NER


2.6 Evaluation Metrics

The system evaluation is meant to determine how accurately the system classifies. In this research, three main measurable metrics are used to measure the performance of each model built. Evaluation is done by calculating the F1-score, the harmonic mean of precision and recall, which measures how well the model recognizes the actual entities. The F1-score is a suitable metric for NER tasks, as it takes both precision and recall into account in a single value. A named entity prediction is considered correct if the entity extracted by the system matches the actual entity in the data. Measuring F1 requires precision and recall; the F1 value in this calculation is used as a reference for the performance of the NER model built in this research. Precision is the percentage of named entities found by the system that are correct, while recall is the percentage of the actual named entities that the system finds.
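These metrics can be computed as follows (an entity-level sketch; the exact matching scheme used in the paper's evaluation is an assumption):

```python
def precision_recall_f1(true_entities, pred_entities):
    """Entity-level scores: a prediction counts only if it exactly matches a gold entity."""
    tp = len(true_entities & pred_entities)          # true positives
    precision = tp / len(pred_entities) if pred_entities else 0.0
    recall = tp / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The paper's reported precision and recall give its reported F1:
p, r = 0.63, 0.46
print(round(2 * p * r / (p + r), 2))  # → 0.53
```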

3. RESULT AND DISCUSSION

In this research, the dataset used is the English translation of the Quran, with 19474 tokens selected from the Surahs Al-Fatihah, Al-Baqara, and Al-Imran. The dataset is divided into two parts: training data and test data. The tagging scheme used is the BIO format, with the dataset stored in the same format as the NER CoNLL-2003 dataset. This research uses only the PER (Person) entity tag, which represents a person entity; other tokens are denoted by the letter O (Other). The dataset is made up of individual words or phrases (entities) and labels that indicate whether each entity is the beginning (Begin), a continuation (Inside), or not part of a named entity (Outside), using the BIO notation. The dataset also includes the POS features for each word, obtained in the pre-processing stage.
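Reading such a CoNLL-style dataset can be sketched as follows (the tab-separated word/POS/BIO column layout is an assumption for illustration; the actual file may use a different delimiter or extra columns):

```python
def parse_rows(lines):
    """Parse CoNLL-style rows: one token per line with word, POS tag, BIO label."""
    dataset = []
    for line in lines:
        word, pos, bio = line.split("\t")
        dataset.append({"word": word, "pos": pos, "bio": bio})
    return dataset

rows = parse_rows(["they\tPRP\tB-PER", "who\tWP\tI-PER", "believe\tVBP\tI-PER"])
print(rows[0])  # → {'word': 'they', 'pos': 'PRP', 'bio': 'B-PER'}
```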

The following are the results of the preprocessing performed on the data. Table 1 shows the stage where the text still contains symbols or punctuation marks, which are then removed to make the system easier to build.

Table 1. Example of punctuation removal

Input:  And We gave Jesus, the son of Mary,
Output: And We gave Jesus the son of Mary

From Table 2, it can be seen that the purpose of this process is to break each sentence into word tokens, separated here by "," followed by a space.

Table 2. Example of tokenization

Input:  And We gave Jesus, the son of Mary,
Output: And, We, gave, Jesus, the, son, of, Mary

From Table 3, it can be seen that the purpose of this process is to label words according to the rules.

Table 3. Example of POS tagging

Word   POS Tag
Jesus  NNP
the    DT
son    NN
of     IN
Mary   NNP

Table 4 shows the structure of the dataset used, taken from a portion of a verse in the English translation of the Quran. The dataset used can be found at: https://bit.ly/datasetNER

Table 4. Example of dataset

Chapter  Verse  Word        POS Tag  BIO Label
2        82     But         CC       O
                they        PRP      B-PER
                who         WP       I-PER
                believe     VBP      I-PER
                and         CC       O
                do          VBP      O
                righteous   JJ       O
                deeds       NNS      O
                those       DT       O
                are         VBP      O
                the         DT       O
                companions  NNS      B-PER
                of          IN       I-PER
                paradise    NNP      I-PER

After the dataset is collected, pre-processing is performed on it, including splitting the data into training and testing sets. The final step is to evaluate the system that has been built. From the research conducted using the BERT method to extract human entities from the English translation of the Quran, Table 5 shows the results obtained from the testing: the BERT method produces an F1-score of 53%.

Table 5. Performance Results of the BERT Method

Model  Feature    Precision  Recall  F1 Score
BERT   BERT Base  0.63       0.46    0.53

Additionally, to show the results of the BERT model in classifying the labels B-PER, I-PER, and O, see Table 6 for correct predictions and Table 7 for incorrect predictions.

Table 6. Example of results from the BERT model showing correct predictions

Word       True Label  BERT Prediction
But        O           O
they       B-PER       B-PER
who        I-PER       I-PER
believe    I-PER       I-PER
and        O           O
do         O           O
righteous  O           O
deeds      O           O

Table 7. Example of results from the BERT model showing wrong predictions

Word      True Label  BERT Prediction
The       O           O
path      O           O
of        O           O
those     B-PER       B-PER
upon      I-PER       I-PER
whom      I-PER       I-PER
you       I-PER       O
have      I-PER       O
bestowed  I-PER       O
favor     I-PER       O

According to Table 6, the model can predict each of these entities correctly because such entities appear most frequently in the training data. However, as shown in Table 7, the model cannot correctly predict some tokens that are part of longer multi-word entities.

Based on the results obtained from the tests and the analysis performed, it can be concluded that the BERT method can be used for the development of an entity recognition system for the Quran, but human proofreading and correction by experts are needed when assigning entity labels to the dataset in order to reduce labeling errors.

4. CONCLUSION

In this research, we used the BERT method to extract human entities from the English translation of the Quran.

Based on the results of the experiments conducted, we found that deep learning-based models can be used for extracting human entities. BERT's difficulty in extracting human entities from the English translation of the Quran resulted in an F1-score of 53%. We also found that the model had difficulty classifying nested phrases, indicating that its performance in entity extraction is not yet optimal. For further research, due to the lack of training


data, which causes low performance when extracting human entities, adding training data can improve the performance of the model. Alternative training strategies can also be applied to improve the model's overall performance. In this research, entities were limited to names and people only; further research can extend the entity levels to include nested entities so that better and more complex entities can be extracted. The authors therefore suggest that future researchers conduct experiments with more datasets in order to produce a better system for extracting human entities from the English translation of the Quran.

ACKNOWLEDGMENT

We are very grateful to Muhammad Aris Maulana, who uploaded an open-source dataset of the English translation of the Al Quran on his personal GitHub.

REFERENCES

[1] S. Hossein Nasr, The Study Quran, 1st ed. New York: Harper One, 2015.

[2] A. Drajat, Ulumul Qur’an Pengantar Ilmu-Ilmu Al-Qur’an, 1st ed. Depok: Kencana, 2017.

[3] R. Grishman, “Information Extraction,” IEEE Intell Syst, vol. 30, no. 5, pp. 8–15, Sep. 2015, doi: 10.1109/MIS.2015.68.

[4] “Speech and Language Processing.” https://web.stanford.edu/~jurafsky/slp3/ (accessed May 26, 2022).

[5] A. Goyal, M. Kumar, and V. Gupta, “Named Entity Recognition: Applications, Approaches and Challenges”.

[6] S. Malmasi, A. Fang, B. Fetahu, S. Kar, and O. Rokhlenko, “MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition,” Aug. 2022, doi: 10.48550/arxiv.2208.14536.

[7] T. Al-Moslmi, M. Gallofre Ocana, A. L. Opdahl, and C. Veres, “Named Entity Extraction for Knowledge Graphs: A Literature Overview,” IEEE Access, vol. 8, pp. 32862–32881, 2020, doi: 10.1109/ACCESS.2020.2973928.

[8] “7 NLP Techniques for Extracting Information from Unstructured Text using Algorithms | Width.ai.” https://www.width.ai/post/extracting-information-from-unstructured-text-using-algorithms (accessed Jan. 16, 2023).

[9] G. Popovski, B. K. Seljak, and T. Eftimov, “A Survey of Named-Entity Recognition Methods for Food Information Extraction,” IEEE Access, vol. 8, pp. 31586–31594, 2020, doi: 10.1109/ACCESS.2020.2973502.

[10] C. Chantrapornchai and A. Tunsakul, “Information extraction on tourism domain using SpaCy and BERT,” ECTI Transactions on Computer and Information Technology, vol. 15, no. 1, pp. 108–122, Apr. 2021, doi: 10.37936/ecti- cit.2021151.228621.

[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” accessed May 26, 2022. [Online]. Available: https://arxiv.org/abs/1810.04805

[12] M. A. Maulana, M. A. Bijaksana, and A. F. Huda, “Entity Recognition for Quran English Version with Supervised Learning Approach,” socj.telkomuniversity.ac.id, doi: 10.21108/indojc.2019.4.3.362.

[13] K. Maharana, S. Mondal, and B. Nemade, “A review: Data pre-processing and data augmentation techniques,” Global Transitions Proceedings, vol. 3, no. 1, pp. 91–99, Jun. 2022, doi: 10.1016/J.GLTP.2022.04.020.

[14] K. Kurniawan and A. F. Aji, “Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging,” doi: 10.1109/IALP.2018.8629236.

[15] R. M. A. K., M. A. Bijaksana, et al., “Person entity recognition for the Indonesian Qur’an translation with the approach hidden Markov model-Viterbi,” Procedia Computer Science, 2019. Accessed: May 26, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050919310786
