Retno Diah Ayu Ningtias, Copyright © 2023, MIB, Page 536

People Entity Recognition for the English Quran Translation using BERT

Retno Diah Ayu Ningtias*, Moch. Arif Bijaksana

Informatics, School of Computing, Telkom University, Bandung, Indonesia
Email: 1,*[email protected], 2[email protected]

Correspondence Author Email: [email protected]

Abstract−The Quran is the holy book of Muslims all over the world. It has therefore been translated not only into Indonesian but also into many other languages, including English. The Quran comprises thousands of verses, each covering different topics and entities, and readers sometimes find its contents difficult to understand and study. To make this easier, information is extracted and various entities in the Quran, such as human entities, are identified. A prerequisite for extracting human entities is first extracting the information related to those entities themselves, because it supports the search process, particularly searching for the names of people in the Quran.

The extraction of human entities is commonly known as Named Entity Recognition (NER). With NER, important entities such as people's names, group names, and other entities in a sentence or verse of the Quran can be recognized automatically.

Research on the English translation of the Quran is currently not widespread. Therefore, in this research, we build an information extraction system model for human entities based on a pre-trained deep learning model called Bidirectional Encoder Representations from Transformers (BERT). The dataset consists of 19473 tokens and 720 entities taken from the website tanzil.net. The development of the model shows that BERT can be used for NER information extraction on the English translation of the Quran, obtaining an F1-score of 53%.

Keywords: Quran; Information Extraction; Named Entity Recognition; Extraction of Human Entities; BERT

1. INTRODUCTION

The Quran contains 6,200 verses, spread across 114 chapters in 30 sections [1]. The verses carry various meanings and interpretations. The Quran is believed to have been revealed directly by God through the angel Gabriel to the Prophet Muhammad and passed down to all humanity through the generations without alteration [2]. The Quran serves as a guide and a source of education for those who practice and study it. In education, the Quran is often used as research material to make it easier for new learners to understand and study its content in depth. One way to study the Quran is through its translations. The translations contain various entities, one of which is the human entity. One way to understand and identify these entities is through Information Extraction (IE).

Information Extraction (IE) is the process of extracting relevant information from a text. This process can include stages such as text cleaning, entity recognition, and pattern matching. IE extracts specific information related to a chosen topic from sentences or text, and it can be applied to structured, semi-structured, or unstructured text. IE can also be used to convert unstructured information in text into structured data, making it possible to fill relational databases and to process the data further in other systems [3]. This allows for more efficient and effective data processing and makes data easier to access and analyze [4].

Information Extraction relies on the recognition of named entities, also known as Named Entity Recognition (NER) [5]. NER is the first step toward information extraction. The relationship between the two is that NER is part of Information Extraction, where NER aims to extract named entities from the text as relevant information. In the phrase "Named Entities," the word "named" is used to limit the assignment to only entities [6].

In this way, entities are first recognized as belonging to one of several categories, such as location (LOC), person (PER), or organization (ORG) [7].

Named entity recognition is also used to evaluate a piece of text and identify the different entities in it; this applies not only to token categories but also to variable-length phrases. The model takes into account the beginning and end of each relevant phrase according to the classification category on which it is trained.

After the information category is recognized, information extraction extracts the information and creates a machine-readable document, which can then be processed by algorithms to extract meaning.

Research related to NER has been a major topic within research on Information Extraction (IE) [8]. There are various methods used to solve NER, such as Supervised Learning, Unsupervised Learning, Transfer Learning, Hybrid Systems, Rule-based Systems, and Transformers [9]. The choice of method depends on the availability of data, the complexity of the problem, and computational constraints. However, as technology advances and larger datasets become available, deep learning methods such as BERT and Transformers are becoming the preferred approach. A study on BERT for NER was conducted by Chantana Chantrapornchai and Aphisit Tunsakul for the tourism industry, using information about restaurants, hotels, shopping, and tourism. Their experiments on review text in the tourism domain show how to build a model that extracts desired entities such as name, location, or facilities, as well as relationship types. The accuracy of BERT on the named entity recognition test set reached up to 99% [10].

This is the basis on which the author builds a system for extracting human entities from the English translation of the Quran using the BERT method. BERT is a new language representation model designed to train


deep, bidirectional representations of unlabeled text. BERT is conceptually simple and empirically powerful.

BERT also has very competitive performance for English NER [11]. For human entity extraction, however, the focus is only on identifying human entities such as people's names or words representing descriptions of human entities, such as "people". Several previous studies have performed this task; one of them extracted human entities from the English Quran corpus using the Support Vector Machines (SVM) method, resulting in an accuracy rate of 75% [12].

The goal of building this system is to make the Quran easier to understand. The input of the human entity extraction system is the Quran with its English translation, and the output is the human entities extracted by the system. The system focuses only on retrieving the names of people entities in the Quran. For clarity, consider Surah Al-Baqarah verse 82: "And those who believe and do good deeds, they are the inhabitants of heaven. They will remain in it." In this translation, the system should recognize the words 'those who believe' as a human entity, so the output is 'those who believe'. The data we obtained is still raw, unstructured text, so preprocessing must be done first. At the end of the research, an evaluation is performed by calculating the F1-score, which serves as the reference for the performance of the NER system model that has been built.

2. RESEARCH METHODOLOGY

2.1 System Flow

In this research, the BERT model is used to extract human entities from the English translation of the Quran. The model is trained using the English Corpus Quran dataset for the training and evaluation processes. Before training, the dataset is pre-processed and divided into two sets, one for training the model and the other for evaluating it. After training, the model is tested by predicting labels on new data that was not seen during the training phase. The test results are then analyzed to determine the model's ability to extract human entities. Figure 1 shows the testing process from start to finish, built for Named Entity Recognition (NER) specifically for human entities and people's names in the English translation of the Quran using the BERT method.

Figure 1. System Design

2.2 Data Pre-processing

Before using the BERT model to classify the entities of a token, it is necessary to perform preprocessing on the data first [13]. The data is preprocessed before it enters the training and testing phase. Preprocessing is done using the following techniques:


1. Punctuation Removal

This is the process of removing punctuation marks found in a sentence, such as '.', '?', ';', etc.

2. Tokenization

Tokenization is the process of breaking a sentence down into smaller units called tokens, where a token here refers to a word.

3. POS-Tagging

POS tagging (part-of-speech tagging) is an NLP task that labels each word in a sentence with its function within the sentence structure of a specific language, i.e., its part of speech. Examples of NLP tasks that build on POS tagging include NER, machine translation, and constituent parsing [14]. Using the POS tagging feature can also improve the performance of the best-performing model [15].
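The three preprocessing steps above can be sketched in plain Python (a minimal illustration, not the authors' actual pipeline; the POS-tagging step is only indicated in a comment, since it would require an external tagger such as NLTK's `pos_tag`):

```python
import string

def preprocess(sentence):
    """Apply punctuation removal and tokenization to one verse."""
    # 1. Punctuation removal: drop '.', '?', ';', ',' and the rest of string.punctuation
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenization: split the cleaned sentence into word tokens
    tokens = cleaned.split()
    # 3. POS tagging would normally follow, e.g. nltk.pos_tag(tokens),
    #    yielding pairs such as ("Jesus", "NNP"); omitted here to keep
    #    the sketch dependency-free.
    return tokens

print(preprocess("And We gave Jesus, the son of Mary,"))
# → ['And', 'We', 'gave', 'Jesus', 'the', 'son', 'of', 'Mary']
```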

2.3 Data Split

After preprocessing, the researcher manually labels the tokens in the dataset, making it ready for use. To train and test the model, two separate, non-overlapping sets of data are required. The dataset is therefore divided into two parts: the training data, which is 80% of the total, and the test data, which is 20%.
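The split can be sketched as follows (a simple sequential 80/20 cut for illustration; in practice a shuffled split, e.g. scikit-learn's `train_test_split`, would be typical):

```python
def train_test_split(rows, train_ratio=0.8):
    """Split labeled rows into non-overlapping train (80%) and test (20%) sets."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

rows = list(range(100))  # stand-in for the labeled tokens of the dataset
train, test = train_test_split(rows)
print(len(train), len(test))  # → 80 20
```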

2.4 BIO Format

The BIO format, short for Beginning-Inside-Outside, is a tagging format used in NER (Named Entity Recognition). The prefix B- before a tag indicates that the tag is at the beginning of a chunk, the prefix I- indicates that the tag is inside a chunk, and the tag O indicates that the token is not part of an entity.
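A minimal sketch of assigning BIO labels to a tokenized verse (the helper function and the span indices are illustrative, not from the paper):

```python
def bio_tags(tokens, entity_spans, tag="PER"):
    """Assign BIO labels: B- at chunk start, I- inside the chunk, O outside."""
    labels = ["O"] * len(tokens)
    for start, end in entity_spans:          # end index is exclusive
        labels[start] = f"B-{tag}"
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"
    return labels

tokens = ["But", "they", "who", "believe", "and", "do", "righteous", "deeds"]
print(bio_tags(tokens, [(1, 4)]))
# → ['O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O']
```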

2.5 BERT

BERT is a state-of-the-art language model that learns deep, bidirectional representations of text by looking at the context before and after each word in all layers of the model. BERT is conceptually simple and empirically powerful, and it has very competitive performance for English NER [11]. BERT, which is designed to perform pre-training for language models, has been proven to effectively improve many natural language processing tasks; the result of pre-training allows BERT to be fine-tuned with just one additional output layer [11]. Fine-tuning approaches such as transformers introduce task-specific parameters and can be trained on specific tasks simply by fine-tuning all pre-trained parameters. The goal is the same: to use a language model to learn general language representations. The BERT model is composed of multiple layers of transformer encoders that process the input in both directions. The way BERT works is divided into two parts:

1. Pre-training. In this process, the BERT model is trained on unlabeled data. There are two training tasks for BERT on unlabeled data:

a. Masked LM: some input tokens are randomly masked, with the goal of predicting the vocabulary id of the obscured original token.

b. Next Sentence Prediction (NSP): a prediction of the next sentence. In this task, pre-training is done by selecting a sentence X and a sentence Y, where sentence Y has a 50% chance of being the sentence that actually follows sentence X (labeled "isNext") and a 50% chance of being a random sentence from the corpus (labeled "notNext").

2. Fine-tuning. The trained parameters are then fine-tuned for downstream tasks, such as text classification, question answering, etc. Figure 2 shows the use of the BERT model for NER.
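The Masked LM objective can be illustrated with a minimal simulation (plain Python; the real BERT recipe also replaces some selected tokens with random or unchanged tokens rather than always using [MASK], which is omitted here):

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def mask_for_mlm(tokens, rng):
    """Randomly mask ~15% of tokens; the model must predict the originals."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_PROB:
            masked.append(MASK)
            targets[i] = tok       # original token the model should recover
        else:
            masked.append(tok)
    return masked, targets

tokens = "the path of those upon whom you have bestowed favor".split()
masked, targets = mask_for_mlm(tokens, random.Random(0))
```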

Figure 2. BERT for NER


2.6 Evaluation Metrics

The system evaluation is meant to determine how accurately the system classifies. In this research, three main measurable metrics are used to measure the performance of each model built. Evaluation is done by calculating the F1-score, the harmonic mean of precision and recall, which measures how well the model recognizes the actual entities. The F1-score is a suitable metric for NER tasks, as it takes both precision and recall into account in a single value. A named entity prediction is considered correct if the entity extracted by the system matches the actual entity in the data. Measuring F1 requires precision and recall; the F1 value in this calculation is used as a reference for the performance of the NER model built in this research. Precision is the percentage of named entities found by the system that are correct, while recall is the percentage of the actual named entities that the system finds.
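These metrics can be computed as follows (an entity-level sketch; the exact matching scheme used in the paper's evaluation is an assumption):

```python
def precision_recall_f1(true_entities, pred_entities):
    """Entity-level scores: a prediction counts only if it exactly matches a gold entity."""
    tp = len(true_entities & pred_entities)          # true positives
    precision = tp / len(pred_entities) if pred_entities else 0.0
    recall = tp / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The paper's reported precision and recall give its reported F1:
p, r = 0.63, 0.46
print(round(2 * p * r / (p + r), 2))  # → 0.53
```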

3. RESULT AND DISCUSSION

In this research, the dataset used is the English translation of the Quran, with 19474 tokens selected from the Surahs Al-Fatihah, Al-Baqara, and Al-Imran. The dataset is divided into two parts: training data and test data. The tagging scheme used is the BIO format, with the dataset stored in the same format as the NER CoNLL-2003 dataset. This research uses only the PER (Person) entity tag, which represents a person entity; other tokens are denoted by the letter O (Other). The dataset is made up of individual words or phrases (entities) and labels that indicate whether each entity is the beginning (Begin), a continuation (Inside), or not part of a named entity (Outside), using the BIO notation. The dataset also includes the POS features for each word, obtained in the pre-processing stage.
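Reading such a CoNLL-style dataset can be sketched as follows (the tab-separated word/POS/BIO column layout is an assumption for illustration; the actual file may use a different delimiter or extra columns):

```python
def parse_rows(lines):
    """Parse CoNLL-style rows: one token per line with word, POS tag, BIO label."""
    dataset = []
    for line in lines:
        word, pos, bio = line.split("\t")
        dataset.append({"word": word, "pos": pos, "bio": bio})
    return dataset

rows = parse_rows(["they\tPRP\tB-PER", "who\tWP\tI-PER", "believe\tVBP\tI-PER"])
print(rows[0])  # → {'word': 'they', 'pos': 'PRP', 'bio': 'B-PER'}
```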

The following are the results of the preprocessing performed on the data. Table 1 shows the stage where the text still contains symbols or punctuation marks, which are then removed to make the system easier to build.

Table 1. Example of punctuation removal

Input:  And We gave Jesus, the son of Mary,
Output: And We gave Jesus the son of Mary

From Table 2, it can be seen that the purpose of this process is to break each sentence into word tokens, separated here by "," followed by a space.

Table 2. Example of tokenization

Input:  And We gave Jesus, the son of Mary,
Output: And, We, gave, Jesus, the, son, of, Mary

From Table 3, it can be seen that the purpose of this process is to label words according to the rules.

Table 3. Example of POS tagging

Word   POS Tag
Jesus  NNP
the    DT
son    NN
of     IN
Mary   NNP

Table 4 shows the structure of the dataset used, taken from a portion of a verse in the English translation of the Quran. The dataset used can be found at: https://bit.ly/datasetNER

Table 4. Example of dataset

Chapter  Verse  Word        POS Tag  BIO Label
2        82     But         CC       O
                they        PRP      B-PER
                who         WP       I-PER
                believe     VBP      I-PER
                and         CC       O
                do          VBP      O
                righteous   JJ       O
                deeds       NNS      O
                those       DT       O
                are         VBP      O
                the         DT       O
                companions  NNS      B-PER
                of          IN       I-PER
                paradise    NNP      I-PER

After the dataset is collected, pre-processing is performed on it, including splitting the data into training and testing sets. The final step is to evaluate the system that has been built. From the research conducted using the BERT method to extract human entities from the English translation of the Quran, Table 5 shows the results obtained from the testing: the BERT method produces an F1-score of 53%.

Table 5. Performance Results of the BERT Method

Model  Feature    Precision  Recall  F1 Score
BERT   BERT Base  0.63       0.46    0.53

Additionally, to show the results of the BERT model in classifying the labels B-PER, I-PER, and O, see Table 6 for correct predictions and Table 7 for incorrect predictions.

Table 6. Example of results from the BERT model showing correct predictions

Word       True Label  BERT Prediction
But        O           O
they       B-PER       B-PER
who        I-PER       I-PER
believe    I-PER       I-PER
and        O           O
do         O           O
righteous  O           O
deeds      O           O

Table 7. Example of results from the BERT model showing wrong predictions

Word      True Label  BERT Prediction
The       O           O
path      O           O
of        O           O
those     B-PER       B-PER
upon      I-PER       I-PER
whom      I-PER       I-PER
you       I-PER       O
have      I-PER       O
bestowed  I-PER       O
favor     I-PER       O

According to Table 6, the model can predict each of these entities correctly because such entities appear most frequently in the training data. However, as shown in Table 7, the model cannot correctly predict some tokens that are part of longer multi-word entities.

Based on the results obtained from the tests and the analysis performed, it can be concluded that the BERT method can be used for the development of an entity recognition system for the Quran, but human proofreading and correction by experts are needed when assigning entity labels to the dataset in order to reduce labeling errors.

4. CONCLUSION

In this research, we used the BERT method to extract human entities from the English translation of the Quran.

Based on the results of the experiments conducted, we found that deep learning-based models can be used for extracting human entities. BERT's difficulty in extracting human entities from the English translation of the Quran resulted in an F1-score of 53%. We also found that the model had difficulty classifying nested phrases, indicating that its performance in entity extraction is not yet optimal. For further research, due to the lack of training


data, which causes low performance when extracting human entities, adding training data can improve the performance of the model. Alternative training strategies can also be applied to improve the model's overall performance. In this research, entities were limited to names and people only; further research can extend the entity levels to include nested entities so that better and more complex entities can be extracted. The authors therefore suggest that future researchers conduct experiments with more datasets in order to produce a better system for extracting human entities from the English translation of the Quran.

ACKNOWLEDGMENT

We are very grateful to Muhammad Aris Maulana, who uploaded an open-source dataset of the English translation of the Al Quran on his personal GitHub.

REFERENCES

[1] S. Hossein Nasr, The Study Quran, 1st ed. New York: Harper One, 2015.

[2] A. Drajat, Ulumul Qur’an Pengantar Ilmu-Ilmu Al-Qur’an, 1st ed. Depok: Kencana, 2017.

[3] R. Grishman, “Information Extraction,” IEEE Intell Syst, vol. 30, no. 5, pp. 8–15, Sep. 2015, doi: 10.1109/MIS.2015.68.

[4] “Speech and Language Processing.” https://web.stanford.edu/~jurafsky/slp3/ (accessed May 26, 2022).

[5] A. Goyal, M. Kumar, and V. Gupta, “Named Entity Recognition: Applications, Approaches and Challenges”.

[6] S. Malmasi, A. Fang, B. Fetahu, S. Kar, and O. Rokhlenko, “MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition,” Aug. 2022, doi: 10.48550/arxiv.2208.14536.

[7] T. Al-Moslmi, M. Gallofre Ocana, A. L. Opdahl, and C. Veres, “Named Entity Extraction for Knowledge Graphs: A Literature Overview,” IEEE Access, vol. 8, pp. 32862–32881, 2020, doi: 10.1109/ACCESS.2020.2973928.

[8] “7 NLP Techniques for Extracting Information from Unstructured Text using Algorithms | Width.ai.” https://www.width.ai/post/extracting-information-from-unstructured-text-using-algorithms (accessed Jan. 16, 2023).

[9] G. Popovski, B. K. Seljak, and T. Eftimov, “A Survey of Named-Entity Recognition Methods for Food Information Extraction,” IEEE Access, vol. 8, pp. 31586–31594, 2020, doi: 10.1109/ACCESS.2020.2973502.

[10] C. Chantrapornchai and A. Tunsakul, “Information extraction on tourism domain using SpaCy and BERT,” ECTI Transactions on Computer and Information Technology, vol. 15, no. 1, pp. 108–122, Apr. 2021, doi: 10.37936/ecti- cit.2021151.228621.

[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” accessed May 26, 2022. [Online]. Available: https://arxiv.org/abs/1810.04805

[12] M. A. Maulana, M. A. Bijaksana, and A. F. Huda, “Entity Recognition for Quran English Version with Supervised Learning Approach,” socj.telkomuniversity.ac.id, doi: 10.21108/indojc.2019.4.3.362.

[13] K. Maharana, S. Mondal, and B. Nemade, “A review: Data pre-processing and data augmentation techniques,” Global Transitions Proceedings, vol. 3, no. 1, pp. 91–99, Jun. 2022, doi: 10.1016/J.GLTP.2022.04.020.

[14] K. Kurniawan and A. F. Aji, “Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging,” doi: 10.1109/IALP.2018.8629236.

[15] R. M. A. K., M. A. Bijaksana, et al., “Person entity recognition for the Indonesian Qur’an translation with the approach hidden Markov model-Viterbi,” Procedia Computer Science, 2019. Accessed: May 26, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050919310786
