Question Answering Chatbot using Ontology for History of the Sumedang Larang Kingdom using Cosine Similarity as Similarity Measure
Rinaldi Jasmi, Z K A Baizal*, Donni Richasdy
Informatics, School of Computing, Telkom University, Bandung, Indonesia
Email: 1[email protected], 2,*[email protected], 3[email protected]
Corresponding Author Email: [email protected]
Abstract−Information can serve as a means of learning for humans, including information about history, since history helps the younger generation appreciate the nation's culture and build national identity. The Sumedang Larang Kingdom was one of the many kingdoms in West Java, Indonesia, whose history offers much to learn from. Technological developments make more and more information available for study, so we need proper means to find the information we need. This study aims to build a Question Answering (QA) system so that the younger generation can become more familiar with the history of this kingdom. The QA system offers an information retrieval system that is easy to access and can immediately provide the answers we need. The system was built using an ontology as a knowledge base and cosine similarity to determine the similarity between user questions and the dataset. The QA system was tested with a set of questions to measure its performance; testing yielded a precision of 70% and a recall of 94%.
Keywords: Question Answering; Ontology; Sumedang Larang Kingdom; Cosine Similarity
1. INTRODUCTION
In recent years, sources of information and knowledge have grown so rapidly that we need search engines to obtain information conveniently. However, a specific query often returns far more results than we need, so we must spend much time searching for relevant data. A Question Answering (QA) system is a technology that can overcome this problem because it aims to answer user questions accurately in natural language. Compared to other search engines, a QA system provides clear answers to questions directly rather than a list of relevant documents and hyperlinks, which can increase usability and efficiency [1]. The need for QA systems is increasing day by day because they can provide short, concise, and concrete answers to questions [2].
The QA system aims to find the correct answers to user questions, both in unstructured data sets (such as message collections) and structured data sets (such as databases) [3]. In an ontology-based QA system, data is stored in an ontology because an ontology is a form of structured data. An ontology describes a conceptual representation of a particular domain and can be seen as a more sophisticated knowledge base than a relational database [3]. Ontology is a knowledge tool that is useful for coordinating and understanding the concepts of a domain [4]. Some researchers have used ontology-based approaches to solve several semantic and knowledge modeling problems because ontologies have come to be considered a powerful tool for achieving semantic interoperability [5].
In this study, an ontology-based QA system is built to provide historical information and knowledge about the Sumedang Larang Kingdom. The system answers user questions in Indonesian. History is an essential and meaningful record for humans because it preserves human experiences of the past. Sumedang Larang is a kingdom now known as a district in West Java Province centered in Sumedang. The kingdom's existence has been recorded in ancient books and manuscripts such as Carita Parahiyangan and Bujangga Manik's Notes (1473). The Sumedang Larang Kingdom is one of the many kingdoms in Indonesia, especially in West Java, that significantly influenced the development of culture in West Java. This is a strong reason for conducting this research.
This research also builds on several previous studies that have succeeded in using ontology-based QA systems to provide information on specific domains. The QAPD system is an ontology-based QA system that provides information for users in the physics domain. QAPD uses the ISM approach to help answer user questions before using specific queries to access the ontology [3]. ADANS is also an ontology-based QA system, with agriculture as its domain; it was built to be compatible with the semantic web [6]. An ontology-based QA system does not only provide answers for a specific domain: it can also be used to build a language-focused QA system. In a previous study, a QA system was built to answer questions in Arabic; it uses a dataset built with an ontology to generate an ontology dictionary, after which questions are matched against the ontology information and decomposed into keywords for a SPARQL query [7].
This study differs from QAPD in that the QA system here uses Cosine Similarity as the similarity measure, which helps answer user questions before SPARQL queries access the ontology. The results of measuring text similarity with cosine similarity guide the construction of the SPARQL queries. Unlike ADANS, which was built to match the semantic web, the QA system in this study is designed to be applied as a chatbot. Chatbots are used because of the ease of access they offer developers and users.
2. RESEARCH METHOD
2.1 Question Answering System
Question answering (QA) systems have evolved into a powerful platform for automatically answering questions asked by humans in natural language. The main task of a QA system is to extract short answers to questions asked in natural language. A QA system requires various natural language processing (NLP) techniques to produce concise and accurate answers [3]. The difference between a QA system and an Information Retrieval (IR) system is that the IR system provides a set of documents related to the user's information needs, while the QA system provides written responses instead of a set of documents [3]. Many web search engines, such as Google and Bing, integrate QA technology into their search capabilities; with this technique, search engines can now respond directly to various queries [1].
Several studies define the QA system architecture in three parts: Question Processing, Document Processing, and Answer Processing [2]. Question Processing receives input from the user in the form of a question in natural language. The question is analyzed and classified in order to identify the question type, which helps avoid ambiguity [2]. Document Processing selects relevant documents and extracts a collection of paragraphs into candidate answers depending on the focus of the question processed in Question Processing; the data is then sorted by relevance to the question [2]. Answer Processing is the part of the QA system that extracts the results of the Document Processing module to present answers. These components are also used in the QA system built in this study.
2.2 Ontology
Ontology is a technology that provides domain knowledge, and ontologies can help improve the query time of a QA system [3]. According to some researchers, an ontology provides knowledge representations that can be reused and shared between software systems; therefore, an ontology can convey knowledge to both humans and computer systems [8]. An ontology can also be considered a database, but one with a richer Knowledge Base (KB); to retrieve knowledge from an ontology, a structured query language called SPARQL was developed [9]. Building an ontology is an iterative process, whether building it from scratch or reusing an existing ontology to improve or extend it. The ontology development process involves six tasks [10]:
a. Define scopes to create well-defined terms and concepts.
b. Identify critical terms, concepts, and their relationships within the domain.
c. Establish or complete rules and axioms that describe the structural characteristics of the domain.
d. Encode the ontology using an expression language that supports ontologies, such as RDF, RDFS, or OWL.
e. Merge the created ontology with an existing ontology (if available).
f. Score the constructed ontology using general and specific scoring metrics.
In this study, we use Protege for developing the ontology. Protege is a well-known open-source ontology editor that is widely used worldwide. Protege was chosen because it provides rich functionality for creation, modification, and querying, along with solid visualization capabilities [11].
2.3 Sumedang Larang Kingdom
The Sumedang Larang Kingdom was increasingly recognized on the stage of Indonesian national history, especially in the Land of Parahiyangan, after the end of the Pajajaran Kingdom or the Sunda Kingdom in 1580.
The Sumedang Larang Kingdom, whose territory is now Sumedang Regency, is located in West Java Province. In the past, the Sumedang Larang Kingdom was one of the successor kingdoms of the Pajajaran Kingdom, or the Sunda Kingdom. When the Pajajaran Kingdom was on the verge of collapse and the Sumedang Larang Kingdom was led by Prabu Geusan Ulun, the power of the Pajajaran Kingdom was symbolically transferred to the Sumedang Larang Kingdom with the surrender of the golden crown of the Pajajaran Kingdom to King Geusan Ulun.
The Sumedang Larang Kingdom is also rich in culture and historical heritage. Mount Tampomas, Mount Lingga, Darmaraja, and Pasanggarahan are part of the Sumedang area, and many sites and heirlooms from the kingdom's era are kept in the Prabu Geusan Ulun Museum.
2.4 Chatbot
Human-Computer Interaction (HCI) is a field concerned with technologies that allow users and machines to communicate using natural language. A chatbot is one such technology for human-computer conversation, designed to convince people that they are talking to a human rather than a machine. Chatbots are widely used in various fields, including customer service, website support, and schools. According to a recent report, 80% of businesses expected to use chatbots by 2020 [12].
Chatbots can also be applied in education because, compared to traditional e-learning, they generate a more positive response from users. Learning with chatbots has several advantages, such as increasing interaction, making learning more active, and improving social skills [13]. We built our chatbot using the Telegram Bot API, an HTTP-based API for developers who want to build bots for Telegram.
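As a sketch of how a chatbot layer can reply through the Telegram Bot API, the snippet below builds (but does not send) a sendMessage request; the token and chat id are placeholders, not values from this study.

```python
import json

# Placeholder credentials; a real bot obtains its token from Telegram's @BotFather.
BOT_TOKEN = "123456:PLACEHOLDER"
API_BASE = "https://api.telegram.org/bot{token}/{method}"

def build_send_message(chat_id, text):
    """Build the URL and JSON body for a Telegram sendMessage call."""
    url = API_BASE.format(token=BOT_TOKEN, method="sendMessage")
    body = json.dumps({"chat_id": chat_id, "text": text})
    return url, body

# Answer text here is illustrative only.
url, body = build_send_message(42, "Prabu Geusan Ulun")
print(url.endswith("/sendMessage"))  # True
```

In the deployed system, this request would be POSTed over HTTPS in response to an incoming update containing the user's question.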
2.5 Term Frequency-Inverse Document Frequency
Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting technique that is commonly used because TF-IDF is a statistical method used to evaluate a word's importance for a document [14]. TF-IDF is a combination of Term Frequency and Inverse Document Frequency. TF is a measure of the number of times a word appears in a document, and IDF is a measure of the importance of a word. The TF-IDF process is carried out on the dataset that has been collected to obtain the vector values needed for the similarity measure process.
TfIdf(t, d) = Tf(t, d) × Idf(t)    (1)
Equation (1) calculates the TF-IDF value as the product of TF and IDF, where Tf(t, d) is the frequency of term t in document d, Idf(t) is the inverse document frequency of term t, d is the index of the document, and t is the index of the term.
Idf(t) = log(N / df(t))    (2)
The Idf value comes from equation (2), where N is the number of documents in the dataset and df(t) is the number of documents in which term t appears.
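Equations (1) and (2) can be sketched directly in Python; a base-10 logarithm is assumed here, which matches the worked values reported later in the paper.

```python
import math

def tf(term, doc):
    # TF part of equation (1): occurrences of `term` over the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Equation (2): log10 of (number of documents / documents containing `term`).
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

def tf_idf(term, doc, docs):
    # Equation (1): TF times IDF.
    return tf(term, doc) * idf(term, docs)

docs = [
    "nama raja pertama sumedang larang".split(),
    "nama ayah batara tuntang buana".split(),
    "nama lain batara tuntang buana".split(),
    "nama raja kedua sumedang larang".split(),
]
# "raja" occurs once in a 5-word document and in 2 of 4 documents,
# so TF = 0.2, IDF = log10(2) ≈ 0.301, and TF-IDF ≈ 0.0602.
print(round(tf_idf("raja", docs[0], docs), 4))  # 0.0602
```

The same functions reproduce the per-document weights shown in the tables of Section 3.3.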
2.6 Cosine Similarity
Cosine Similarity is a metric that calculates the similarity between two texts by comparing their vector representations [15], [16]. The cosine similarity result is a value between 0 and 1, where 0 indicates no similarity between the documents and 1 indicates that the documents are identical [15]. In this QA system, cosine similarity is used to calculate the similarity between the question entered by the user and the dataset provided. Cosine Similarity is calculated using equation (3).
Where 𝑑𝑜𝑐1 is the first text or input from the user and 𝑑𝑜𝑐2 is the second text or text from the dataset provided.
𝐴𝑖 represents the value of vector 𝑑𝑜𝑐1, and 𝐵𝑖 represents the value of vector 𝑑𝑜𝑐2.
Sim(doc1, doc2) = (doc1 ∙ doc2) / (||doc1|| ||doc2||) = (Σ_{i=1..n} A_i B_i) / (√(Σ_{i=1..n} A_i²) √(Σ_{i=1..n} B_i²))    (3)
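Equation (3) translates to a few lines of Python; the zero-norm guard is an added assumption for the case where the two texts share no weighted terms.

```python
import math

def cosine_similarity(a, b):
    # Equation (3): dot product of the vectors over the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no overlap in weighted terms: treat as dissimilar
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 6))  # 1.0 (identical)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))                      # 0.0 (orthogonal)
```

In the implemented system this computation is delegated to sklearn's cosine similarity function, which behaves equivalently on TF-IDF vectors.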
2.7 Evaluation Metrics
To evaluate the performance of the QA system, we use three standardized measures: Precision (Prec), Recall (Rec), and F-Measure (F1). Precision indicates what fraction of the answers identified by the system are correct answers. Recall indicates what fraction of the correct answers were identified by the system [3]. Precision is calculated using equation (4), and recall using equation (5).
Prec = TP / (TP + FP)    (4)
Rec = TP / (TP + FN)    (5)
F-Measure considers the two metrics together. F-Measure is a statistical measure that combines precision and recall [3]. It is calculated using equation (6).
F1 = (2 ∙ prec ∙ rec) / (prec + rec)    (6)
Table 1. Definition of TP, TN, FP, FN
     Correct answer   Detected as correct answer
TP   Yes              Yes
TN   No               No
FP   No               Yes
FN   Yes              No
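The three measures reduce to one-liners over the counts defined in Table 1. The TP/FP/FN counts below are illustrative only, since the paper reports rates rather than raw counts.

```python
def precision(tp, fp):
    # Equation (4): fraction of detected answers that are actually correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Equation (5): fraction of correct answers that the system detected.
    return tp / (tp + fn)

def f_measure(prec, rec):
    # Equation (6): harmonic mean of precision and recall.
    return 2 * prec * rec / (prec + rec)

# Hypothetical counts for illustration: 7 true positives, 3 false positives,
# 1 false negative.
p, r = precision(7, 3), recall(7, 1)
print(round(p, 2), round(r, 2), round(f_measure(p, r), 2))
```

With these hypothetical counts, precision is 0.7, recall is 0.875, and the F-measure is their harmonic mean.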
3. RESULTS AND DISCUSSIONS
In this section, the QA system that we built is described. An overview of the QA system is shown in Fig. 1. The QA system is a question-answer system that takes questions written in natural language and returns answers from the knowledge base that has been entered into the ontology. The QA system architecture consists of four components: the user interface component, the question processing component, the document processing component, and the answer processing component.
Figure 1. System Architecture
3.1 User Interface Component
The user interface is where users interact with the QA system. In this QA system, the user interface is a chatbot on the Telegram platform, through which users can enter questions in natural language (Indonesian). This component performs two processes: receiving the input question, in which the system receives a question from the user in natural language, and returning the answer, in which the user receives the answer to the question in text format.
3.2 Question Processing Component
This component processes the question that the user has entered. The question is analyzed linguistically and processed further. This component performs text pre-processing, a process that makes text data more structured through a series of stages; therefore, text pre-processing is a crucial stage in document processing [17]. In this component, text pre-processing includes case folding, stopword removal, and stemming.
Case folding is the first stage of text pre-processing. It changes all letters in a text collection to lowercase, bringing the text into a standard form. The case folding process is applied as follows:
1. Question: “Siapa Raja Pertama Kerajaan Sumedang Larang? (Who was the first king of the Sumedang Larang Kingdom?)”
2. Case folding result: “siapa raja pertama kerajaan sumedang larang (who was the first king of the sumedang larang kingdom)”
Stopword removal removes stopwords from a text. Stopwords are words that occur in natural language but are not useful to computer algorithms, such as "yang, ke, di" (which, to, in) in Indonesian. Removing stopwords reduces the dimensionality of the word space [17]. In the QA system we built, stopword removal is performed using the Sastrawi library. An example of the process is as follows:
1. Question: “siapa raja pertama pada kerajaan sumedang larang (who was the first king in the sumedang larang kingdom)”
2. Stopword removal result: “siapa raja pertama kerajaan sumedang larang (who was the first king of the sumedang larang kingdom)”
Stemming is the process of mapping related words to their stem, or returning them to their base form [17], [18]. In this QA system, the stemming process uses the Sastrawi library, as in the following example:
1. Question: “siapa raja kedua kerajaan sumedang larang (who was the second king of the sumedang larang kingdom)”
2. Stemming result: “siapa raja dua raja sumedang larang” (with “kedua” stemmed to “dua” and “kerajaan” stemmed to “raja”)
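The three steps above can be sketched as a small pipeline. The paper uses the Sastrawi library for stopword removal and stemming; here a tiny hand-picked stopword list stands in (and stemming is only noted in a comment) so the example is self-contained.

```python
# Illustrative subset of Indonesian stopwords; Sastrawi ships a full list.
STOPWORDS = {"yang", "ke", "di", "pada", "adalah"}

def case_fold(text):
    # Case folding: lowercase everything and drop the question mark.
    return text.lower().replace("?", "").strip()

def remove_stopwords(text):
    # Stopword removal: drop words that carry little meaning.
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def preprocess(text):
    # Stemming is omitted here; with Sastrawi it would additionally map
    # e.g. "kerajaan" -> "raja" and "kedua" -> "dua".
    return remove_stopwords(case_fold(text))

print(preprocess("Siapa Raja pertama pada Kerajaan Sumedang Larang?"))
# -> "siapa raja pertama kerajaan sumedang larang"
```
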
3.3 Document Processing Component
This component processes a dataset containing a collection of statements about the domain of this QA system, the Sumedang Larang Kingdom. This statement dataset is processed into documents whose text similarity with the user question obtained from Question Processing is calculated. The dataset contains texts about the Sumedang Larang Kingdom, together with a subject, an object, and keywords for each text, which are used in further processing.
In this component, the dataset goes through two processes: the Term Frequency-Inverse Document Frequency (TF-IDF) process and the similarity measure process using Cosine Similarity. TF-IDF processes the question from the user before the similarity measure is computed against the dataset provided.
a. Dataset Question
The question dataset in our QA system contains statement texts whose similarity with the user input is measured; each text is annotated with the subject, object, and keywords that are used in the Answer Processing step. A sample of the dataset used in our QA system is shown in Table 2.
Table 2. Sample data from the dataset question
Text | Subject | Object | Keyword
nama raja pertama sumedang larang (name of the first king of sumedang larang) | Pemimpin (leader) | nama (name) | raja pertama (first king)
nama ayah batara tuntang buana (name of batara tuntang buana's father) | Pemimpin (leader) | anak_dari (child_of) | batara tuntang buana
nama lain batara tuntang buana (another name of batara tuntang buana) | Pemimpin (leader) | gelar (title) | batara tuntang buana
nama raja kedua sumedang larang (name of the second king of sumedang larang) | Pemimpin (leader) | nama (name) | raja kedua (second king)
mengapa ibukota sumedang larang pindah pada masa pemerintahan Raja Wirajaya (why did the capital of Sumedang Larang move during the reign of King Wirajaya) | sejarah (history) | Detail (detail) | pindah ke cipameungpeuk (moved to cipameungpeuk)
b. Term Frequency-Inverse Document Frequency (TF-IDF)
The TF-IDF process is carried out on the collected dataset to obtain the vector values needed for the similarity measure process. TF-IDF was chosen because it is one of the best metrics for determining document frequency values. The TF-IDF algorithm assigns weights to keywords and looks for similarities between keywords and available categories; the TF-IDF score increases when words in the document match words in the data set. In this research, the question dataset is processed by TF-IDF to find the document frequency values, which are used as vector values for calculating the similarity measure. Table 2 contains 5 documents. Each document is split word by word, and the words are collected into a word data set; from Table 2, we get a data set of 11 words (see Table 3).
Table 3. Example of data that will be processed by TF-IDF
Text | Document | Length | Word data set
nama raja pertama sumedang larang (the name of the first king of sumedang larang) | Text 1 | 5 | nama, ayah, raja, pertama, kedua, lain, sumedang, larang, batara, tuntang, buana (name, father, king, first, second, other, sumedang, larang, batara, tuntang, buana)
nama ayah batara tuntang buana (the name of the father of batara tuntang buana) | Text 2 | 5 |
nama lain batara tuntang buana (another name for batara tuntang buana) | Text 3 | 5 |
nama raja kedua sumedang larang (the name of the second king of sumedang larang) | Text 4 | 5 |
The words in this data set serve as the reference for the TF-IDF calculation using equations (1) and (2).
Table 4. TF value calculation result
Word data set TF
Text 1 Text 2 Text 3 Text 4
nama (name) 0,2 0,2 0,2 0,2
ayah (father) 0 0,2 0 0
raja (king) 0,2 0 0 0,2
pertama (First) 0,2 0 0 0
kedua (second) 0 0 0 0,2
lain (other) 0 0 0,2 0
sumedang (sumedang) 0,2 0 0 0,2
larang (larang) 0,2 0 0 0,2
batara (batara) 0 0,2 0,2 0
tuntang (tuntang) 0 0,2 0,2 0
buana (buana) 0 0,2 0,2 0
Table 4 shows the TF value calculated for each document. The TF value is obtained by counting the number of occurrences of a word and dividing it by the total length of the document. For example, the word "nama" (name) appears once in Text 1, so the TF value of "nama" for Text 1 is 1 divided by 5, which is 0.2.
Table 5. DF and IDF calculation results
Word data set DF IDF
nama (name) 4 0
ayah (father) 1 0,602
raja (king) 2 0,301
pertama (first) 1 0,602
kedua (second) 1 0,602
lain (other) 1 0,602
sumedang (sumedang) 2 0,301
larang (larang) 2 0,301
batara (batara) 2 0,301
tuntang (tuntang) 2 0,301
buana (buana) 2 0,301
In Table 5, the DF value is the number of documents in which a word appears. For example, the word "nama" (name) appears in four documents: Text 1, Text 2, Text 3, and Text 4. The IDF value is obtained using equation (2), where N is the number of documents in the dataset. For example, with N = 4 (Text 1, Text 2, Text 3, and Text 4):
Idf(nama) = log(4 / 4) = 0
Table 6. TF-IDF value calculation result
Word data set TF-IDF
Text 1 Text 2 Text 3 Text 4
nama (name) 0 0 0 0
ayah (father) 0 0,1204 0 0
raja (king) 0,0602 0 0 0,0602
pertama (first) 0,1204 0 0 0
kedua (second) 0 0 0 0,1204
lain (other) 0 0 0,1204 0
sumedang (sumedang) 0,0602 0 0 0,0602
larang (larang) 0,0602 0 0 0,0602
batara (batara) 0 0,0602 0,0602 0
tuntang (tuntang) 0 0,0602 0,0602 0
buana (buana) 0 0,0602 0,0602 0
Table 6 shows the TF-IDF values for Text 1, Text 2, Text 3, and Text 4, obtained using equation (1). For example, for the word "raja" (king) in Text 1, the TF value is 0.2 and the IDF value is 0.301, so the TF-IDF value is 0.2 × 0.301 = 0.0602. The TF-IDF values of the documents are then used to perform the similarity measure using cosine similarity.
c. Cosine Similarity
The similarity measure used in the QA system is Cosine Similarity, which calculates the similarity between the user's input text and each text from the question dataset, producing a similarity value for every pair. In the implementation, the similarity measure is computed with the cosine similarity function from the sklearn library. After the user input has been processed by question processing, it is converted into a vector using the TF-IDF process (see Table 8). Once the vector values are obtained, the similarity value is computed using equation (3).
Table 7. Example of input from a user for the similarity measure
User input | User input after question processing
Siapa Nama Raja pertama Kerajaan Sumedang Larang adalah? (What is the name of the first king of the Sumedang Larang Kingdom?) | nama raja pertama raja sumedang larang (name first king king sumedang larang)
Table 8. TF-IDF calculation results on user input
Word data set TF IDF TF-IDF
nama (name) 0,167 0 0
raja (king) 0,333 0.301 0,1002
pertama (first) 0,167 0,602 0,1005
sumedang (sumedang) 0,167 0,301 0,1002
larang (larang) 0,167 0,301 0,1002
From the results of Table 8, a similarity measure will be carried out using equation (3).
Table 9. Results of dot product(user input, document)
     dot product(doc1, doc2) with user input
Text 1 0,0301
Text 2 0
Text 3 0
Text 4 0.0180
Table 9 shows dot product(user input, document), or doc1 ∙ doc2, where doc1 is the input from the user and doc2 is a text from the question dataset. Example of dot product(user input, document) using the user input as doc1 and Text 1 as doc2:
dot product(user input, Text 1) = doc1 ∙ doc2
= (0 × 0) + (0.1002 × 0.0602) + (0.1005 × 0.1204) + (0.1002 × 0.0602) + (0.1002 × 0.0602) = 0.0301
Table 10. Value calculation result ||doc1|| and ||doc2||
Vector space value (||doc1|| and ||doc2||)
Text 1 0,1592
Text 2 0
Text 3 0
Text 4 0.1042
User Input 0,2005
Table 10 shows the vector space values ||doc1|| and ||doc2||, where doc1 is the input from the user and doc2 is a text from the question dataset. Example calculation of the vector space values using the user input as doc1 and Text 1 as doc2:
||doc1|| = √(0² + 0.1002² + 0.1005² + 0.1002² + 0.1002²) = 0.2005
||doc2|| = √(0² + 0.0602² + 0.1204² + 0.0602² + 0.0602²) = 0.1592
Table 11. The results of the similarity measure
     Similarity
Text 1 0.9429
Text 2 0
Text 3 0
Text 4 0.86156
Table 11 shows the similarity values between the user input and the question dataset, obtained by dividing dot product(doc1, doc2) by the product of the user input vector space value and the question dataset vector space value. Example calculation of the similarity value using the user input and Text 1:
Sim(user input, Text 1) = 0.0301 / (0.2005 × 0.1592) = 0.9429
The similarity values in Table 11 were obtained using the user input "Siapa Nama Raja pertama Kerajaan Sumedang Larang adalah?" (What is the name of the first King of the Sumedang Larang Kingdom?). Based on these results, the text with the highest value, and thus the one most similar to the user input, is Text 1, "nama raja pertama sumedang larang" (name of the first king of Sumedang Larang).
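The worked example can be re-checked numerically. The vectors below are the user-input and Text 1 TF-IDF values from Tables 8 and 6; the tiny deviations from the published 0.0301 and 0.9429 come from rounding of intermediate values.

```python
import math

# TF-IDF vectors over the shared terms (nama, raja, pertama, sumedang, larang),
# taken from Table 8 (user input) and Table 6 (Text 1).
user = [0.0, 0.1002, 0.1005, 0.1002, 0.1002]
text1 = [0.0, 0.0602, 0.1204, 0.0602, 0.0602]

dot = sum(a * b for a, b in zip(user, text1))
norm_u = math.sqrt(sum(a * a for a in user))
norm_t = math.sqrt(sum(b * b for b in text1))
similarity = dot / (norm_u * norm_t)

print(round(dot, 4))         # 0.0302 (Table 9 reports 0.0301)
print(round(similarity, 3))  # 0.945  (Table 11 reports 0.9429)
```

Either way, Text 1 remains by far the closest document to the user input, so the ranking outcome is unchanged.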
3.4 Answer Processing Component
This component searches for the answer to the user's question, taking the answer from the knowledge base that has been built. The ontology is the knowledge base of this QA system and contains all the required data about the Sumedang Larang Kingdom. The results of the similarity measure are used in this component: the subject, object, and keyword attached to the question-dataset text with the highest similarity are retained, and they help the SPARQL query retrieve from the ontology the answer that matches the user's question.
a. Query SPARQL
SPARQL is a standard query language for accessing RDF triplestore data. SPARQL allows users to access data or information from databases or data sources mapped into RDF form, connecting the data in the RDF graph via graph pattern matching. In this QA system, the subject, object, and keyword from the question dataset help SPARQL search for the answers contained in the ontology; these words guide SPARQL toward answers that match the question entered by the user.
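The query shown in Figure 2 below is parameterized by the retrieved subject, object, and keyword. A minimal sketch of assembling it by string interpolation, with the `sumedanglarang` prefix and class/property names taken from the paper's own query:

```python
# Template mirroring the SPARQL query of Figure 2; `sumedanglarang` is the
# ontology prefix used in the paper's query.
TEMPLATE = (
    "SELECT * "
    "WHERE {{ ?subject rdf:type sumedanglarang:{subject} ; "
    "sumedanglarang:kata_kunci ?katakunci ; "
    "sumedanglarang:{object} ?object "
    "FILTER(STR(?katakunci) = '{keyword}') }}"
)

def build_query(subject, obj, keyword):
    """Fill the template with values from the best-matching dataset row."""
    return TEMPLATE.format(subject=subject, object=obj, keyword=keyword)

# Row matched for "nama raja pertama sumedang larang" (Table 2).
query = build_query("Pemimpin", "nama", "raja pertama")
print("sumedanglarang:Pemimpin" in query)  # True
```

In the running system the resulting string would be executed against the ontology through a SPARQL endpoint or an RDF library.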
Figure 2. SPARQL Query

SELECT *
WHERE { ?subject rdf:type sumedanglarang:"""+subject+""" ;
        sumedanglarang:kata_kunci ?katakunci ;
        sumedanglarang:"""+object+""" ?object
        FILTER(STR(?katakunci) = '"""+katakunci+"""') }

Fig. 2 shows the form of the SPARQL query in this QA system. The word "subject" in the query indicates the class from which data will be retrieved in the ontology, the word "object" defines the object property or relation that will be used, and the word "katakunci" (keyword) helps select the individuals that match the question entered by the user.
b. Ontology
The ontology in this QA system functions as a knowledge base containing information about the history of the Sumedang Larang Kingdom. An ontology is used because it focuses on defining data, while an ordinary database only describes data [19]. The ontology consists of builder components, namely Classes and Properties. The Class components are determined based on the standard LOM (Learning Object Metadata) survey; not all standard LOM metadata is used when defining a class, only some critical metadata.
In this QA system, the ontology has four main classes (character, place of government, history, and heritage), some of which are further divided into subclasses. The ontology's property component is divided into two: Object Properties and Data Properties.
Table 12. Object Properties used in the ontology used by the QA system
No | Property Name | Domain | Range | Character
1 | anak_dari (child_of) | tokoh (figure) | tokoh (figure) | Inverse of ayah_dari (father_of), ibu_dari (mother_of)
2 | ayah_dari (father_of) | tokoh (figure) | tokoh (figure) | Inverse of anak_dari (child_of)
3 | suami_dari (husband_of) | tokoh (figure) | tokoh (figure) | Inverse of istri_dari (wife_of)
4 | istri_dari (wife_of) | tokoh (figure) | tokoh (figure) | Inverse of suami_dari (husband_of)
5 | ibu_dari (mother_of) | tokoh (figure) | tokoh (figure) | Inverse of anak_dari (child_of)
6 | saudara_dari (sibling_of) | tokoh (figure) | tokoh (figure) | -
7 | dialami (experienced) | sejarah (history) | pemimpin (leader) | Inverse of mengalami (experience)
8 | dijadikan_ibukota (made_capital) | tempat_pemerintahan (government place) | pemimpin (leader) | Inverse of memindahkan_ke (move_to)
9 | memindahkan_ke (move_to) | pemimpin (leader) | tempat_pemerintahan (government place) | Inverse of dijadikan_ibukota (made_capital)
10 | mengalami (experience) | pemimpin (leader) | sejarah (history) | Inverse of dialami (experienced)
Table 13. Data Properties used in the ontology used by the QA system
No. Properties Name Domain Range
1 Detail (detail) sejarah (history) string
2 gelar (title) tokoh (figure) string
3 jenis_kelamin (gender) tokoh (figure) string
4 masa_jabat (term_position) pemimpin (leader) string
5 Nama (name) tokoh (figure) string
6 Tahun (year) sejarah (history) string
7 Urutan (order) pemimpin (leader) integer
Table 12 and Table 13 list the object properties and data properties used in the ontology for this QA system. Each property has its function; for example, the "anak_dari" (child_of) property in Table 12 defines a parent-child relationship between characters, and the "gelar" (title) property in Table 13 defines the title of a king.
3.5 Evaluation
The QA system we built aims to assist users by providing answers to their questions about the history of the Sumedang Larang Kingdom. In this section, the QA system is tested, and its efficiency and performance are calculated. Three standard measures are used to evaluate system performance: Precision, Recall, and F-Measure. The test uses the testing dataset provided.
a. Testing Dataset
The testing dataset is formed based on the existing question dataset and contains questions derived from each statement text in the question dataset. In total, 73 questions are used for testing. Table 14 shows an example of the contents of the testing dataset.
Table 14. Sample questions on the testing dataset
Teks (Text) | Pertanyaan (Question)
nama raja pertama sumedang larang (name of the first king of sumedang larang) | Siapa nama raja pertama Kerajaan Sumedang Larang? (What is the name of the first king of the Sumedang Larang Kingdom?)
nama ratu pertama sumedang larang (name of the first queen of sumedang larang) | Siapa nama ratu pertama Kerajaan Sumedang Larang? (What is the name of the first queen of the Sumedang Larang Kingdom?)
nama raja keempat sumedang larang (name of the fourth king of sumedang larang) | Siapa nama raja keempat Kerajaan Sumedang Larang? (What is the name of the fourth king of the Sumedang Larang Kingdom?)
nama raja kedelapan sumedang larang (name of the eighth king of sumedang larang) | Siapa nama raja kedelapan Kerajaan Sumedang Larang? (What is the name of the eighth king of the Sumedang Larang Kingdom?)
masa kerajaan sumedang dibawah kerajaan pajajaran (the period of the sumedang kingdom under the pajajaran kingdom) | Pada masa pemerintahan siapa Kerajaan Sumedang Larang dibawah Kerajaan Pajajaran? (During whose reign was the Sumedang Larang Kingdom under the Pajajaran Kingdom?)
b. Evaluation Results
Testing using equations (2), (3), and (4) gives the results shown in Table 15.
Table 15. Evaluation Results
No. Evaluation Metrics Value
1 Precision 0.70
2 Recall 0.94
3 F-Measure 0.80
From these test results, the precision of this QA system is 0.70 (70%), the recall is 0.94 (94%), and the F-measure is 0.80 (80%).
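The reported F-measure is the harmonic mean of the reported precision and recall, which can be checked directly (a quick sketch, not the paper's code):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F-measure)."""
    return 2 * precision * recall / (precision + recall)

# Values reported for this QA system
precision = 0.70  # 70% of the system's answers were correct
recall = 0.94     # 94% of the correct answers were found by the system

print(round(f_measure(precision, recall), 2))  # → 0.8
```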
4. CONCLUSION
Based on this research into building a QA system, we observe that a QA system can be applied to answer user questions by providing direct answers that suit their needs. The measured performance and efficiency of the system are: 70% of the answers identified by the system are correct (precision), 94% of the correct answers are identified by the system (recall), and 80% when precision and recall are combined using the F-measure. For future research, several things can be done to improve the performance of the QA system, such as: improving the document-similarity step by substituting other similarity measures to increase precision, expanding the scope of the ontology to cover more information about the Sumedang Larang Kingdom, and optimizing the SPARQL queries, because in this study some questions are still answered in the form of semantic links.
REFERENCES
[1] F. Zhu, W. Lei, C. Wang, J. Zheng, S. Poria, and T.-S. Chua, “Retrieving and Reading: A Comprehensive Survey on Open-domain Question Answering,” Jan. 2021, [Online]. Available: http://arxiv.org/abs/2101.00774
[2] M. A. Calijorne Soares and F. S. Parreiras, “A literature review on question answering techniques, paradigms and systems,” Journal of King Saud University - Computer and Information Sciences, vol. 32, no. 6. King Saud bin Abdulaziz University, pp. 635–646, Jul. 01, 2020. doi: 10.1016/j.jksuci.2018.08.005.
[3] A. Abdi, N. Idris, and Z. Ahmad, “QAPD: An ontology-based question answering system in the physics domain,” Soft Computing, vol. 22, no. 1, pp. 213–230, Jan. 2018, doi: 10.1007/s00500-016-2328-2.
[4] L. Yang, K. Cormican, and M. Yu, “Ontology Learning for Systems Engineering Body of Knowledge,” IEEE Transactions on Industrial Informatics, vol. 17, no. 2, pp. 1039–1047, Feb. 2021, doi: 10.1109/TII.2020.2990953.
[5] A. L. Fraga, M. Vegetti, and H. P. Leone, “Ontology-based solutions for interoperability among product lifecycle management systems: A systematic literature review,” J Ind Inf Integr, vol. 20, Dec. 2020, doi: 10.1016/j.jii.2020.100176.
[6] M. Devi and M. Dua, “ADANS: An agriculture domain question answering system using ontologies,” 2017.
[7] A. Albarghothi, F. Khater, and K. Shaalan, “Arabic Question Answering Using Ontology,” in Procedia Computer Science, 2017, vol. 117, pp. 183–191. doi: 10.1016/j.procs.2017.10.108.
[8] J. I. Single, J. Schmidt, and J. Denecke, “Knowledge acquisition from chemical accident databases using an ontology- based method and natural language processing,” Safety Science, vol. 129, Sep. 2020, doi: 10.1016/j.ssci.2020.104747.
[9] A. Arbaaeen and A. Shah, “Ontology-based approach to semantically enhanced question answering for closed domain: A review,” Information (Switzerland), vol. 12, no. 5, 2021, doi: 10.3390/info12050200.
[10] F. N. Al-Aswadi, H. Y. Chan, and K. H. Gan, “Automatic ontology construction from text: a review from shallow to deep learning trend,” Artificial Intelligence Review, vol. 53, no. 6, pp. 3901–3928, Aug. 2020, doi: 10.1007/s10462-019-09782-9.
[11] K. Schekotihin, P. Rodler, W. Schmid, M. Horridge, and T. Tudorache, “Test-driven Ontology Development in Protégé.” [Online]. Available: http://isbi.aau.at/ontodebug
[12] E. Amer, A. Hazem, O. Farouk, A. Louca, Y. Mohamed, and M. Ashraf, “A Proposed Chatbot Framework for COVID-19,” in 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference, MIUCC 2021, May 2021, pp. 263–268. doi: 10.1109/MIUCC52538.2021.9447652.
[13] D. Carlander-Reuterfelt, A. Carrera, C. A. Iglesias, O. Araque, J. F. S. Sanchez Rada, and S. Munoz, “JAICOB: A Data Science Chatbot,” IEEE Access, vol. 8, pp. 180672–180680, 2020, doi: 10.1109/ACCESS.2020.3024795.
[14] H. Yuan, Y. Tang, W. Sun, and L. Liu, “A detection method for android application security based on TF-IDF and machine learning,” PLoS ONE, vol. 15, no. 9 September, Sep. 2020, doi: 10.1371/journal.pone.0238694.
[15] K. Park, J. S. Hong, and W. Kim, “A Methodology Combining Cosine Similarity with Classifier for Text Classification,” Applied Artificial Intelligence, vol. 34, no. 5, pp. 396–411, Apr. 2020, doi: 10.1080/08839514.2020.1723868.
[16] B. il Kwak, M. L. Han, and H. K. Kim, “Cosine similarity based anomaly detection methodology for the CAN bus,” Expert Systems with Applications, vol. 166, Mar. 2021, doi: 10.1016/j.eswa.2020.114066.
[17] M. A. Rosid, A. S. Fitrani, I. R. I. Astutik, N. I. Mulloh, and H. A. Gozali, “Improving Text Preprocessing for Student Complaint Document Classification Using Sastrawi,” in IOP Conference Series: Materials Science and Engineering, Jul. 2020, vol. 874, no. 1. doi: 10.1088/1757-899X/874/1/012017.
[18] S. Chakrabarti, H. N. Saha, Institute of Electrical and Electronics Engineers. Las Vegas Section, and Institute of Electrical and Electronics Engineers, IEEE CCWC-2017 : 2017 IEEE 7th Annual Computing and Communication Workshop and Conference : 09-11 January, 2017, Las Vegas, USA.
[19] M. Uschold, “Ontology and database schema: What’s the difference?,” Applied Ontology, vol. 10, no. 3–4, pp. 243–258, Dec. 2015, doi: 10.3233/AO-150158.