Deep Learning Model for Answering Why-Questions in Arabic

Tahani H. Alwaneen, Aqil M. Azmi
[email protected], [email protected]
Department of Computer Science, College of Computer Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Article Information
Submitted: 7 Apr 2023
Reviewed: 13 Apr 2023
Accepted: 27 Apr 2023

Abstract
The subfield of natural language processing (NLP) known as question answering (QA) involves providing answers to questions posed in natural language. Answering “why” questions has long been a challenging task for QA systems, given the complexity of the reasoning involved. In this paper, we propose a deep learning model for answering “why” questions in Arabic.
Recent advances in neural network models have yielded promising results across a range of tasks, particularly with the integration of attention and memory mechanisms. Our proposed model is based on the dynamic memory network (DMN), an architecture that utilizes attention and memory mechanisms to locate and extract relevant information for answering a question. We evaluate the performance of our DMN-based model in answering Arabic “why” questions using the LEMAZA dataset, achieving an F-score of 78.61%. Our findings suggest that DMN-based models hold promise for addressing the challenge of answering “why” questions in Arabic and other languages.
Keywords
Arabic; Question-answering; Why question-answering; Deep-learning; Dynamic memory network.
A. Introduction
Question answering (QA) is a branch of natural language processing (NLP) that seeks to answer a natural language question posed by the user. Various types of question answering systems (QASs) are classified based on the type of questions they answer, such as factoid, non-factoid, and yes/no systems.
Factoid questions typically require a named entity answer, like a name or location, while non-factoid questions are more intricate and require detailed answers involving reasoning. Examples of non-factoid questions include “why” and
“how” questions, which are particularly challenging for NLP systems as they require understanding the context and recalling the information. Table 1 lists some question types with their expected answers and example questions.
Table 1. Taxonomy of questions
Type         Kind of expected answer        Example question
Factoid      Named entity (e.g., person)    What is the capital of Saudi Arabia?
List         Enumeration                    List the best three restaurants.
Definition   Definition                     What is deep learning?
Non-factoid  Explanation, reason            Why do we pay zakat?
Yes/no       Yes or No                      Makkah is a city in Saudi Arabia?
The development of QASs for non-factoid questions, particularly in Arabic, is a complex task that poses challenges such as the limited availability of training data and linguistic complexity. Most existing Arabic QASs were designed to answer factoid questions requiring brief answers; very few systems have been developed to handle non-factoid questions.
In recent years, the field of QA has witnessed significant progress with the advent of deep learning. Specifically, recurrent neural networks (RNNs) [1] have emerged as a popular choice for generating latent representations of natural language text passages, eliminating the need for feature extraction methods such as part-of-speech tagging, parsing, and named entity recognition. RNN-based models require less pre-processing and have shown promising results, matching or even surpassing other approaches. Among these models, the current state-of-the-art system is the dynamic memory network (DMN) proposed by Xiong et al. [2].
The key to the success of the DMN lies in its unique combination of memory and attention mechanisms. DMN employs an iterative process to generate the answer to the input question, enabling the model to focus on both the input context and the question effectively. This approach enables the DMN to capture the relevant information from the input and construct a sound answer.
In this work, we use the DMN to answer specifically “why” questions in Arabic. We test the system on the LEMAZA dataset [3, 4], achieving an F-score of 78.61%.
The remainder of the paper is organized as follows. Section B covers the characteristics of the Arabic language and the challenges it poses to QASs. In Section C, we focus on systems proposed to answer non-factoid questions in Arabic. The details of the DMN and its internal modules are presented in Section D. In Section E, we present and discuss the results of the experiments. Finally, we conclude in Section F.
B. Arabic Language and the Challenges Posed
Arabic is one of the six official languages of the United Nations. It is the liturgical language of about 1.8 billion Muslims, of whom 450 million are Arabs [5]. As Arabic language content on the internet continues to grow and user demand for Arabic QASs increases, research in this field is becoming increasingly important. However, progress in Arabic QAS research is still in its early stages when compared to advancements in English and Romance languages [6].
One of the challenges in Arabic language processing is the absence of capitalization, which makes named entity recognition more difficult [5, 9]. An additional challenge arises from the unconventional transliteration of non-Arabic names; the authors of [10] reported fifteen distinct spellings of the name “Condoleezza” (former US Secretary of State) in Arabic. Additionally, the absence of diacritic signs in Arabic writing leads to high ambiguity, resulting in further difficulties in processing questions and extracting answers, particularly in factoid question answering systems [5, 8, 9, 10]. Furthermore, Arabic is a highly inflectional language, with prefixes such as articles, prepositions, and conjunctions adding to the complexity of answering Arabic questions, since we are often required to find the root of a given word to extract relevant documents and words for answering the question [5, 8].
Another challenge in Arabic NLP is the scarcity of linguistic resources, including tools, corpora, and dictionaries, making work in this field more challenging [9].
C. Related Works
For a comprehensive survey of Arabic QAS, readers are directed to consult [5]. In this section, we review the QASs designed to answer non-factoid questions in Arabic. Currently, there are very few such works.
Azmi et al. [3] introduced LEMAZA (the Arabic word for “why”), an Arabic QAS designed specifically to answer “why” questions. It uses the Rhetorical Structure Theory (RST) approach, applying five rhetorical relations: Interpretation, Justification, Result, Explanation, and Base. It was tested using 110 questions from a dataset of 700 articles collected from the OSAC corpus [7]. The system achieved a recall of 72.73% and a precision of 79.21%. In [4], the same authors compared their system against a baseline approach using a subset of 100 questions. The baseline approach is a general method used to address various types of questions, including factoid questions. To handle “why” questions, the authors customized their system to rank relevant passages retrieved through a combination of tf-idf (term frequency - inverse document frequency) and keyword frequency. The authors reported that the performance of the baseline method was 68% for both metrics (recall and precision), while their system, LEMAZA, achieved 71% and 78% for the same metrics, respectively.
Ahmed et al. [16] introduced an Arabic QA system that answers “how” and “why” questions. Their system comprises several stages: question analysis, question expansion, document retrieval, and answer extraction. To retrieve relevant documents from the corpus, the authors utilized tf-idf weighting.
The performance of their system was evaluated on 40 “how” questions and an equal number of “why” questions. The authors reported a recall of 52% and precision of 61% for “how” questions, while for “why” questions, the system achieved a recall of 62% and precision of 67%.
AL-Khawaldeh [13] proposed EWAQ, an entailment-based Arabic QAS for answering “why” questions that uses entailment metrics to rank answers extracted from search engines. EWAQ was evaluated on a set of 250 questions selected by native Arabic speakers and achieved an overall accuracy of 68.53%.
Sadek and Meziane [14] developed a new text parser for answering “how” and “why” questions using RST with three relations: Result, Reason, and Interpretation. The system answered 61 out of 90 questions correctly, achieving 67.7% accuracy with an MRR of 0.62.
Regarding deep learning techniques, SOQAL [15] is an open-domain Arabic QAS that includes a retrieval module, a machine reading comprehension module, and a ranking module. The second module employs QANet and BERT and achieves an F-score of 27.6%.
D. Proposed Model
In this section, we explain the DMN and its modules in detail, as presented by Kumar et al. [12]. The DMN is a neural network model that processes the input context together with the question, iterating over several passes to build an episodic memory from which the answer is generated. As illustrated in Figure 1, the DMN is composed of four modules: the input module, the question module, the episodic memory module, and the answer module. Each module consists of an RNN optimized for the corresponding sub-task. We briefly cover each module below. We use the DMN to answer “why” questions.
Figure 1. General architecture of dynamic memory network (DMN) [12].
Input module
The input module encodes the sentences of the input context with an RNN. The input to the RNN is the sequence of word embeddings. The hidden state at time t is updated as h_t = RNN(L[w_t], h_(t−1)), where L is the embedding matrix and w_t is the index of the t-th word among the context tokens.
Question module
Since the question is posed in natural language, this module encodes it into a vector representation Q that can be used by the episodic memory module. The hidden state of the GRU encoder is updated at each time step t as q_t = GRU(L[w_t], q_(t−1)), where L and w_t are as in the input module. The question vector Q is the last hidden state of this GRU.
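To make the two encoders concrete, the following PyTorch sketch shows one way they might be implemented. The class name, the use of GRUs for both modules, and the dimensions (taken from Table 2) are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DMNEncoders(nn.Module):
    """GRU encoders for the input and question modules (illustrative sketch)."""

    def __init__(self, vocab_size, embed_size=300, hidden_size=150):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)  # embedding matrix L
        # One GRU encodes the context, another encodes the question.
        self.input_gru = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.question_gru = nn.GRU(embed_size, hidden_size, batch_first=True)

    def forward(self, context_ids, question_ids):
        # context_ids: (batch, T_c) word indices w_t; question_ids: (batch, T_q).
        c_emb = self.embedding(context_ids)    # L[w_t] for each context token
        q_emb = self.embedding(question_ids)   # L[w_t] for each question token
        c_states, _ = self.input_gru(c_emb)    # hidden state h_t at every position
        _, q_last = self.question_gru(q_emb)   # last hidden state = question vector Q
        return c_states, q_last.squeeze(0)
```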
Episodic memory module
This module is the main part of the DMN. It is responsible for extracting the information in the context c that is relevant to the input question Q; the answer module then uses this information to produce the answer. As illustrated in Figure 2, the module operates over several passes i, and the episodic memory m is updated after each pass.
Figure 2. The episodic memory module of DMN.
Initially, the memory vector is set to the question vector, m^0 = Q. Then a gating function is applied to the sentence encodings s_t of the input context, the question vector Q, and the previous memory state, as follows:

z(c, m, q) = [c, m, q, c ∘ q, c ∘ m, |c − q|, |c − m|, c^T W^(b) q, c^T W^(b) m]

G(c, m, q) = σ(W^(2) tanh(W^(1) z(c, m, q) + b^(1)) + b^(2))

g_t^i = G(s_t, m^(i−1), Q)

where σ is the sigmoid function, ∘ denotes element-wise multiplication, and W^(1), W^(2), W^(b), b^(1), and b^(2) are trainable parameters.
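As an illustration, the gating function above can be written in a few lines of PyTorch. This is a minimal sketch in which the layer sizes and the parameter initialization are our assumptions.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Scores the relevance of a sentence encoding s_t to (m, Q) (sketch)."""

    def __init__(self, hidden_size=150):
        super().__init__()
        self.Wb = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)  # W^(b)
        z_size = 7 * hidden_size + 2                  # length of the feature vector z
        self.layer1 = nn.Linear(z_size, hidden_size)  # W^(1) and b^(1)
        self.layer2 = nn.Linear(hidden_size, 1)       # W^(2) and b^(2)

    def forward(self, c, m, q):
        # c, m, q: (batch, hidden) vectors for a sentence, the memory, the question.
        cWq = ((c @ self.Wb) * q).sum(dim=1, keepdim=True)  # c^T W^(b) q
        cWm = ((c @ self.Wb) * m).sum(dim=1, keepdim=True)  # c^T W^(b) m
        z = torch.cat([c, m, q, c * q, c * m,
                       (c - q).abs(), (c - m).abs(), cWq, cWm], dim=1)
        # G(c, m, q) = sigmoid(W^(2) tanh(W^(1) z + b^(1)) + b^(2))
        return torch.sigmoid(self.layer2(torch.tanh(self.layer1(z))))
```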
In each pass i, the episode e^i is obtained by updating the GRU's hidden state over the context sentences as follows:

h_t^i = g_t^i GRU(s_t, h_(t−1)^i) + (1 − g_t^i) h_(t−1)^i

e^i = h_T^i

where T is the number of sentences in the context.
The memory state is updated after each pass as m^i = GRU(e^i, m^(i−1)). The final memory after the last pass, which contains all the information relevant to answering the question, is sent to the answer module.
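Putting the pieces together, a hedged sketch of the full episodic memory loop is shown below; `gate_fn`, `episode_gru`, and `memory_gru` are hypothetical arguments standing in for the attention gate and the two GRU cells described above.

```python
import torch

def episodic_memory(facts, question, gate_fn, episode_gru, memory_gru, num_passes=5):
    """Run the episodic memory module for several passes (illustrative sketch).

    facts: (batch, T, hidden) sentence encodings s_t from the input module.
    question: (batch, hidden) question vector Q.
    gate_fn: the attention gate; episode_gru / memory_gru: nn.GRUCell instances.
    """
    batch, T, hidden = facts.shape
    memory = question                              # m^0 = Q
    for _ in range(num_passes):                    # one iteration per pass i
        h = torch.zeros(batch, hidden, device=facts.device)
        for t in range(T):
            s_t = facts[:, t, :]
            g = gate_fn(s_t, memory, question)     # gate g_t^i in [0, 1]
            # h_t = g * GRU(s_t, h_{t-1}) + (1 - g) * h_{t-1}
            h = g * episode_gru(s_t, h) + (1 - g) * h
        memory = memory_gru(h, memory)             # e^i = h_T;  m^i = GRU(e^i, m^(i-1))
    return memory                                  # final memory for the answer module

# Example wiring with the sizes from Table 2 (hypothetical):
#   gate_fn = AttentionGate(150)
#   episode_gru = nn.GRUCell(150, 150);  memory_gru = nn.GRUCell(150, 150)
```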
Answer module
The function of this module is to produce the answer to the input question. It receives the final memory vector M from the episodic memory module. An RNN then decodes this vector together with the question vector Q via a concatenation operation, answer = [Q; M].
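One plausible realization of the answer module is sketched below. Following Kumar et al. [12], the decoder is initialized with the memory M and each step is conditioned on the concatenation of the question vector and the previous prediction; the vocabulary size, the greedy decoding loop, and the fixed maximum length are our assumptions.

```python
import torch
import torch.nn as nn

class AnswerModule(nn.Module):
    """Decodes the final memory into answer tokens (illustrative sketch)."""

    def __init__(self, hidden_size=150, vocab_size=30000):
        super().__init__()
        # Each step consumes [previous prediction y_{t-1}; question vector Q].
        self.gru = nn.GRUCell(vocab_size + hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)  # W^(a): hidden -> word scores

    def forward(self, memory, question, max_len=30):
        batch = memory.size(0)
        a = memory                                       # a_0 = final memory M
        y = torch.zeros(batch, self.out.out_features, device=memory.device)
        tokens = []
        for _ in range(max_len):
            a = self.gru(torch.cat([y, question], dim=1), a)
            y = torch.softmax(self.out(a), dim=1)        # y_t = softmax(W^(a) a_t)
            tokens.append(y.argmax(dim=1))               # greedy choice of next word
        return torch.stack(tokens, dim=1)
```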
E. Results and Discussion
In this section, we report the results of evaluating the DMN model on answering “why” questions in Arabic. The model is trained and validated on the DAWQAS dataset [18] and tested on the LEMAZA dataset [3]; more on this below.
To evaluate the performance of the DMN in answering Arabic “why” questions, we use accuracy and the F-score (F1). Accuracy (Acc) and F1 are common metrics for evaluating QASs; they were used in the Text REtrieval Conference (TREC) series and in QASs for many languages [5].
The F1 score is computed over the individual words of the prediction against those of the ground-truth answer. It combines precision (P) and recall (R) in a single measure. Precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.
Given that A is the number of questions that were answered correctly and Q is the total number of questions, the accuracy and F1 score are given by:
Acc = A / Q

F1 = 2PR / (P + R)
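These metrics are straightforward to compute; the sketch below assumes whitespace tokenization of the predicted and ground-truth answers.

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Word-level F1 between a predicted answer and the ground-truth answer."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    shared = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)   # shared words / words in prediction
    recall = shared / len(gold_tokens)      # shared words / words in ground truth
    return 2 * precision * recall / (precision + recall)

def accuracy(num_correct: int, num_questions: int) -> float:
    """Acc = A / Q: the fraction of questions answered correctly."""
    return num_correct / num_questions
```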
We train our DMN model with the Adam optimizer [17] on the DAWQAS dataset [18], a dataset of Arabic “why” questions consisting of 3146 context-question-answer triplets. The dataset was divided into 90% (2830 triples) for training and 10% (316 triples) for validation. For testing, we use the LEMAZA dataset [3], which is composed of 110 “why” questions based on a corpus of 700 articles collected from the Open Source Arabic Corpora (OSAC) [7].
The DAWQAS dataset underwent pre-processing prior to training, which included removal of punctuation, normalization, and stemming. Hyperparameters were tuned on the validation set, and the optimized values are listed in Table 2. For word embeddings, we utilized FastText [19] pre-trained vectors.
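For illustration, a minimal sketch of such a preprocessing step is shown below. The exact normalization rules (alef, ta marbuta, and alef maqsura folding) and the commented-out loader for the FastText vectors are assumptions rather than the authors' exact pipeline; stemming is omitted for brevity.

```python
import re

def preprocess_arabic(text: str) -> str:
    """Light Arabic cleanup: strip punctuation and diacritics, normalize letters."""
    text = re.sub(r'[\u064B-\u0652]', '', text)              # remove diacritics
    text = re.sub(r'[\u0625\u0623\u0622]', '\u0627', text)   # normalize alef forms
    text = re.sub(r'\u0629', '\u0647', text)                 # ta marbuta -> ha (assumed rule)
    text = re.sub(r'\u0649', '\u064A', text)                 # alef maqsura -> ya (assumed rule)
    text = re.sub(r'[^\w\s]', ' ', text)                     # drop punctuation
    return ' '.join(text.split())

# Pre-trained FastText Arabic vectors [19] could then be loaded, e.g. with gensim
# (the file name is an assumption):
#   from gensim.models import KeyedVectors
#   vectors = KeyedVectors.load_word2vec_format('cc.ar.300.vec')
```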
Table 2. The hyperparameter setting used for training DMN
No Hyperparameter Value
1 Hidden size 150
2 Embedding size 300
3 Learning rate 0.0005
4 Number of passes 5
5 Mini batch size 20
6 Dropout rate 0.6
The DMN model was evaluated on the LEMAZA dataset, consisting of 110 “why”
questions in Arabic. The system demonstrated an accuracy of 80.91%, correctly answering 89 questions, and an F1 score of 78.61%. Table 3 displays an example of a question that DMN accurately answered.
Table 3. Example of a question in the LEMAZA dataset with the DMN predicted answer and the ground truth answer (translated from Arabic)

Question: Why do children frequently suffer from throat infections?
Predicted answer: Because of a germ, in the context of a systemic disease, or due to other causes; children are the group most exposed to this infection.
Ground truth: The child may eat unclean food, be present in crowded places such as schools and parks, and be kissed by everyone, so he is exposed to germs he may have no immunity against.
Table 4 summarizes the performance of both the DMN and the LEMAZA system [3] in answering the 110 Arabic “why” questions from the same dataset. Recall that the LEMAZA system is based on RST. Its F1 score was calculated from its reported recall and precision using the F-score formula. DMN outperformed the LEMAZA system, with a relative improvement of 5.5% in accuracy and 3.7% in F-score. Table 5 presents a sample instance in which DMN provided the correct answer while the LEMAZA system did not.
Table 4. Performance summary of DMN vs. the LEMAZA system in answering “why” questions

Model    Accuracy  F1
DMN      80.91%    78.61%
LEMAZA   76.68%    75.83%
Table 5. Example of a question where the DMN model got it right while the LEMAZA system did not (translated from Arabic)

Question: Why is childhood obesity considered a dangerous phenomenon?
DMN predicted answer: A child who suffers from obesity is more prone to health problems such as asthma and high blood pressure.
LEMAZA predicted answer: A survey found that parents' overprotectiveness contributes to the phenomenon of childhood obesity and that parents' fears end up harming children's health; a recent British survey noted that parents' keenness over their children is one of the main causes of the obesity phenomenon widely spread among today's children.
F. Conclusion
The task of answering “why” questions in natural language has been a long-standing challenge for question-answering (QA) systems. However, recent advances in neural network models, particularly the integration of attention and memory mechanisms, have shown promising results across various tasks. In this paper, we proposed a deep learning model based on the dynamic memory network (DMN) architecture to address this challenge in Arabic. We evaluated the performance of our model on the LEMAZA dataset (110 “why” questions) and achieved an F-score of 78.61%. This result demonstrates the potential of DMN-based models in addressing the complexity of “why” questions in Arabic and other languages.
For future work, we plan to further improve the system by testing it on larger datasets.
G. References
[1] D. Mandic, & J. Chambers. Recurrent neural networks for prediction: learning algorithms, architectures and stability. Wiley, 2001.
[2] C. Xiong, S. Merity, & R. Socher, “Dynamic memory networks for visual and textual question answering,” in International Conference on Machine Learning, 2016, pp. 2397–2406.
[3] A.M. Azmi & N.A. Alshenaifi, “LEMAZA: An Arabic why-question answering system,” Natural Language Engineering, vol. 23, no. 6, pp. 877–903, 2017.
[4] A.M. Azmi & N.A. Alshenaifi, “Answering Arabic why-questions: Baseline vs. RST-based approach,” ACM Transactions on Information Systems (TOIS), vol. 35, no. 1, pp. 1–19, 2016.
[5] T.H. Alwaneen, A.M. Azmi, H.A. Aboalsamh, E. Cambria, & A. Hussain, “Arabic question answering system: a survey,” Artificial Intelligence Review, pp. 1–47, 2021.
[6] S.K. Ray & K. Shaalan, “A review and future perspectives of Arabic question answering systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3169–3190, 2016.
[7] M.K. Saad & W.M. Ashour, “OSAC: Open source Arabic corpora,” in 6th International Conference on Electrical and Computer Systems, vol. 10. European University of Lefke. Lefke, North Cyprus, 2010, pp. 25–26.
[8] Z.M. Mannaa, A.M. Azmi, & H.A. Aboalsamh, “Computer-assisted i‘raab of Arabic sentences for teaching grammar to students,” Journal of King Saud University- Computer and Information Sciences, vol. 34, no. 10, pp. 8909–8926, 2022.
[9] H. Samy, E.E. Hassanein, & K. Shaalan, “Arabic question answering: a study on challenges, systems, and techniques,” International Journal of Computer Applications, vol. 181, no. 44, pp. 1–10, 2019.
[10] A.M. Azmi, & E.A. Aljafari, “Universal web accessibility and the challenge to integrate informal Arabic users: a case study.” Universal Access in the Information Society, vol. 17, pp. 131-145, 2018.
[11] M.M. Biltawi, S. Tedmori, & A. Awajan, “Arabic question answering systems: gap analysis,” IEEE Access, vol. 9, pp. 63876–63904, 2019.
[12] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, & R. Socher, “Ask me anything: Dynamic memory networks for natural language processing.” In 33rd International Conference on Machine Learning, PMLR, vol. 48, pp. 1378-1387, 2016.
[13] F.T. AL-Khawaldeh, “Answer extraction for why Arabic questions answering systems: EWAQ,” World of Computer Science and Information Technology Journal, vol. 5, no. 5, pp. 82–86, 2019.
[14] J. Sadek & F. Meziane, “A discourse-based approach for Arabic question answering,” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 16, no. 2, pp. 1–18, 2016.
[15] H. Mozannar, K.E. Hajal, E. Maamary, & H. Hajj, “Neural Arabic question answering,” in Proceedings of the Fourth Arabic Natural Language Processing Workshop, Association for Computational Linguistics, 2019.
[16] W. Ahmed & P. Babu Anto, “Answer extraction for how and why questions in question answering systems,” International Journal of Computational Engineering Research (IJCER), vol. 12, no. 6, pp. 18–22, 2016.
[17] D.P. Kingma & J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[18] W.S. Ismail & M.N. Homsi, “DAWQAS: A dataset for Arabic why question answering system,” Procedia Computer Science, vol. 142, pp. 123–131, 2018.
[19] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, & T. Mikolov, “FastText.zip: Compressing text classification models,” arXiv preprint arXiv:1612.03651, 2016.