EVALUATION OF RETRIEVAL EFFECTIVENESS USING CLUSTERING TECHNIQUES IN MALAY DOCUMENT RETRIEVAL
NURAZZAH ABD RAHMAN
Thesis submitted in fulfillment of the requirements for the degree of
Doctor of Philosophy
Faculty of Computer and Mathematical Sciences
April 2011
ABSTRACT
Information Retrieval (IR) deals with the representation, storage and organization of, and access to, information items. The main function of an information retrieval system is to provide users with tools to search effectively and efficiently. Although research on IR has been established for the past thirty years, research on IR using the Malay language emerged only in the mid-1990s. Cluster analysis is a multivariate analysis technique that assigns items to automatically created groups based on a calculation of the degree of association between items or groups. In cluster-based information retrieval, clustering can be applied to the terms in documents, to all documents in the corpus, to the user queries or to the retrieval results themselves. Each type of clustering can improve retrieval effectiveness. This thesis focuses on document clustering. The Malay document corpus consists of digitized Malay translated hadith text from the collections of well-known Islamic scholars, namely Sahih Muslim, Sahih Bukhari, Sunan Ibnu Majjah, Sunan At-Tirmidzi, Sunan Abu Daud and Sunan An-Nasaie. The corpus was developed by scanning, editing and proofreading the Malay text into digital form. Pre-processing of the Malay translated hadith text is required because most of the texts are in the Indonesian language. Differences in the meaning of many terms need to be clarified, and the terms converted to the Malay language, using a dictionary and human experts in both languages. Experts in the Hadith domain were consulted to ensure the reliability of the Malay translated Hadith text documents. A digitized, updated Malay thesaurus is used in the first experiment to improve the effectiveness of Malay document retrieval. For the cluster analysis, the Malay translated hadith test collection consists of 2028 documents from Sahih Bukhari, where each Hadith document contains between 13 and 2561 words.
The determination of inter-document similarity depends on both the document representation, in terms of the weights assigned to the indexing terms characterizing each document, and the similarity coefficient chosen. This thesis presents the results of applying five hierarchical agglomerative clustering techniques, namely Single Linkage, Complete Linkage, Group Average Linkage, Weighted Median Linkage and Ward's Method, using the Dice, Jaccard and Cosine similarity coefficients on the Malay corpus. The evaluation of the experiments uses redefined versions of the well-known IR metrics Recall (R), the proportion of relevant documents that are clustered, and Precision (P), the proportion of clustered documents that are relevant. The results of the first experiment show that, with the Dice similarity coefficient, Complete Linkage is the most effective and Average Linkage is the highest in precision in clustering Malay translated Hadith text documents. With the Jaccard similarity coefficient, Single Linkage is the most effective, while Ward's Method is the highest in precision. Lastly, with the Cosine coefficient, Complete Linkage gives the highest precision.
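As an illustration of the three similarity coefficients named above, the following minimal sketch computes them for two documents represented as sparse term-weight vectors. The weighted formulas used here are the standard textbook forms and are assumed, not taken from the thesis; the function names and the toy documents `doc_a` and `doc_b` are hypothetical.

```python
import math


def _dot(x, y):
    """Inner product of two sparse term-weight vectors stored as dicts."""
    return sum(w * y[t] for t, w in x.items() if t in y)


def dice(x, y):
    """Dice coefficient: 2 * x.y / (x.x + y.y)."""
    denom = _dot(x, x) + _dot(y, y)
    return 2.0 * _dot(x, y) / denom if denom else 0.0


def jaccard(x, y):
    """Jaccard coefficient: x.y / (x.x + y.y - x.y)."""
    xy = _dot(x, y)
    denom = _dot(x, x) + _dot(y, y) - xy
    return xy / denom if denom else 0.0


def cosine(x, y):
    """Cosine coefficient: x.y / sqrt(x.x * y.y)."""
    denom = math.sqrt(_dot(x, x) * _dot(y, y))
    return _dot(x, y) / denom if denom else 0.0


# Hypothetical toy documents with binary weights (illustrative terms only).
doc_a = {"solat": 1.0, "puasa": 1.0, "zakat": 1.0}
doc_b = {"solat": 1.0, "zakat": 1.0, "haji": 1.0, "wuduk": 1.0}

for name, coeff in (("Dice", dice), ("Jaccard", jaccard), ("Cosine", cosine)):
    print(name, round(coeff(doc_a, doc_b), 4))
```

The two toy documents share two terms, and the three coefficients normalize that overlap differently. This is why the choice of coefficient can change which documents a linkage method merges first.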
Therefore, Complete Linkage combined with the Cosine coefficient is run on a larger Malay Hadith corpus in the second experiment, namely Sahih Bukhari, which consists of 2028 text documents. Further testing showed that precision increases from 18% to 55% when the corpus is clustered into 100 clusters, compared with 50 and 20 clusters. This leads to the conclusion that a larger number of clusters gives higher precision than a smaller number of clusters, since each cluster then contains fewer documents. Hence, recall decreases and precision increases.
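The trade-off described above can be made concrete with a small sketch. It assumes one plausible reading of the redefined metrics: precision is the share of documents in a cluster that are relevant, and recall is the share of relevant documents that fall in that cluster. The function name and the document identifiers are hypothetical and the data are invented for illustration; they are not results from the thesis.

```python
def cluster_precision_recall(cluster, relevant):
    """Return (precision, recall) of one cluster against a relevant set."""
    cluster, relevant = set(cluster), set(relevant)
    hits = len(cluster & relevant)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical relevance judgment for one query (document IDs).
relevant = {1, 2, 3, 4}

# With few clusters, each cluster is large and mixes in irrelevant documents.
coarse_cluster = {1, 2, 3, 5, 6, 7, 8, 9}
# With many clusters, each cluster is small and more focused.
fine_cluster = {1, 2, 5}

coarse = cluster_precision_recall(coarse_cluster, relevant)
fine = cluster_precision_recall(fine_cluster, relevant)
print("coarse:", coarse)
print("fine:  ", fine)
```

In this toy case the smaller cluster has higher precision and lower recall than the larger one, which matches the direction of the effect reported for 100 clusters versus 50 and 20.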
ACKNOWLEDGEMENTS
In the name of Allah, The Most Gracious, Most Merciful. Praise be to Allah, The One and Only, for granting me strength, good health and experience throughout this study period to complete this study. It is with His ascendancy that the study is completed.
I am indebted to many individuals who are directly or indirectly responsible for making it possible to complete this thesis. My heartiest gratitude goes to my supervisor, Prof. Dr. Hjh. Zainab Abu Bakar, whose keenness, patience and trust in the research, and whose wise supervision, have benefited me greatly. Her encouragement has led to the publication of seven research articles during the period of my study, and to three awards at international innovation and technology exhibitions held locally and internationally. My sincere thanks and appreciation are extended to my second supervisor, Professor Dr Tengku Mohd Tengku Sembok from Universiti Kebangsaan Malaysia, for his helpful suggestions and continuous encouragement and support.
I would also like to thank Ustaz Ezani bin Yaakub and Ustaz Mohd Takiyudin bin Hj Ibrahim from the Center of Islamic Thoughts and Understanding, Universiti Teknologi MARA, for assisting me in the understanding and development of the Malay translated hadith test collection and the Malay Hadith relevance judgments used in this research. Not forgetting Encik Azman bin Ismail and Cik Aslina Bind Saleh from Saba Islamic Media Sdn. Bhd., Kompleks PKNS, Shah Alam, for providing Malay Hadith resources such as books and CDs for my reference.
My sincere thanks also go to my colleagues, lecturers and supporting staff of the Faculty of Computer and Mathematical Sciences, and all staff of the Centre of Graduate Studies, UiTM, for providing me with the necessary facilities. My thanks also go to the students of the Degree in Information Technology and the Degree of Computer Sciences, to Hasmiza, Nisrin, Kamariah and Norsharizan, and to many others whom it would take a long list to mention one by one. Nevertheless, thanks to them all.
I would also like to personally thank Professor Hj. Muhd Kamil Ibrahim, lecturer and ex-Director of UiTM Johor, for referring to and citing the MUTIARA HADIS search engine in his book, Travelog Dakwah: Meniti Hari Esok. His citation has greatly increased the hits of the Malay Hadith search engine, as many readers of his book have searched and used the search engine.
To my sponsor, Universiti Teknologi MARA, thank you for the financial assistance rendered throughout the research period.
I am deeply grateful to my respected father and mother whose prayers and persistent support made it much easier to accomplish this work.
Last but not least, I would like to thank my dearest husband, Mohd Haslan Abd Rahim, and my beloved children, Asma', Zaid, Ammar, Anas, Wafa', Mu'adz, Zubair and Hana', for their continuous encouragement, patience, inspiration and support during the course of this thesis.
TABLE OF CONTENTS
TITLE PAGE
AUTHOR'S DECLARATION ii
ABSTRACT iii
ACKNOWLEDGEMENTS iv
TABLE OF CONTENTS v
LIST OF TABLES x
LIST OF FIGURES xiii
CHAPTER 1: INTRODUCTION
1.1 Research Background 1
1.2 Problem Description 3
1.3 Research Objectives 5
1.4 Research Scope 6
1.5 Research Contributions 8
1.6 Organization of The Thesis 8
CHAPTER 2: LITERATURE REVIEW
2.1 Introduction 11
2.2 Previous Researches on Malay Information Retrieval 13
2.3 Cluster Analysis 20
2.3.1 Applications in Other Disciplines 21
2.3.2 Applications in Computer Sciences and Sub-Areas 28
2.3.3 Applications in Information Retrieval 33
2.4 Information Retrieval Models 39
2.4.1 Boolean Model 41
2.4.2 Vector Space Model 43
2.4.3 Probabilistic Model 46
2.5 Existing Hadith Search Engine 48
2.5.1 Hadith Search Engine On Web 49
2.5.1.1 MSA-USC Hadith Database 49
2.5.1.2 Compendium of Muslim Texts 51
2.5.1.3 IslamOnline.net (1999-2006) 54
2.5.1.4 Search Truth 55
2.5.1.5 The Muslim Internet Directory 55
2.5.1.6 Guided Ways Hadith Search Engine 57
2.5.1.7 Jabatan Kemajuan Islam Malaysia (JAKIM) 58
2.5.1.8 Summary 59
2.5.2 Hadith Software 60
2.5.2.1 Al-Bayan Version 2.01 61
2.5.2.2 Hadith Database Version 1.1 63
2.5.2.3 Selections of Hadith Version 1.0 65
2.5.2.4 Hadith Encyclopedia Version 3.0 66
2.5.2.5 The Hadith Software Version 1.0 67
2.5.2.6 Summary of Hadith Software 68
2.6 Summary 69
CHAPTER 3: TEST COLLECTION AND THEIR CHARACTERISTICS
3.1 Introduction 71
3.2 Hadith Test Collections 75
3.2.1 Corpus of Malay Translated, Hadith Documents 76
3.2.2 Stop Word List 77
3.2.3 Morphological Rules 80
3.2.4 Dictionary of Malay Root Words 80
3.2.5 Malay Natural Language Query 81
3.2.6 Malay Thesaurus 83
3.2.7 Hadith Relevant Judgment 84
3.3 Zipf's Law 85
3.3.1 Validation of Zipf's Law on Malay Language 87
3.4 Stemming Algorithm 96
3.5 Inverted Index File 98