
Development of a parallel clustering of bilingual corpora based on reduced terms


Academic year: 2024


DEVELOPMENT OF A PARALLEL CLUSTERING OF BILINGUAL CORPORA BASED ON REDUCED TERMS

LEOW CHING LEONG


THESIS SUBMITTED IN FULFILLMENT OF THE DEGREE OF MASTER OF SCIENCE

FACULTY OF COMPUTING AND INFORMATICS UNIVERSITI MALAYSIA SABAH

2015


ABSTRACT

Document clustering groups a set of documents based on their similarities. Document clustering has been studied in several works, but clustering bilingual text documents offers additional benefits to users: it helps in verifying the classification and constraints of languages, and it helps in eliminating biased language-specific usages. However, few works related to clustering bilingual documents have been reported, especially for Malay text articles. The quality of clustering bilingual text documents is highly influenced by the quality of the bag-of-words representation of the Malay text articles presented to the clustering algorithm. Hence, the aim of this study is to investigate the effects of reducing the terms used in clustering bilingual text articles in English and Malay on the quality of the clustering results. 500 news articles for both languages were retrieved manually from the Bernama archive and TheStar website. To achieve this aim, three objectives are outlined. The first objective is to improve the stemming process for the Malay language by increasing the efficiency of stemming Malay words. Improving the stemming process (0.5% error rate) also reduces the number of terms and increases the quality of the clustering results. The bag-of-words representation for Malay documents can be improved further by identifying the named entities that exist in the text articles and reducing the terms based on this named-entity recognition. The F-measure obtained is 94.72%. The second objective of this study is to design an experimental setup that studies the effects of using different clustering linkages, coupled with various proximity measurement techniques, on the quality of the results when clustering bilingual documents.

The clustering linkages include the single, complete, average and centroid linkages, and the proximity measurement techniques include the cosine similarity and the extended Jaccard coefficient. Based on the findings obtained, the average linkage shows ideal clustering results compared to the other clustering linkages, even though the single linkage shows a lower Davies-Bouldin Index (DBI) value. This is because, with the average linkage, the standard deviation of the number of documents across clusters is low. The study also shows that the extended Jaccard coefficient produces better clustering results than the cosine similarity. Finally, the third objective of this study is to investigate the effects of reducing the set of terms considered in clustering English and Malay documents. A Genetic Algorithm (GA) is implemented to reduce the number of terms used, and a set of relevant terms is selected through the GA-based term selection process. The parallel mapping percentages improve when the number of terms is reduced using the GA with different mutation rates.
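To make the two proximity measures named above concrete, the following is a minimal Python sketch, not the thesis implementation. It assumes documents are represented as equal-length term-frequency vectors, and it uses the common definition of the extended Jaccard (Tanimoto) coefficient, x·y / (|x|² + |y|² − x·y). The function names and the toy vectors are hypothetical.

```python
import math


def dot(x, y):
    """Dot product of two equal-length term vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    denom = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / denom if denom else 0.0


def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient; 0.0 if both vectors are all zeros."""
    xy = dot(x, y)
    denom = dot(x, x) + dot(y, y) - xy
    return xy / denom if denom else 0.0


# Hypothetical term-frequency vectors over a four-term vocabulary.
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same term proportions as doc_a, twice the counts
doc_c = [0, 0, 3, 1]  # shares no terms with doc_a

for name, other in (("doc_b", doc_b), ("doc_c", doc_c)):
    print(name, cosine_similarity(doc_a, other), extended_jaccard(doc_a, other))
```

In this toy case the cosine similarity treats doc_a and doc_b as essentially identical because it ignores vector length, while the extended Jaccard coefficient also penalises the difference in term counts. This is one way the two measures can lead to different clusterings of the same corpus.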

