A TECHNIQUE OF PROBABILITY IN DOCUMENT SIMILARITY COMPARISON IN INFORMATION RETRIEVAL SYSTEM

Poltak Sihombing1, Abdullah Embong2, Putra Sumari3

1Program Studi Ilmu Komputer, Universitas Sumatera Utara, Medan, Indonesia; 2,3School of Computer Science, Universiti Sains Malaysia, Penang, Malaysia

1[email protected]; 2awes.usm.mv; 3putrams.usm.mv

Abstract

Nowadays, storing documents in an information retrieval system is no longer an issue due to the availability of huge storage space, multiple storage devices and media, and various methods of document storage. The challenge lies more in the retrieval of documents, since the documents stored in a database grow very fast and soon become unmanageable. In this paper we propose a probability technique to retrieve a document from one or more databases based on a similarity measure. The similarity measure is calculated using the Jaccard formulation; Jaccard's score is used as a general measurement of document similarity. We have implemented a prototype of an information retrieval system based on a genetic algorithm.

This algorithm is inspired by natural biological evolution: a parent solution (chromosome) with a higher level of fitness has a greater probability to reproduce, while those with a lower level of fitness have a smaller probability to reproduce. Documents with a higher Jaccard's score reflect a higher probability of similarity. Application of this technique facilitates searching for and retrieving a required document from one or more databases based on the representation of its similarity level.

Keywords: probability technique, similarity level representation, document retrieval, Genetic Algorithm.

1. Introduction

In the past few decades, the number of documents stored in the databases of Information Retrieval Systems (IRS) has grown very fast. The most important problem in an IRS is to retrieve the most relevant documents from a database; sometimes the user is unable to retrieve the required document at all. To address this problem, researchers have implemented methods such as inverted indexes, Boolean querying techniques, knowledge-based techniques, neural networks, probabilistic retrieval techniques, and machine learning approaches [1].

Probabilistic retrieval techniques have been used to improve information retrieval performance. The approach is based on two main parameters: the probability of relevance and the probability of irrelevance of a document.

The machine learning approach is a newer paradigm which has attracted the attention of researchers not only in IRS but also in artificial intelligence, computer science, and other disciplines such as engineering, medicine, and business [1]. In contrast to performance systems, which acquire knowledge from human experts, machine learning systems acquire knowledge automatically from examples, i.e., from source data. The most frequently used techniques include symbolic, inductive learning algorithms [5], multiple-layered, feed-forward neural networks such as Backpropagation networks [6], and evolution-based genetic algorithms (GA) [1].

In this paper, we propose a technique of probability for document similarity comparison in an information retrieval system using a genetic algorithm. Through similarity comparison we can compare the ranking of the relevant documents, taking the similarity level as the result of retrieval. Our objective is to develop the probability technique within the concepts of artificial intelligence (AI) and GA, and to implement it in an IRS.

2. Genetic Algorithm

GA is a branch of Artificial Intelligence representing a computational model inspired by the theory of evolution. In GA, the following steps are repeated until a solution is found [1]:

a. Forming the Chromosome Model.

Before GA can be used to solve a problem, a way of encoding a potential solution to the problem must be found. A chromosome is formed by genes, each represented by a bit (0 or 1).

In an IRS, the keywords used in the set of user-selected documents are first identified to form the underlying bit strings of the initial population. Each bit represents the same unique keyword throughout the complete GA process. When a keyword is present in a document, the bit is set to 1; otherwise it is set to 0. Each document can then be represented as a sequence of 0s and 1s, which is called a chromosome model, for example: 01101010110101101. At the beginning of a run of a genetic algorithm, a large population of random chromosomes is created. The chromosome model depends on the case; as an example, consider the following equation:

a + b * c - d * e = 100

We search for values of the variables a, b, c, d, and e that make the equation equal to 100. Supposing the maximum value of each variable is 15, four bits are needed to represent each value in binary. Since there are five variables, the chromosome length is 20 bits.

Taking one particular solution/chromosome as an example, let the value of variable a be 10 (binary 1010), b be 5 (0101), c be 14 (1110), d be 2 (0010), and e be 9 (1001); hence the chromosome for this solution is [1]:

10100101111000101001
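To make the encoding concrete, the following Python sketch (our own illustration, not code from the paper; the function names encode/decode are hypothetical) packs the five 4-bit variable values into a 20-bit chromosome string and decodes it back:

```python
# Illustrative sketch: encode/decode a 20-bit chromosome for the equation
# a + b*c - d*e = 100, using four bits per variable (values 0..15).

def encode(values):
    """Encode (a, b, c, d, e) into a 20-bit chromosome string."""
    return "".join(format(v, "04b") for v in values)

def decode(chromosome):
    """Decode a 20-bit chromosome string back into (a, b, c, d, e)."""
    return tuple(int(chromosome[i:i + 4], 2) for i in range(0, 20, 4))

if __name__ == "__main__":
    chrom = encode((10, 5, 14, 2, 9))
    print(chrom)          # 10100101111000101001, as in the example above
    print(decode(chrom))  # (10, 5, 14, 2, 9)
```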

b. Forming the Early Population at Random.

The population is a collection of chromosomes. The ancestors that form the early (initial) population are generated at random. There is no fixed rule for the size of the early population; it depends on the problem at hand and the available computing power. An example early population for the case above is as follows:

- 01101101101010010111 (a=6, b=13, c=10, d=9, e=7)
- 01011010110101101010 (a=5, b=10, c=13, d=6, e=10)
- 11011011110010100100 (a=13, b=11, c=12, d=10, e=4)
- 01011010001010100110 (a=5, b=10, c=2, d=10, e=6)
- 11011011110110101111 (a=13, b=11, c=13, d=10, e=15)


c. Evaluating the Fitness of Each Chromosome.

The objective of this phase is to obtain a fitness score that is used to select chromosomes for the next generation. Fitness evaluation depends on the case or problem. For this example, if a = 0, b = 0, c = 0, d = 15, and e = 15, the equation yields its worst result:

0 + 0 * 0 - 15 * 15 = -225

which deviates from the expected value of 100 by |100 - (-225)| = 325, so the worst score is -325. The fitness score is calculated as:

Fitness score = |P| - |Q - R|

where P is the negation of the worst score (here 325), Q is the expected value (100), and R is the result of the equation with the chromosome's variable values substituted. The fitness scores for the example population above are shown in Table 1 below [1]; a short code sketch of the same calculation follows the table.

Table 1. Calculation of fitness scores

Chromosome                 | Variable values                | Result of equation       | Fitness score
(A) 01101101101010010111   | a=6, b=13, c=10, d=9, e=7      | 6 + 13*10 - 9*7 = 73     | 325 - |100 - 73| = 298
(B) 01011010110101101010   | a=5, b=10, c=13, d=6, e=10     | 5 + 10*13 - 6*10 = 75    | 325 - |100 - 75| = 300
(C) 11011011110010100100   | a=13, b=11, c=12, d=10, e=4    | 13 + 11*12 - 10*4 = 105  | 325 - |100 - 105| = 320
(D) 01011010001010100110   | a=5, b=10, c=2, d=10, e=6      | 5 + 10*2 - 10*6 = -35    | 325 - |100 - (-35)| = 190
(E) 11011011110110101111   | a=13, b=11, c=13, d=10, e=15   | 13 + 11*13 - 10*15 = 6   | 325 - |100 - 6| = 231
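As an illustration only (this is our sketch, not the authors' implementation), the fitness calculation of Table 1 can be expressed in Python as follows:

```python
# Illustrative sketch of the fitness score |P| - |Q - R|, with P = 325
# (negation of the worst score) and Q = 100 (expected value of the equation).

WORST = 325      # |100 - (0 + 0*0 - 15*15)|
EXPECTED = 100

def equation(a, b, c, d, e):
    return a + b * c - d * e

def fitness(chromosome):
    a, b, c, d, e = (int(chromosome[i:i + 4], 2) for i in range(0, 20, 4))
    return WORST - abs(EXPECTED - equation(a, b, c, d, e))

if __name__ == "__main__":
    for chrom in ["01101101101010010111", "01011010110101101010",
                  "11011011110010100100", "01011010001010100110",
                  "11011011110110101111"]:
        print(chrom, fitness(chrom))  # 298, 300, 320, 190, 231 as in Table 1
```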

d. Determination of the Next Population.

The population of the next generation is determined based on the fitness scores. A chromosome with a higher level of fitness has a bigger probability to reproduce, while those with a lower level of fitness have a smaller probability to reproduce. Commonly, this probability is realised by the roulette wheel selection method (a code sketch follows the example below).

In the previous example, the total fitness score is 298 + 300 + 320 + 190 + 231 = 1339. Hence, the selection share of each chromosome is [1][2]:

- A: 298/1339 * 100% = 22%
- B: 300/1339 * 100% = 22%
- C: 320/1339 * 100% = 25%
- D: 190/1339 * 100% = 14%
- E: 231/1339 * 100% = 17%


The roulette wheel selection is illustrated in the figure below:

Fig. 1 Roulette/circle diagram of fitness (A: 22%, B: 22%, C: 25%, D: 14%, E: 17%).

From the roulette/circle diagram, scores from 0 to 22 belong to chromosome A, scores from 22.1 to 44 (44 = 22 + 22) to chromosome B, scores from 44.1 to 69 (69 = 44 + 25) to chromosome C, scores from 69.1 to 83 (83 = 69 + 14) to chromosome D, and scores from 83.1 to 100 (100 = 83 + 17) to chromosome E. Drawing a random score in the range 0 to 100 five times, for example:

- 42: falls in region B
- 83: falls in region D
- 33: falls in region B
- 11: falls in region A
- 60: falls in region C

Hence, the next generation population is:

- 01011010110101101010 (chromosome B)
- 01011010001010100110 (chromosome D)
- 01011010110101101010 (chromosome B)
- 01101101101010010111 (chromosome A)
- 11011011110010100100 (chromosome C)
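A minimal sketch of the roulette wheel selection step described above (our own illustration; the function name roulette_select is an assumption, not from the paper):

```python
import random

# Illustrative roulette wheel selection: each chromosome is drawn with
# probability proportional to its fitness score.

def roulette_select(population, fitnesses):
    """Return one chromosome, drawn with probability fitness_i / total fitness."""
    total = sum(fitnesses)
    pick = random.uniform(0, total)
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return chrom
    return population[-1]  # safety fallback against floating-point rounding

if __name__ == "__main__":
    population = ["A", "B", "C", "D", "E"]   # stand-ins for the five chromosomes
    fitnesses = [298, 300, 320, 190, 231]    # fitness scores from Table 1
    print([roulette_select(population, fitnesses) for _ in range(5)])
```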

e. Crossover.

This is simply the chance that two chromosomes will swap their bits. A good value for this is around 0.7. Crossover is performed by selecting a random gene along the length of the chromosomes and swapping all the genes after that point.

e.g. Given two chromosomes:

10001001110010010
01010001001000011

choose a random position along their length, say position 9, and swap all the bits after that point, so that the chromosomes become:

10001001101000011
01010001010010010
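A sketch of the single-point crossover just described (illustrative code under the stated 0.7 crossover rate; the paper itself gives no implementation):

```python
import random

# Illustrative single-point crossover: with probability `rate`, swap all bits
# after a randomly chosen position of the two parent chromosomes.

def crossover(parent1, parent2, rate=0.7):
    if random.random() > rate:
        return parent1, parent2                  # no crossover this time
    point = random.randint(1, len(parent1) - 1)  # random crossover position
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

if __name__ == "__main__":
    print(crossover("10001001110010010", "01010001001000011"))
```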

f. Mutation.

This is the chance that a bit within a chromosome will be flipped (0 becomes 1, 1 becomes 0). Whether a mutation is performed is determined by a constant Pm, which expresses the probability of a mutation happening. The value of Pm is usually set very low, for example 0.01.

Whenever chromosomes are chosen from the population, the algorithm first checks whether crossover should be applied and then iterates down the length of each chromosome, mutating the bits where applicable. The iteration is done for each chromosome by drawing a random score between 0 and 1 for every bit: if the random score is <= Pm, the gene/bit is inverted; otherwise nothing is done. As an example, take the chromosome:

01101101010110101101


Iterating over every bit and drawing a random score between 0 and 1 gives:

- Bit 1: 0.180979788303375
- Bit 2: 0.64313793182373
- Bit 3: 0.517344653606415
- Bit 4: 0.0501810312271118
- Bit 5: 0.172802269458771
- Bit 6: 0.0090474414825439
- Bit 7: 0.225485235254974
- Bit 8: 0.128151774406433
- Bit 9: 0.581712305545807
- Bit 10: 0.173850536346436
- Bit 11: 0.2264763712883
- Bit 12: 0.518977284431458
- Bit 13: 0.477839291095734
- Bit 14: 0.529659032821655
- Bit 15: 0.27648252248764
- Bit 16: 0.266663908958435
- Bit 17: 0.791664183139801
- Bit 18: 0.167530059814453
- Bit 19: 0.874832332134247
- Bit 20: 0.0878018701157

With Pm = 0.01, only bit 6 (random score 0.0090...) falls below the threshold and is therefore flipped, so the chromosome after mutation is:

01101001010110101101
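A sketch of the bit-flip mutation step (illustrative only, using Pm = 0.01 as suggested above; the function name is ours):

```python
import random

# Illustrative mutation: each bit is flipped independently with probability Pm,
# mirroring the per-bit random scores drawn in the example above.

def mutate(chromosome, pm=0.01):
    bits = []
    for bit in chromosome:
        if random.random() <= pm:                  # flip this gene
            bits.append("1" if bit == "0" else "0")
        else:
            bits.append(bit)
    return "".join(bits)

if __name__ == "__main__":
    print(mutate("01101101010110101101"))
```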

g. Next Generation Evaluation.

In this phase, the whole population is evaluated to check whether it has reached the expected solution. If not, the algorithm returns to step (c) and repeats until the expected solution is obtained.

In the previous example, the algorithm stops when one of the new chromosomes in the population scores 100 on the equation. For instance, if in a new generation one of the chromosomes is 10001010110001000111 (a=8, b=10, c=12, d=4, e=7), the result of the equation is 8 + 10 * 12 - 4 * 7 = 100. Hence this chromosome is the expected solution, and the genetic algorithm stops at this phase [1][2].

3. GA's application in IRS

In this research design, a keyword represents a gene (a bit), a document's list of keywords represents an individual (a bit string), and a collection of documents initially judged relevant by a user represents the initial population. Based on a Jaccard's score matching function (the fitness measure), the initial population evolves through generations and eventually converges to an optimal (improved) population, i.e., a set of keywords which best describes the documents. The Jaccard's score similarity approach was adopted to compute the "fitness" of subject descriptions for information retrieval [14].

3.1 Algorithm

Let us say there are N chromosomes in the initial population. Then, the following steps are repeated until a solution is found:

1. Test each chromosome to see how good it is at solving the problem at hand and assign a fitness score accordingly. The fitness score is a measure of how good that chromosome is at solving the problem.

2. Select two members from the current population. The chance of being selected is proportional to a chromosome's fitness. Roulette wheel selection is a commonly used method.

3. Depending on the crossover rate, cross over the bits of the chosen chromosomes at a randomly chosen point.

4. Step through the chosen chromosomes' bits and flip them depending on the mutation rate.

5. Repeat steps 2, 3, and 4 until a new population of N members has been created (a sketch of this loop is given below).
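The five steps above can be tied together in a driver loop such as the following sketch (our own illustration; the helper functions passed in correspond to the hypothetical fitness, selection, crossover, and mutation sketches given in Section 2):

```python
# Illustrative GA driver loop for the steps of Section 3.1.

def run_ga(initial_population, fitness, select, crossover, mutate,
           is_solution, max_generations=1000):
    population = list(initial_population)
    n = len(population)
    for _ in range(max_generations):
        scores = [fitness(c) for c in population]      # step 1: fitness
        if any(is_solution(c) for c in population):
            return population                          # solution found, stop
        new_population = []
        while len(new_population) < n:                 # step 5: rebuild population
            p1 = select(population, scores)            # step 2: selection
            p2 = select(population, scores)
            c1, c2 = crossover(p1, p2)                 # step 3: crossover
            new_population.append(mutate(c1))          # step 4: mutation
            new_population.append(mutate(c2))
        population = new_population[:n]
    return population
```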

3.2 Implementation

As mentioned before, we have implemented a prototype Journal Browser that retrieves similar documents from a database using the Jaccard formulation as the fitness score.


Jaccard's score is formulated as:

#(X ∩ Y) / #(X ∪ Y)

where #(S) denotes the number of elements in S. For example, with the keyword universe

S = {a, b, c, d, e, f, g, h, i, j},

if X = {a, b, e, g, h, i, j} and Y = {b, c, d, f, g, j}, then

X ∪ Y = {a, b, c, d, e, f, g, h, i, j}
X ∩ Y = {b, g, j}

∴ #(X ∩ Y) / #(X ∪ Y) = 3/10 = 0.3
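The worked example above can be checked with a few lines of Python (illustrative only):

```python
# Illustrative check of the Jaccard score #(X intersect Y) / #(X union Y).

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

X = {"a", "b", "e", "g", "h", "i", "j"}
Y = {"b", "c", "d", "f", "g", "j"}
print(jaccard(X, Y))  # 3/10 = 0.3, as in the example above
```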

Jaccard's score is used as the general fitness measurement in the genetic algorithm. From this formula it can be seen that if X is equal to Y, the Jaccard's score is 1; a fitness score of 1 means that document X is exactly the same as document Y, although such a case is rarely found in a database.

If a keyword is present in a document, the bit is set to 1, otherwise to 0. Each document can then be represented as a sequence of 0s and 1s. We compute the fitness of each document based on its relevance to the documents in the user-selected set. Documents with a higher Jaccard's score reflect a higher probability of similarity. Application of this technique facilitates searching for and retrieving a required document from one or more databases based on the representation of the similarity level [14].
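When documents are represented as keyword bit strings, the same score can serve as the fitness measure; a minimal sketch (our own illustration, using the keyword set and two of the bit strings from the report in the appendix):

```python
# Illustrative Jaccard score between two keyword bit strings: a bit is 1 when
# the corresponding keyword is present in the document.

def jaccard_bits(doc_a, doc_b):
    both = sum(1 for a, b in zip(doc_a, doc_b) if a == "1" and b == "1")
    either = sum(1 for a, b in zip(doc_a, doc_b) if a == "1" or b == "1")
    return both / either if either else 0.0

# Keyword order: HCI, Alert, Warning, Interruption, interaction, 3D, human
query_doc = "1111100"
candidate = "1011111"
print(jaccard_bits(query_doc, candidate))  # similarity level of the two documents
```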

The interface form of the Journal Browser is shown in Fig. 2.

Fig. 2 Interface form of the Journal Browser.

4. Results

For testing, queries were chosen randomly and query classification was then applied. The experiments indicate that the similarity percentages of the retrieved documents remain consistent within a certain range even though the ranking values change; that is, documents at the top of the ranking remain at the top.

There is a tendency for a query that involves the mutation and crossover processes to produce better document retrieval (a higher retrieval percentage), while queries with less mutation yield a lower retrieval percentage.

The parent solution (chromosome) with the higher level of fitness has a bigger probability to reproduce, while those with lower level of fitness have less probability to reproduce. Documents with a higher Jaccard's score reflect a higher probability of similarity.

5. Conclusion

Research in information retrieval has been advancing very rapidly over the past few decades.

Researchers have experimented with techniques ranging from probabilistic models and the vector space model to the knowledge-based approach and the recent machine learning techniques. At each stage, significant insights have been gained into how to design more useful and "intelligent" information retrieval systems. Currently, much IRS research is based on machine learning techniques. Symbolic machine learning and genetic algorithms are two further popular candidates for adaptive learning in such applications. This paper has discussed a technique of probability in document similarity comparison in IRS. We hope the system can be developed further to improve precision in document retrieval.

References

1. Chen, Hsinchun, Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms, Artificial Intelligence Lab, Eller College of Management, The University of Arizona, 1995. http://ai.bpa.arizona.edu/papers/mlir93/mlir93.html

2. R. K. Belew. Adaptive information retrieval. In Proceedings of the Twelfth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 11-20, Cambridge, MA, June 25-28, 1989.

3. R. E. Stepp and R. S. Michalski. Conceptual clustering: Inventing goal-oriented classifications of structured objects. In Machine Learning, An Artificial Intelligence Approach, Vol. II, pages 472-498, Carbonell, J. G., Michalski, R. S., and Mitchell, T. M., Editors, Morgan Kaufmann, Los Altos, CA, 1987.

4. S. M. Weiss and C. A. Kulikowski. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Networks, Machine Learning, and Expert Systems. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1991.

5. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, CA, 1993.

6. D. E. Rumelhart, B. Widrow, and M. A. Lehr. The basic ideas in neural networks. Communications of the ACM, 37(3):87-92, March 1994.

7. Hsiung, Sam, An Introduction to Genetic Algorithms, generation5, 2000. http://www.generation5.org/content/2000/ga.asp

8. Anonymous, Genetic Algorithm Tutorial, ai-junkie, 2000. http://www.ai-junkie.com/ga/intro/gat1.html

9. D. E. Goldberg. Genetic and evolutionary algorithms come of age. Communications of the ACM, 37(3):113-119, March 1994.

10. H. Chen and K. J. Lynch. Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, 22(5):885-902, September/October 1992.

11. H. Chen, K. J. Lynch, K. Basu, and D. T. Ng. Generating, integrating, and activating thesauri for concept-based document retrieval. IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2):25-34, April 1993.

12. H. Chen and J. Kim. GANNET: a machine learning approach to document retrieval. Journal of Management Information Systems, 11(3):7-41, Winter 1994-95.

13. M. Gordon. Probabilistic and genetic algorithms for document retrieval. Communications of the ACM, 31(10):1208-1218, October 1988.

14. M. D. Gordon. User-based document clustering by redescribing subject descriptions with a genetic algorithm. Journal of the American Society for Information Science, 42(5):311-322, June 1991.

Lampiran (Appendix)

LAPORAN INFORMATION RETRIEVAL (Information Retrieval Report)

20/3/2005 8:15:12 AM


Pilihan (Choice): HCI

Untuk 2 Dokumen sebagai Query (For 2 Documents as a Query)

1. Human-Computer Interaction for Alert Warning and Attention Allocation Systems of the Multi-Modal Watchstation

Keywords : HCI, Alert, Warning, Interruption, interaction

2. A Tracking Framework for Collaborative Human Computer Interaction
Keywords : HCI, 3D, interaction, human

Himpunan Kata Kunci Keseluruhan (Overall of Keyword Gathering) HCI, Alert, Warning, Interruption, interaction, 3D, human

Generasi ke-1 (1st Generation)

Populasi Awal (Early Population):

1111100 [0.642857142857143]

1000111 [0.642857142857143]

Populasi Sesudah Seleksi (Population After Selection) : 1111100

1000111

Populasi Sesudah Persilangan (Population After Crossing) : 1111100

1000111

Proses Mutasi(Process of Mutation):

Populasi Sesudah Mutasi (Population After Mutation) : 1111100

1000111

Generasi ke-2 (2nd Generation)

Populasi Awal (Early Population) : 1111100 [0.642857142857143]

1000111 [0.642857142857143]

Populasi Sesudah Seleksi (Population After Selection) : 1000111

1111100

Proses Persilangan (Process of Crossing):

Kromosom 2 dengan kromosom 1, posisi 5 (Chromosome 2 with chromosome 1, position 5)
Populasi Sesudah Persilangan (Population After Crossing) :

1000100 1111111

Proses Mutasi(Process of Mutation):

Populasi Sesudah Mutasi (Population After Mutation) :


1000100 1111111

Generasi ke-3 (3rd Generation)

Populasi Awal (Early Population) : 1000100 [0.45]

1111111 [0.642857142857143]

Populasi Sesudah Seleksi (Population After Selection) : 1111111

1000100

Proses Persilangan (Process of Crossing) :

Kromosom 1 dengan kromosom 2, posisi 2 (Chromosome 1 with chromosome 2, position 2)
Populasi Sesudah Persilangan (Population After Crossing) :

1100100 1011111

Proses Mutasi(Process of Mutation):

Populasi Sesudah Mutasi (Population After Mutation) : 1100100

1011111

Generasi ke-4 (4th Generation)

Populasi Awal (Early Population) : 1100100 [0.5]

1011111 [0.619047619047619]

Populasi Sesudah Seleksi (Population After Selection) : 1100100

1011111

Populasi Sesudah Persilangan (Population After Crossing) : 1100100

1011111

Proses Mutasi(Process of Mutation):

Populasi Sesudah Mutasi (Population After Mutation) : 1100100

1011111

Generasi ke-5 (5th Generation)

Populasi Awal (Early Population) : 1100100 [0.5]

1011111 [0.619047619047619]

Populasi Sesudah Seleksi (Population After Selection) : 1011111


1011111

Populasi Sesudah Persilangan (Population After Crossing) : 1011111

1011111

Proses Mutasi(Process of Mutation):

Populasi Sesudah Mutasi (Population After Mutation) : 1011111

1011111

Generasi Akhir (Last Generation)
Solusi (Solution): 1011111
Kata kunci (Keywords):

HCI, Warning, Interruption, interaction, 3D, human

Persentase Kemiripan Dokumen Hasil Retrieval (Percentage of Documents Retrieval Result) :

[37,50%] Human-computer interaction Through computer vision
[33,33%] Enabling Implicit Human Computer Interaction A Wearable RFID-Tag Reader
[25,00%] Human-Computer Interaction Standards
[25,00%] Vision-based sensor fusion for Human-Computer Interaction
[22,22%] New Human-Computer Interaction Techniques
[22,21%] The Role of Electrophysiology in Human Computer Interaction
[22,21%] Visual Communication and Interaction
[20,00%] Analysing Human-Computer Interaction as Distributed Cognition: The
[20,00%] Implicit Human Computer Interaction Through Context
[20,00%] Towards Conversational Human-Computer Interaction
