(1)

Using Mutual Information Technique in Cross Language Information Retrieval

Syandra Sari (Trisakti University) and Mirna Adriani (University of Indonesia)

ICADL 2008 Bali, 3 Sept 2008

(2)

Outline

1. Introduction
2. CLIR
3. Related Research
4. Our Work
5. Mutual Information & QE
6. Experiment
7. Result and Analysis

(3)

Introduction

• The explosive growth of the World Wide Web
• The multilingual nature of the Internet

Together, these have stimulated research in cross-language information retrieval (CLIR).

(4)

CLIR

• The challenge in this area is to overcome the language barrier between the query and the document collection.
• CLIR needs to transform queries and documents into a common representation so that monolingual IR techniques can be applied.

(5)

CLIR

Two approaches in CLIR:

• translate the query into the language of the documents, or
• translate the documents into the language of the query.

(6)

CLIR

Techniques for the translation process:

1. Machine translation
2. Bilingual dictionary
3. Parallel or comparable corpus

(7)

Related Research

• Yiming Yang et al. (1998):
– Created a corpus-based term-equivalence matrix extracted automatically from bilingual corpora
– English-Chinese CLIR and English-Japanese CLIR

• Lavrenko et al. (2002):
– Applied a language model for CLIR using a parallel corpus
– Chinese-English CLIR

• Martin Braschler (2004):
– Used similarity thesauri for query translation
– CLIR for several European languages (English, Italian, German, French, Spanish)

(8)

CLIR for Indonesian Language

• In our earlier study (Hayurani et al., 2006):
– Indonesian-English CLIR
– Machine translation: 84.82% of monolingual performance
– Bilingual dictionary: 51.98% of monolingual performance
– Parallel corpus (using bilingual dictionary): 8% of monolingual performance

(9)

Our Work

• Our work performs the translation process for Indonesian-English CLIR using:
– a parallel corpus (pseudo-translation and the mutual information technique)
– a dictionary
– machine translation

• It is aimed at evaluating the language resources and tools available for the Indonesian-English pair.

(10)

Mutual Information

• In monolingual IR, mutual information has been used to find word associations (Church, 1990).

• In our work:
– It measures the degree of association between an Indonesian word and an English word.
– The word pair with the highest mutual information value is considered the best word pair.

(11)

Mutual Information

• The mutual information of two points (words), x and y, is defined as:

$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

P(x) is the occurrence probability of word x

P(y) is the occurrence probability of word y

P(x,y) is the probability that the words x and y occur together

I(x,y) is the mutual information value

(12)

Mutual Information

• In (Church, 1990) and (Myung-Gill, 1999), the mutual information value is computed from word co-occurrence statistics and can be defined as follows:

$$I(x, y) = \log_2 \frac{N \cdot f(x, y)}{f(x)\,f(y)}$$

f(x) is the number of documents containing x in a corpus;

f(y) is the number of documents containing y in the corpus.
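The count-based form follows from the probabilistic definition once the probabilities are estimated by relative document frequencies. A short derivation, under two assumptions the slides do not spell out: that N is the total number of documents in the corpus, and that f(x,y) is the number of documents containing both x and y.

$$P(x) \approx \frac{f(x)}{N}, \qquad P(y) \approx \frac{f(y)}{N}, \qquad P(x, y) \approx \frac{f(x, y)}{N}$$

$$I(x, y) = \log_2 \frac{f(x, y)/N}{\bigl(f(x)/N\bigr)\bigl(f(y)/N\bigr)} = \log_2 \frac{N \cdot f(x, y)}{f(x)\,f(y)}$$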

(13)

Mutual Information

• We adapted the formula for an Indonesian-English parallel corpus:

$$I(x, y) = \log_2 \left( \frac{D(x, y)}{(D(x) + 1)\,(D(y) + 1)} \cdot \left( N_{\text{Indonesian}} + N_{\text{English}} \right) \right)$$

(14)

Mutual Information

D(x) is the number of Indonesian documents that contain Indonesian word x (exclusive);

D(y) is the number of English documents that contain English word y (exclusive);

D(x,y) is the number of Indonesian-English document pairs that contain Indonesian word x and English word y;

$N_{\text{Indonesian}}$ is the number of items or words in the Indonesian corpus;
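As an illustration of how the adapted formula could be used to pick translation candidates, here is a minimal Python sketch on a toy parallel corpus. Several things are assumptions rather than details from the slides: the function names `adapted_mi` and `rank_translations`, the toy documents, reading D(x) and D(y) as plain document counts (the "(exclusive)" qualifier is not modeled), and taking the two N values to be the token counts of each side of the corpus.

```python
import math
from collections import defaultdict


def adapted_mi(d_xy, d_x, d_y, n_id, n_en):
    """log2( D(x,y) / ((D(x)+1) * (D(y)+1)) * (N_id + N_en) )."""
    if d_xy == 0:
        return float("-inf")  # the pair never co-occurs in aligned documents
    return math.log2(d_xy / ((d_x + 1) * (d_y + 1)) * (n_id + n_en))


def rank_translations(pairs, word, top_k=5):
    """Rank English candidates for one Indonesian word.

    pairs: list of (indonesian_tokens, english_tokens) aligned documents.
    Returns up to top_k (english_word, score) tuples, best first.
    """
    d_x = 0                   # Indonesian documents containing `word`
    d_y = defaultdict(int)    # English documents containing each word y
    d_xy = defaultdict(int)   # aligned pairs containing both `word` and y
    n_id = n_en = 0           # corpus sizes in tokens (assumed meaning of N)
    for id_doc, en_doc in pairs:
        n_id += len(id_doc)
        n_en += len(en_doc)
        has_x = word in id_doc
        d_x += has_x
        for y in set(en_doc):
            d_y[y] += 1
            if has_x:
                d_xy[y] += 1
    scored = [(adapted_mi(d_xy[y], d_x, d_y[y], n_id, n_en), y) for y in d_xy]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest score first
    return [(y, s) for s, y in scored[:top_k]]


# Hypothetical toy corpus, for illustration only.
toy_pairs = [
    (["merek", "dagang", "baru"], ["new", "trademark"]),
    (["merek", "terkenal"], ["famous", "trademark"]),
    (["visa", "baru"], ["new", "visa"]),
]

for candidate, score in rank_translations(toy_pairs, "merek"):
    print(f"{candidate}: {score:.3f}")
```

Keeping the top five candidates rather than only the first mirrors the way the results are analysed by rank (first rank versus second to fifth rank).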

(15)

Query Expansion

• Query expansion is the process of adding to the query words found in a certain number of the top retrieved English documents.
• We used a language model formula to choose the best words to add to the query.
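The slides do not give the language model formula used for term selection, so the following Python sketch uses a stand-in: each candidate term is scored by the sum of its maximum-likelihood probability P(w|D) over the top retrieved documents. The scoring rule, the function name `expand_query`, and the toy documents are all assumptions for illustration, not the authors' method.

```python
from collections import Counter


def expand_query(query, top_docs, n_terms=3):
    """Append the n_terms best-scoring new terms from top_docs to the query.

    query: list of query tokens.
    top_docs: list of token lists, the top retrieved documents.
    Scoring (assumed): sum over documents of count(w, D) / len(D).
    """
    scores = Counter()
    for doc in top_docs:
        if not doc:
            continue  # skip empty documents to avoid dividing by zero
        for term, count in Counter(doc).items():
            scores[term] += count / len(doc)
    # Only terms not already in the query are candidates for expansion.
    candidates = [(s, t) for t, s in scores.items() if t not in query]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return list(query) + [t for _, t in candidates[:n_terms]]


# Hypothetical example, for illustration only.
top_docs = [
    ["nestle", "brand", "chocolate", "chocolate"],
    ["nestle", "chocolate", "coffee"],
]
print(expand_query(["nestle", "brand"], top_docs, n_terms=1))
```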

(16)

Experiment (1)

• Query

– 50 Indonesian queries from CLEF 2006

– Example (Indonesian): Merek Nestle. Merek-merek apa yang dipasarkan oleh Nestle di seluruh dunia?

– Example (English): Nestlé Brands. What brands are marketed by Nestlé around the world?

• Collection

(17)

Experiment (2)

Building the Indonesian-English Parallel Corpus

[Figure: building the parallel corpus. Elements: English corpus, Indonesian corpus, machine translation, Indonesian-English parallel corpus.]

(18)

Experiment (3)

[Figure: query translation. Elements: Indonesian queries, Indonesian-English parallel corpus, English queries, IRS.]

(19)

Experiment (4)

Queries were also translated using:

• Bilingual dictionary
• Machine translation:
– Toggletext
– Transtool
• Pseudo-translation based on the parallel corpus

(20)

Result and Analysis

• The Mean Average Precision (MAP) of
– the monolingual English queries
– the Indonesian queries translated using
  • a bilingual dictionary
  • Toggletext machine translation

Technique                      MAP
Monolingual                    0.3242
Dictionary                     0.1685 (-48.02%)
M. Translation (Toggletext)    0.2750 (-15.18%)
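For readers unfamiliar with the metric, here is a minimal Python sketch of the standard MAP definition: average precision per query, then the mean over queries. The slides do not say which evaluation tool produced the reported figures, and the function names and toy document IDs are assumptions for illustration.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids), one entry per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical two-query example, for illustration only.
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(round(mean_average_precision(toy_runs), 4))
```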

(21)

Result and Analysis

The Mean Average Precision (MAP) of the Indonesian queries translated using the parallel corpus: pseudo-translation and the mutual information technique

Technique                                                    MAP
Parallel-Pseudo Translation (PT)                             0.2245 (-30.75%)
Parallel-Mutual Information (MI)                             0.1085 (-66.53%)
Parallel-Mutual Information with query expansion (MI-QE)     0.1357 (-58.14%)
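The percentages in parentheses read as the relative change in MAP against the monolingual baseline of 0.3242. A small Python sketch that recomputes them from the reported MAP values is given below. The helper name `relative_change` is an assumption, and last-digit differences from the slides can arise because the reported MAP values are themselves rounded.

```python
BASELINE_MAP = 0.3242  # monolingual English queries

# (MAP, relative change in percent) as reported on the slides
REPORTED = {
    "Dictionary": (0.1685, -48.02),
    "M. Translation (Toggletext)": (0.2750, -15.18),
    "Parallel-PT": (0.2245, -30.75),
    "Parallel-MI": (0.1085, -66.53),
    "Parallel-MI-QE": (0.1357, -58.14),
}


def relative_change(map_value, baseline=BASELINE_MAP):
    """Percentage change of map_value relative to the baseline MAP."""
    return (map_value - baseline) / baseline * 100.0


for name, (map_value, reported) in REPORTED.items():
    print(f"{name:28s} {relative_change(map_value):7.2f}% (reported {reported:.2f}%)")
```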

(22)

Result and Analysis

• English word with the highest MI value (first rank)
• 110 of 260 words

Indonesian word    English word from MI (first rank)
merek              trademark
keadaan            circumstance
pengangguran       unemployment
visa               visa

(23)

Result and Analysis

• Correct English word found at the 2nd to 5th MI rank
• 51 words

Indonesian word    English words from MI (1st to 5th rank)                 Correct English word    Rank
africa             afrikan, african, kwazulu, AFRICA, gatsha               AFRICA                  4
perawatan          aftercare, surgery, TREATMENT, carefree, arthritis      TREATMENT               3
bijaksana          discreet, PRUDENT, wiser, wisdom, indiscreet            PRUDENT                 2
main               romp, overplay, PLAY, playground, nut                   PLAY                    3

(24)

Result and Analysis

• Examples of Indonesian words that get wrong English words as translations
• 44 words

Indonesian word    English words from MI (first to fifth rank)
pengaruh           relentless, unsuspected, mindful, favor, abscond
produksi           latch, lug, lubricant, rapacity, sewage

(25)

Result and Analysis

• 55 Indonesian words did not get the best or correct English word because of the stemming process, for example:

1. Some Indonesian words get an antonym in English.
Example: “adil” was translated into “unjustice” using MI (“unjustice” is the antonym of “justice”).

2. Some Indonesian words get a related word in English.
Example: “matahari” was translated into “sunlight” and “solar” using MI (both are related to “sun”).

(26)

Result and Analysis

3. Some Indonesian words don’t get the best translation.
Example: “putri” was translated into “daughter” using MI, but the best English translation for “putri” is “princess”.

4. Some Indonesian words get an English word that translates a different variant of the Indonesian word.

(27)

Result and Analysis

• There are 9 phrases, but only 3 of them get a correct English translation

Indonesian phrase      In English        Using Mutual Information
bahan bakar            fuel              firewood explosive
energi atom            atomic energy     atomic energy
gerhana matahari       solar eclipse     solar eclipse
gempa bumi             earthquake        quake
lintas alam            cross country     undergo straightaway
perdana menteri        prime minister    morihiro
sidang pengadilan      trial             unjustice convocation
uang sekolah           tuition           tuition
undang-undang dasar    constitution      basic invitation

(28)

Conclusion

• We find that mutual information
– can rank candidate words in a useful order by their mutual information value to find the best translation, but it sometimes gives an incorrect translation.

• Machine translation is the best translation method so far.

• Based on the result of our evaluation, there is still

(29)

Future Research

• Explore better techniques than mutual information for finding bilingual word pairs.
• Apply a better query expansion technique to improve the results.

(30)
