Using Mutual Information Technique in Cross Language Information Retrieval
Syandra Sari Mirna Adriani
Trisakti University University of Indonesia [email protected] [email protected]
ICADL 2008 Bali, 3 Sept 2008
Outline
1. Introduction
2. CLIR
3. Related Research
4. Our Work
5. Mutual Information & QE
6. Experiment
7. Result and Analysis
Syandra Sari 3
Introduction
• The explosive growth of the World Wide Web
• The multilingual nature of the Internet
Both have stimulated the cross-language information retrieval (CLIR) research area.
CLIR
• The challenge in this area is to overcome the language barrier between the query and the document collection.
• CLIR needs to transform queries and documents into a common representation so that monolingual IR techniques can be applied.
CLIR
Two approaches in CLIR:
• translate the query into the language of the documents, or
• translate the documents into the language of the query.
CLIR
Techniques for the translation process:
1. Machine translation
2. Bilingual dictionary
3. Parallel or comparable corpus
Related Research
• Yiming Yang et al. (1998):
– Created a corpus-based term-equivalence matrix extracted automatically from bilingual corpora
– English-Chinese CLIR and English-Japanese CLIR
• Lavrenko et al. (2002):
– Applied a language model to CLIR using a parallel corpus
– Chinese-English CLIR
• Martin Braschler (2004):
– Used similarity thesauri for query translation
– CLIR for several European languages (English, Italian, German, French, Spanish)
CLIR for Indonesian Language
• In our earlier study (Hayurani et al., 2006):
– Indonesian-English CLIR
– Machine translation achieved 84.82% of monolingual performance
– Bilingual dictionary achieved 51.98% of monolingual performance
– Parallel corpus (using a bilingual dictionary) achieved 8% of monolingual performance
Our Work
• Our work carries out the translation process for Indonesian-English CLIR using:
– Parallel corpus
• Pseudo-translation
• Mutual information technique
– Dictionary
– Machine translation
• It is aimed at evaluating the language resources and tools available for the Indonesian-English pair.
Mutual Information
• In monolingual IR, mutual information has been used for finding word associations (Church, 1990).
• In our work:
• It measures the degree of association between an Indonesian word and an English word.
• The word pair with the highest mutual information value is considered to be the best word pair.
Mutual Information
• Mutual information of two points (words), x and y, is defined to be:
$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
• P(x) is the occurrence probability of word x
• P(y) is the occurrence probability of word y
• P(x,y) is the probability that the words x and y occur together
• I(x,y) is the mutual information value
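As a minimal sketch of the definition on this slide, the function below evaluates I(x, y) directly from the three probabilities. The function name and the example probabilities are illustrative assumptions, not values from the study.

```python
import math


def mutual_information(p_x: float, p_y: float, p_xy: float) -> float:
    """Return I(x, y) = log2(P(x, y) / (P(x) * P(y)))."""
    if min(p_x, p_y, p_xy) <= 0:
        raise ValueError("all probabilities must be positive")
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical probabilities chosen so the arithmetic is easy to follow.
# Independent words: P(x, y) equals P(x) * P(y), so the value is 0.
independent = mutual_information(p_x=0.25, p_y=0.5, p_xy=0.125)
# Associated words: they co-occur twice as often as chance predicts.
associated = mutual_information(p_x=0.25, p_y=0.5, p_xy=0.25)
```

A value near zero suggests the two words are unrelated, while a larger positive value suggests a stronger association.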
Mutual Information
• In (Church, 1990) and (Myung-Gill, 1999), the mutual information value is computed from word co-occurrence statistics and can be defined as follows:
$$I(x, y) = \log_2 \frac{N \cdot f(x, y)}{f(x)\,f(y)}$$
• f(x) is the number of documents containing x in a corpus;
• f(y) is the number of documents containing y in the corpus;
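The count-based form can be sketched on a toy corpus as follows. The slide does not define N or f(x, y), so this sketch assumes N is the total number of documents and f(x, y) is the number of documents containing both words; the corpus and the helper names are made up for illustration.

```python
import math


def mi_from_counts(n: int, f_x: int, f_y: int, f_xy: int) -> float:
    """Return log2(N * f(x, y) / (f(x) * f(y))), or -inf if x and y never co-occur."""
    if f_xy == 0:
        return float("-inf")
    return math.log2(n * f_xy / (f_x * f_y))


def document_mi(docs: list[set[str]], x: str, y: str) -> float:
    """Count document frequencies in a toy corpus and apply the formula."""
    f_x = sum(1 for d in docs if x in d)
    f_y = sum(1 for d in docs if y in d)
    f_xy = sum(1 for d in docs if x in d and y in d)
    if f_x == 0 or f_y == 0:
        raise ValueError("both words must occur in the corpus")
    # Assumption: N is the number of documents in the corpus.
    return mi_from_counts(len(docs), f_x, f_y, f_xy)


# Hypothetical four-document corpus, each document reduced to its set of words.
toy_docs = [
    {"new", "york", "city"},
    {"new", "york"},
    {"new", "car"},
    {"old", "city"},
]
old_city = document_mi(toy_docs, "old", "city")  # N=4, f(x)=1, f(y)=2, f(x,y)=1
new_york = document_mi(toy_docs, "new", "york")  # N=4, f(x)=3, f(y)=2, f(x,y)=2
```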
Mutual Information
• We adapted the formula for the Indonesian-English parallel corpus:
$$I(x, y) = \log_2 \left( \frac{D(x, y)}{(D(x) + 1)\,(D(y) + 1)} \cdot \left( N_{\text{Indonesian}} + N_{\text{English}} \right) \right)$$
Mutual Information
• D(x) is the number of Indonesian documents that contain Indonesian word x (exclusive);
• D(y) is the number of English documents that contain English word y (exclusive);
• D(x,y) is the number of Indonesian-English document pairs that contain Indonesian word x and English word y;
• N_Indonesian is the number of items or words in the Indonesian corpus;
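A rough sketch of how the adapted measure could rank candidate English words for each Indonesian word in a document-aligned parallel corpus is given below. Several points are assumptions made for illustration only: the "(exclusive)" qualifier is not interpreted (D(x) and D(y) are plain document frequencies), N_English is taken to be the word count of the English side by analogy with N_Indonesian, and the four document pairs are invented toy data.

```python
import math
from collections import Counter


def rank_translations(pairs, top_k=5):
    """Rank English candidates for each Indonesian word by adapted MI.

    pairs: list of (indonesian_tokens, english_tokens), one per aligned document pair.
    """
    d_x, d_y, d_xy = Counter(), Counter(), Counter()
    n_indonesian = n_english = 0
    for id_tokens, en_tokens in pairs:
        n_indonesian += len(id_tokens)  # word count, Indonesian side
        n_english += len(en_tokens)     # word count, English side (assumed analogue)
        id_set, en_set = set(id_tokens), set(en_tokens)
        d_x.update(id_set)
        d_y.update(en_set)
        d_xy.update((x, y) for x in id_set for y in en_set)

    candidates = {}
    for (x, y), joint in d_xy.items():
        score = math.log2(
            joint * (n_indonesian + n_english) / ((d_x[x] + 1) * (d_y[y] + 1))
        )
        candidates.setdefault(x, []).append((score, y))

    # Highest score first; ties are broken alphabetically for determinism.
    return {
        x: [y for _, y in sorted(c, key=lambda t: (-t[0], t[1]))[:top_k]]
        for x, c in candidates.items()
    }


# Hypothetical aligned document pairs (not from the experiment).
toy_pairs = [
    ("merek mobil".split(), "car brand".split()),
    ("merek susu".split(), "milk brand".split()),
    ("harga mobil".split(), "car price".split()),
    ("harga susu".split(), "milk price".split()),
]
ranked = rank_translations(toy_pairs)
```

In this toy corpus the first-ranked candidate for each Indonesian word is the English word it always co-occurs with, which mirrors the idea that the pair with the highest value is taken as the best word pair.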
Query Expansion
• Query expansion is the process of adding to the query words found in a certain number of the top-ranked English documents retrieved.
• We used a language model formula to choose the best words to add to the query.
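The expansion step can be sketched as follows. The slides do not give the language model formula, so this example substitutes a simple stand-in: each candidate word is scored by summing its maximum-likelihood probability P(w | D) over the top documents. The function name, the toy documents, and the number of added terms are all assumptions.

```python
from collections import Counter


def expand_query(query_terms, top_docs, n_terms=2):
    """Append the n_terms best-scoring words from the top documents to the query."""
    scores = Counter()
    for doc in top_docs:
        if not doc:
            continue  # skip empty documents to avoid dividing by zero
        for term, count in Counter(doc).items():
            scores[term] += count / len(doc)  # maximum-likelihood P(term | doc)

    # Words already in the query are not added again.
    candidates = [(s, t) for t, s in scores.items() if t not in query_terms]
    candidates.sort(key=lambda st: (-st[0], st[1]))
    return list(query_terms) + [t for _, t in candidates[:n_terms]]


# Hypothetical top-ranked documents for a translated query.
toy_query = ["nestle", "brands"]
toy_top_docs = [
    "nestle brands sold worldwide include nescafe".split(),
    "nescafe is a nestle coffee brand".split(),
    "coffee brands compete worldwide".split(),
]
expanded = expand_query(toy_query, toy_top_docs)
```

A real system would normally remove stopwords and smooth the document models against the collection; both are omitted here to keep the sketch short.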
Experiment (1)
• Query
– 50 Indonesian queries from CLEF 2006
• Merek Nestle Merek-merek apa yang dipasarkan oleh Nestle di seluruh dunia
• Nestlé Brands What brands are marketed by Nestlé around the world
• Collection
Experiment (2)
[Figure: Building the Indonesian-English parallel corpus. Diagram labels: English corpus, Machine Translation, Indonesian corpus, Indonesian-English parallel corpus.]
Experiment (3)
[Figure: Diagram labels: Indonesian queries, English queries, IRS, Indonesian-English parallel corpus.]
Experiment (4)
Queries were also translated using:
• Bilingual dictionary
• Machine translation:
– Toggletext
– Transtool
• Pseudo-translation based on the parallel corpus
Result and Analysis
• The Mean Average Precision (MAP) of
– monolingual English queries,
– the Indonesian queries translated using
• bilingual dictionary
• Toggletext machine translation

Technique                          MAP
Monolingual                        0.3242
Dictionary                         0.1685 (-48.02%)
Machine translation (Toggletext)   0.2750 (-15.18%)
Result and Analysis
The Mean Average Precision (MAP) of the Indonesian queries translated using the parallel corpus: pseudo-translation and mutual information technique

Technique                                                    MAP
Parallel-Pseudo Translation (PT)                             0.2245 (-30.75%)
Parallel-Mutual Information (MI)                             0.1085 (-66.53%)
Parallel-Mutual Information with query expansion (MI-QE)     0.1357 (-58.14%)
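The percentages in parentheses are relative changes against the monolingual MAP of 0.3242 reported earlier in the deck. A short check of that arithmetic, with a helper name chosen here for illustration:

```python
def relative_change(map_score: float, baseline: float) -> float:
    """Return the percentage change of map_score relative to baseline."""
    return (map_score - baseline) / baseline * 100


MONOLINGUAL_MAP = 0.3242  # monolingual English baseline from the slides

parallel_runs = {
    "Parallel-Pseudo Translation (PT)": 0.2245,
    "Parallel-Mutual Information (MI)": 0.1085,
    "Parallel-Mutual Information with query expansion (MI-QE)": 0.1357,
}

changes = {
    name: round(relative_change(score, MONOLINGUAL_MAP), 2)
    for name, score in parallel_runs.items()
}
for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```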
Result and Analysis
• English word with the highest MI value (first rank)
• 110 of 260 words

Indonesian word    English word from MI (first rank)
merek              trademark
keadaan            circumstance
pengangguran       unemployment
visa               visa
Result and Analysis
• English word from the 2nd to 5th MI value
• 51 words

Indonesian word    English words from MI (1st to 5th rank)                  Correct English word    Rank
africa             afrikan, african, kwazulu, AFRICA, gatsha                AFRICA                  4
perawatan          aftercare, surgery, TREATMENT, carefree, arthritis       TREATMENT               3
bijaksana          discreet, PRUDENT, wiser, wisdom, indiscreet             PRUDENT                 2
main               romp, overplay, PLAY, playground, nut                    PLAY                    3
Result and Analysis
• Examples of Indonesian words that get wrong English words as translations
• 44 words

Indonesian word    English words from MI (first to fifth rank)
pengaruh           relentless, unsuspected, mindful, favor, abscond
produksi           latch, lug, lubricant, rapacity, sewage
Result and Analysis
• 55 Indonesian words did not get the best or correct English word because of the stemming process. For example:
1. Some Indonesian words get an antonym in English.
• Example: “adil” was translated into “unjustice” using MI
• (“unjustice” is an antonym of “justice”).
2. Some Indonesian words get a related word in English.
• Example: “matahari” was translated into “sunlight” and “solar” using MI (“sunlight” and “solar” are related to “sun”).
Result and Analysis
3. Some Indonesian words do not get the best translation.
• Example: “putri” was translated into “daughter” using MI; the best translation for “putri” in English is “princess”.
4. Some Indonesian words get an English word that is the translation of another variant of the Indonesian word.
Result and Analysis
• There are 9 phrases, but only 3 get the correct translation in English

Indonesian phrase      In English        Using Mutual Information
bahan bakar            fuel              firewood explosive
energi atom            atomic energy     atomic energy
gerhana matahari       solar eclipse     solar eclipse
gempa bumi             earthquake        quake
lintas alam            cross country     undergo straightaway
perdana menteri        prime minister    morihiro
sidang pengadilan      trial             unjustice convocation
uang sekolah           tuition           tuition
undang-undang dasar    constitution      basic invitation
Conclusion
• We find that mutual information
– can rank candidate words by their mutual information value so that the best translation is ranked well; however, it sometimes gives an incorrect translation.
• Machine translation is the best translation method so far
• Based on the result of our evaluation, there is still
Future Research
• Explore techniques for finding bilingual word pairs that work better than the mutual information technique.
• Apply a better query expansion technique to improve the result.