STEMMING - THE SELECTION OF NATURAL LANGUAGE INDEX TERMS

5. STEMMING

Another technique that may improve the quality of automatic indexing is stemming. Stemming, or conflating words, is the process of reducing the morphological variants of words to their stem or root (e.g., mapping singular and plural forms of the same word to a single stem). The program that executes the mapping is called a stemmer. It is assumed that words with the same stem are semantically related and have the same meaning to the user of the text.

Stemming in the field of information retrieval aims at improving the match between the index terms of the query and those of the document text. The chances of matching increase when the index terms are reduced to their word stems.

Stemming, thus, is a recall-enhancing device to broaden an index term in a text search (Salton, 1986). Additionally, stemming reduces the number of index terms by mapping the morphological variants to a standard form.

Consequently, the size of the text representation decreases, which is beneficial in terms of storage.

There are four major automatic approaches to stemming.

1. The table lookup method is the simplest method and requires the terms and their stems to be stored in a table or a machine-readable dictionary (Frakes, 1992). Stemming is done via lookups in the table. The advantage of this method is that the stemming results are generally correct.

However, the table becomes large when it must cover terms in standard language and possibly terms from the specialized subject domain of the text corpus. Large tables require large storage spaces and efficient search algorithms (e.g., binary search tree, hash table).
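The table lookup method can be sketched in a few lines. This is only a minimal illustration: the table entries and the names `STEM_TABLE` and `lookup_stem` are hypothetical, and a Python dictionary stands in for the hash table mentioned above.

```python
# Hypothetical lookup table mapping surface forms to stems.
# The entries are illustrative only; a real table would cover the
# standard vocabulary plus domain-specific terms.
STEM_TABLE = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "wheels": "wheel",
    "indices": "index",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the lowercased term when it is not listed."""
    key = term.lower()
    # A dict is a hash table, so each lookup takes constant time on average.
    return table.get(key, key)


print(lookup_stem("Running"))  # run
print(lookup_stem("bicycle"))  # bicycle (not in the table, returned unchanged)
```

Returning unknown terms unchanged is one possible design choice; an application could equally fall back to an affix removal algorithm for terms missing from the table.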

2. Affix removal algorithms are the most commonly used; they remove suffixes and/or prefixes from terms, leaving a stem (Frakes, 1992). These algorithms also transform the resultant stem (e.g., “a” to “u” in “ran” to “run”; cf. in Dutch: “ie” to “oo” in “liep” to “loop”). The Lovins stemmer (1968) removes suffixes using a longest match algorithm. It removes the longest possible string of characters from a word according to a set of rules. This process is repeated until no more characters can be removed.

Even after all removable characters have been stripped, stems may not be correctly conflated. In that case, linguistic knowledge is employed to recode the stem.

The Porter algorithm (Porter, 1980) removes affixes by applying a set of rules. The rules also account for transformations of the stem. Affix removal algorithms can become quite ingenious and employ many inferences from linguistic knowledge about the internal structure of words for generating the correct reductions (Krovetz, 1993). The knowledge that the affix removal algorithms employ is language dependent.
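The iterative longest-match idea, followed by a recoding step, might be sketched as follows. The suffix list, the recoding table, and the minimum stem length are assumptions made up for this illustration; they are not the actual rules of the Lovins or Porter stemmers.

```python
# Illustrative suffix list and recoding table: NOT the real Lovins or Porter rules.
SUFFIXES = ["ational", "ization", "ness", "ing", "ed", "es", "s"]
RECODE = {"ran": "run"}   # hypothetical stem transformations
MIN_STEM_LEN = 3          # assumed minimum length of what remains


def strip_longest_suffix(word):
    """Remove the longest listed suffix that leaves a long-enough stem."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Strip suffixes repeatedly until nothing more can be removed, then recode."""
    word = word.lower()
    while True:
        shorter = strip_longest_suffix(word)
        if shorter == word:
            break
        word = shorter
    return RECODE.get(word, word)


print(stem("walking"))     # walk
print(stem("kindnesses"))  # kind
print(stem("ran"))         # run
```

With such a small rule set the sketch overstems easily (for example, a word ending in a double "s" loses both letters), which is one reason practical stemmers attach conditions to each rule.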

3. Letter successor variety stemmers (Hafer & Weiss, 1974) learn morphemes from a large body of example words. They use the frequencies of letter sequences in a corpus of texts as the basis of stemming. For each possible initial letter sequence (prefix) of a word, the number of distinct letters that follow it in the corpus (its successor variety) is computed. The successor variety tends to decrease from left to right, while at boundaries of morphemes (e.g., after an affix) the successor variety rises. By calculating the set of successor varieties for a test word and noting the peaks, we can detect the morphemes of a word. When the successor variety becomes very low at the end of a word, suffixes are detected by considering the word and the words in the corpus in reverse letter order. Heuristics determine whether a detected morpheme is a stem or an affix. When the morpheme matches other corpus words, it is probably a stem. When the segment occurs as the first (last) part of a number of different words, it is probably a prefix (suffix). The advantage of this method is that it can adapt to changing text collections and languages, but the method does not distinguish inflectional from derivational affixes.
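As a rough illustration of the peak-detection step, the sketch below computes successor varieties for one test word and cuts the word at strict local peaks. The tiny word list `CORPUS` and the function names are assumptions for this example; the reverse-order pass for suffixes and the stem/affix heuristics are left out.

```python
# Hypothetical toy corpus; a real stemmer would learn from a large word list.
CORPUS = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]


def successor_varieties(word, corpus):
    """For each proper prefix of `word`, count the distinct letters
    that follow that prefix in the corpus words."""
    varieties = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in corpus if len(w) > i and w.startswith(prefix)}
        varieties.append(len(successors))
    return varieties


def split_at_peaks(word, corpus):
    """Cut the word after every prefix whose variety is a strict local peak."""
    sv = successor_varieties(word, corpus)
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    segments, start = [], 0
    for cut in cuts:
        segments.append(word[start:cut])
        start = cut
    segments.append(word[start:])
    return segments


print(successor_varieties("readable", CORPUS))  # [3, 2, 1, 3, 1, 1, 1]
print(split_at_peaks("readable", CORPUS))       # ['read', 'able']
```

The variety drops from 3 to 1 over "r", "re", "rea" and rises again to 3 after "read", which marks the morpheme boundary in this toy corpus.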

4. Finally, the n-gram method conflates terms based on the number of n-grams they share. An n-gram is a sequence of n consecutive letters.

Adamson and Boreham (1974) measure the similarity of pairs of words by the number of unique bigrams they share, computed with the Dice coefficient¹. A bigram is a pair of consecutive letters. Xu and Croft (1998) use trigrams.

Terms that are strongly related by the number of shared n-grams are clustered into groups of related words. Heuristics help in detecting the root form (see above), or special cluster algorithms might be useful for this task (e.g., cluster algorithms based on the selection of representation objects, cf. chapter 8). Again, this method does not distinguish between inflectional and derivational affixes.
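The bigram comparison can be made concrete with a short sketch. It assumes the standard form of the Dice coefficient, 2C / (A + B), where A and B are the numbers of unique bigrams in the two words and C is the number of unique bigrams they share; the function names are chosen for this example only, and the clustering step is omitted.

```python
def unique_bigrams(word):
    """Return the set of distinct pairs of consecutive letters in `word`."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient over the unique bigrams of two words: 2C / (A + B)."""
    x, y = unique_bigrams(a), unique_bigrams(b)
    if not x and not y:
        return 0.0  # words shorter than two letters have no bigrams
    return 2 * len(x & y) / (len(x) + len(y))


# "statistics" has 7 unique bigrams, "statistical" has 8, and they share 6.
print(dice("statistics", "statistical"))  # 0.8
```

Word pairs whose coefficient exceeds a chosen threshold would then be candidates for the same cluster; the threshold itself has to be tuned for the collection.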

Many stemmers have been developed for the English language (for an overview, see Frakes, 1992). The two most common stemmers for English are the Lovins stemmer (Lovins, 1968) and the Porter stemmer (Porter, 1980).

Kraaij and Pohlmann (1996) have used the Porter algorithm to develop a stemmer for Dutch and have developed an additional inflectional and derivational stemmer using a computer-readable dictionary of Dutch words.

In Dutch, nominal compounds are generally formed by concatenating two (or more) words to create a single orthographic word (e.g., “fiets” + “wiel” = “fietswiel” (“bicycle” + “wheel” = “bicycle wheel”)). Stemmers for the Dutch language are extended with a compound analyzer (word splitter) (Vosse, 1994 cited in Kraaij & Pohlmann, 1996). This tool aims at splitting a compound into its components (stems) by applying word combination rules and a lexicon.
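A lexicon-driven splitter might look roughly like the sketch below. It is not the analyzer cited above: the mini-lexicon and the function name `split_compound` are hypothetical, and the word combination rules (including linking letters between components) are ignored.

```python
# Hypothetical mini-lexicon of Dutch stems; a real analyzer would use a
# full lexicon together with word combination rules.
LEXICON = {"fiets", "wiel", "huis", "deur", "bel"}


def split_compound(word, lexicon=LEXICON):
    """Return a list of lexicon words whose concatenation is `word`,
    or None when no such split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest known head first, then split the remainder recursively.
    for i in range(len(word) - 1, 0, -1):
        head = word[:i]
        if head in lexicon:
            rest = split_compound(word[i:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


print(split_compound("fietswiel"))  # ['fiets', 'wiel']
print(split_compound("deurbel"))    # ['deur', 'bel']
print(split_compound("fietsen"))    # None (no full split with this lexicon)
```

Preferring the longest head is a simplifying assumption; with a realistic lexicon several splits may be possible, and the combination rules are what decide between them.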

Automatic stemming can result in overstemming and understemming. The former refers to the case when too much of the term is removed, which causes unrelated terms to be conflated to the same stem. The latter refers to the removal of too little from a term, which prevents related terms from being conflated.

Stemming is useful when the morphology of a language is rich (e.g., Hungarian or Hebrew) or when the text to be indexed is short (Krovetz, 1993). Removal of inflectional morphemes usually has little impact upon a word’s meaning and thus can be safely done (e.g., mapping singular and plural forms of the same word to a single stem). Removal of derivational morphemes may change a word’s meaning. Stemming has been evaluated from the viewpoint of retrieval effectiveness (for an overview of the studies regarding the English language, see Frakes, 1992 and Hull, 1996; regarding the Dutch language, see Kraaij & Pohlmann, 1996). It is generally agreed that stemming has either a positive effect or no effect on retrieval effectiveness. Splitting Dutch compound nouns has been shown to increase retrieval performance.
