Transformation of Words by Stemming - Attribute Content and Values


4.3.1 Transformation of Words by Stemming

a. If a potential stem ends with a consonant other than S, followed by the letter S, then delete the S.

STEMS→^STEM, but

STRESS→^STRESS

b. If a word ends in ES, drop the terminal S.

PLACES→^PLACE

LIKES→^LIKE

(This rule has difficulty with plural words of Greek origin whose singular ends in IS; THESES, for example, should go to THESIS.) Other rules could be written specifically to deal with these Greek words. Examples of such errors:

INDICES→^INDICE (This would be awkward unless further processing were used to recognize the Latin morphology, where the singular form is INDEX.)

SYNTHESES→^SYNTHESE (Similar error, this time with Greek morphology.)
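
Rules a and b can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the vowels are taken to be A, E, I, O, and U (so Y counts as a consonant), and rule b is tried only when rule a does not apply.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def rule_a(word):
    """Rule a: delete a final S that follows a consonant other than S."""
    if len(word) >= 2 and word[-1] == "S":
        prev = word[-2]
        if prev.isalpha() and prev not in VOWELS and prev != "S":
            return word[:-1]
    return word


def rule_b(word):
    """Rule b: if the word ends in ES, drop the terminal S."""
    if word.endswith("ES"):
        return word[:-1]
    return word


def strip_plural(word):
    """Apply rule a; fall back to rule b when rule a leaves the word alone."""
    word = word.upper()
    stemmed = rule_a(word)
    if stemmed != word:
        return stemmed
    return rule_b(word)


for w in ("STEMS", "STRESS", "PLACES", "LIKES", "THESES", "INDICES"):
    print(w, "->", strip_plural(w))
```

The last two words reproduce the errors discussed above: THESES comes out as THESE and INDICES as INDICE, because nothing in these two rules knows about Greek or Latin plurals.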

c. As endings, IEV→^IEF and METR→^METER. The word ISOMETRIC would first become ISOMETR and then ISOMETER, and BELIEV from the earlier operation would be transformed to BELIEF. This works well with English words, but not with Russian, such as PROKOFIEV.
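
Rule c is a replacement rather than a deletion, which a small lookup table expresses naturally. The sketch below is a minimal illustration; the names REPLACEMENTS and replace_ending are hypothetical, and the earlier step that reduces ISOMETRIC to ISOMETR is assumed to have run already.

```python
# Ending replacements from rule c: (old ending, new ending).
REPLACEMENTS = (("IEV", "IEF"), ("METR", "METER"))


def replace_ending(word):
    """Replace the first matching ending from the table, if any."""
    word = word.upper()
    for old, new in REPLACEMENTS:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word


for w in ("ISOMETR", "BELIEV", "PROKOFIEV"):
    print(w, "->", replace_ending(w))
```

PROKOFIEV is rewritten to PROKOFIEF, which is the unwanted behavior with Russian names noted in the rule.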

d. If a word ends in ING, delete the ING unless the remaining word after deletion consists only of one letter or of TH. Thus:

THINKING→^THINK

SINGING→^SING


Table 4.2

Endings in Lovins’ Stemming Algorithm^a

Length   Ending        Condition code
11       ALISTICALLY   B
11       ARIZABILITY   A
11       IZATIONALLY   B
10       ANTIALNESS    A
10       ARISATIONS    A
10       ARIZATIONS    A
10       ENTIALNESS    A
9        ALLICALLY     C
9        ANTANEOUS     A
4        ABLE          A
4        ABLY          A
4        AGES          B
4        ALLY          B
3        ISM           B
1        E             A

^a This is a sample of word endings or suffixes arranged in descending order of length and alphabetically within a length group. This way, the longest suffix match is made first.

SING→^SING (No change.)

THING→^THING (No change.)

PRECEDING→^PRECED (Not a meaningful word; needs further processing.)

SLING→^SL (Not a meaningful word.)

Since the computed root of PRECEDING would be the non-word PRECED, the E could be restored by another rule that adds E if, after stemming, a word ends in ET, ED, or ES. But then MEETING→^MEET→^MEETE, which we do not want but could eliminate with a rule about double vowels preceding the terminal consonant. It is easy to see how rules can proliferate.
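
The ING rule and the proposed E-restoring repair can be tried out with a short Python sketch. The function names are hypothetical, and the code follows only the conditions stated in rule d and in the paragraph above.

```python
def strip_ing(word):
    """Rule d: delete a final ING unless only one letter or TH would remain."""
    word = word.upper()
    if word.endswith("ING"):
        rest = word[:-3]
        if len(rest) > 1 and rest != "TH":
            return rest
    return word


def restore_e(stem):
    """Proposed repair: add E when a stem ends in ET, ED, or ES."""
    if stem.endswith(("ET", "ED", "ES")):
        return stem + "E"
    return stem


for w in ("THINKING", "SINGING", "SING", "THING", "PRECEDING", "SLING", "MEETING"):
    stripped = strip_ing(w)
    print(w, "->", stripped, "->", restore_e(stripped))
```

The repair turns PRECED into PRECEDE as intended, but it also turns MEET into MEETE, so a further exception would be needed.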

e. If a word ends with ED, preceded by a consonant, delete the ED unless this leaves only a single letter.

ENDED→^END

RED→^RED

PROCEED→^PROCEED (ED not preceded by a consonant.)

PROCEEDED→^PROCEED

f. If, after removal of an ending, a word now ends in a double consonant, e.g., BB, DD, or TT, remove one of the doubled letters.

Thus, EMBEDDED→^EMBEDD by removal of an ending, then EMBEDD→^EMBED by this rule.
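
Rules e and f work as a pair, since rule f fires only after an ending has been removed. A minimal Python sketch follows. The function names are hypothetical, Y is treated as a consonant, and rule f is applied to any doubled consonant rather than only the three pairs listed as examples; all of these are assumptions.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def strip_ed(word):
    """Rule e: delete ED after a consonant unless one letter would remain."""
    if word.endswith("ED") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]
    return word


def undouble(word):
    """Rule f: if the word ends in a doubled consonant, drop one letter."""
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        return word[:-1]
    return word


def stem_ed(word):
    """Apply rule e, then rule f only if an ending was actually removed."""
    word = word.upper()
    stripped = strip_ed(word)
    if stripped != word:
        stripped = undouble(stripped)
    return stripped


for w in ("ENDED", "RED", "PROCEED", "PROCEEDED", "EMBEDDED"):
    print(w, "->", stem_ed(w))
```

Making rule f conditional on a removal is what keeps an unstemmed word that ends in a doubled consonant from being shortened.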

g. If a word ends in ION, remove the ION unless the remaining word has two or fewer letters. If the last letter of the stem is a consonant and the letter preceding it is a vowel, add an E.

DIRECTION→^DIRECT

POLLUTION→^POLLUTE

PLANTATION→^PLANTATE (which another rule would reduce to PLANT)

ZION→^ZION

SCION→^SCION

ANION→^ANION

CATION→^CATE (Error.)

Note that CATION is a made-up word, the combination of CATHODE and ION, and so does not have the usual kind of semantic root.
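
Rule g combines a length guard with a conditional repair. The Python sketch below follows the rule as worded; the function name is hypothetical, and vowels are assumed to be A, E, I, O, and U.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def strip_ion(word):
    """Rule g: remove ION unless two or fewer letters would remain, then
    add E when the stem ends in a vowel followed by a consonant."""
    word = word.upper()
    if not word.endswith("ION"):
        return word
    stem = word[:-3]
    if len(stem) <= 2:
        return word
    if stem[-1] not in VOWELS and stem[-2] in VOWELS:
        stem += "E"
    return stem


for w in ("DIRECTION", "POLLUTION", "PLANTATION", "ZION", "SCION", "ANION", "CATION"):
    print(w, "->", strip_ion(w))
```

The length guard protects ZION, SCION, and ANION, but CATION is one letter too long to be protected and comes out as CATE.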

A workable stemming program would probably require at least 10–20 rules (hundreds are possible), including large numbers of provisions for special cases and irregular words. The program would probably also have logic for iterative application of some of the rules, such as transforming DIRECTIONS to DIRECTION to DIRECT. Programs to remove prefixes are possible, but tend not to be very productive for search purposes. ACCEPTS, DECEPTION, and RECEPTION can be easily stemmed to ACCEPT, DECEPT, and RECEPT, respectively. Then, removal of the prefixes would yield a common stem of CEPT. All three words are derived from a Latin root meaning to take, but the modern meanings are lost in these transformations. A reminder: the roots do not have to be meaningful words for them to be useful.
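
Iterative application can be sketched as a loop that keeps applying the rules until a full pass changes nothing. The Python below is an illustration only: it uses just two rules (the terminal-S rule and the ION rule), and the names strip_s, strip_ion, stem_trace, and the max_passes safety limit are all hypothetical.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def strip_s(word):
    """Delete a final S that follows a consonant other than S."""
    if len(word) >= 2 and word[-1] == "S":
        if word[-2] not in VOWELS and word[-2] != "S":
            return word[:-1]
    return word


def strip_ion(word):
    """Remove ION if more than two letters remain; add E after vowel+consonant."""
    if word.endswith("ION") and len(word) > 5:
        stem = word[:-3]
        if stem[-1] not in VOWELS and stem[-2] in VOWELS:
            stem += "E"
        return stem
    return word


def stem_trace(word, rules, max_passes=10):
    """Apply the rules repeatedly; return every intermediate form in order."""
    trace = [word.upper()]
    for _ in range(max_passes):
        changed = False
        for rule in rules:
            new = rule(trace[-1])
            if new != trace[-1]:
                trace.append(new)
                changed = True
        if not changed:
            break
    return trace


RULES = [strip_s, strip_ion]
for w in ("DIRECTIONS", "ACCEPTS", "DECEPTION", "RECEPTION"):
    print(" -> ".join(stem_trace(w, RULES)))
```

Returning the whole trace rather than only the final stem makes it easy to see which rule produced each intermediate form when a stem looks wrong.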


2. Paice and Husk method: This is a more modern stemming procedure which “is iterative, and uses just one table of rules; each [of which] may specify either deletion or replacement of an ending” (Paice, 1990, p. 56). A simple example is a rule written as SEI3Y>, which states that if the ending IES (written backward in the rule statement for easier detection by a program) is found, the last three letters are replaced with the letter Y. A later rule states that if the detected string (form is the actual Paice–Husk term) starts with a vowel, then at least two letters must remain after stemming; e.g., by rule GNI3 we convert OWING→^OW, but SING→^SING.
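
A rule string such as SEI3Y> can be decoded mechanically, as the following Python sketch suggests. It handles a single rule, not the full iterative table, and several details are assumptions rather than statements from the text: the names parse_rule, acceptable, and apply_rule are hypothetical; the trailing > is read as "continue with further rules" and is optional here; and the condition for stems that begin with a consonant (at least three letters, including a vowel or Y) is added so that SING is left unchanged.

```python
import re

VOWELS = set("AEIOU")

# Reversed ending, number of letters to remove, letters to append,
# and an optional terminator (assumed: '>' continue, '.' stop).
RULE_RE = re.compile(r"^([A-Z]+)(\d)([A-Z]*)([>.]?)$")


def parse_rule(rule):
    """Split a rule such as 'SEI3Y>' into (ending, count, append, continue)."""
    match = RULE_RE.match(rule)
    if match is None:
        raise ValueError("unrecognized rule: " + rule)
    reversed_ending, count, append, flag = match.groups()
    return reversed_ending[::-1], int(count), append, flag == ">"


def acceptable(stem):
    """Check whether enough of the word remains after stemming."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        return len(stem) >= 2  # condition stated in the text
    # Assumed condition for consonant-initial stems.
    return len(stem) >= 3 and any(c in VOWELS or c == "Y" for c in stem)


def apply_rule(word, rule):
    """Apply one rule; return the word unchanged if it does not fit."""
    word = word.upper()
    ending, count, append, _ = parse_rule(rule)
    if not word.endswith(ending):
        return word
    candidate = word[: len(word) - count] + append
    return candidate if acceptable(candidate) else word


print(apply_rule("STUDIES", "SEI3Y>"))
print(apply_rule("OWING", "GNI3"))
print(apply_rule("SING", "GNI3"))
```

Storing the ending reversed lets a program compare it against the word from the last letter backward, which is the convenience the rule notation is designed for.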

There have been many stemming algorithms, none perfect. In general, the more rules of the types illustrated, the greater the probability of correctly stemming a word. But, inevitably, mistakes will be made, as often as not owing to the vagaries of natural language, especially English. For example, as noted above, English words taken more or less directly from Greek, like THESES, do not follow the usual rules for removing terminal ES. Using English rules, we would probably convert THESES into THESE, instead of THESIS. We can readily transform DIRECTION to DIRECT and POLLUTION to POLLUTE (add the E after stemming because the next-to-last letter is a vowel), but the same rule gives the transformation PLANTATION→^PLANTATE, and, as above, PLANTATE→^PLANT. Also, the more rules, the more expensive it is to do stemming, so perfection is rarely sought. Paice states (p. 59) “there are people who tinker with rules, but they do so in an ad hoc fashion; no systematic work on stemmer optimization seems to have appeared in the literature.”
