THE SELECTION OF NATURAL LANGUAGE INDEX TERMS

8. ALTERNATIVE PROCEDURES FOR SELECTING INDEX TERMS

extent of topic coverage represented by a specific term in the text can be approximated by its frequency of occurrence.

The number of topic words is very large and the probability that any one of them is selected is small, so the generation of topic terms can be modeled as a Poisson process. The mean of this Poisson process, however, depends upon the degree of topic coverage associated with the topic term. The document collection can be broken down into subclasses according to the topic coverage of a specific term, and the assumption is made that a different Poisson distribution, with its own parameter, applies to the term in each subclass. The distribution of the text token i within each class C_j is governed by a single Poisson process with a mean of λ_j. This mean is often computed as the average number of occurrences of the text token i per text in this class and represents the extent of the topic associated with i.
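The single-class case can be sketched as follows. This is a minimal illustration, assuming hypothetical within-text counts for one class; the function names `estimate_lambda` and `poisson_pmf` and the sample data are assumptions made here, not part of the source.

```python
import math

def estimate_lambda(counts):
    """Estimate the class mean as the average number of occurrences per text."""
    return sum(counts) / len(counts)

def poisson_pmf(k, lam):
    """Probability of exactly k occurrences under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical within-text frequencies of term i in the texts of one class.
counts_in_class = [0, 1, 2, 1, 0, 3, 1, 0]

lam = estimate_lambda(counts_in_class)   # average occurrences per text
p_two = poisson_pmf(2, lam)              # chance that a text has exactly 2 occurrences
```

A larger estimated mean corresponds to a class in which the term covers more of the topic of its texts.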

The occurrences of a certain word across all texts are then distributed according to a Multiple Poisson (nP) distribution, in which the number of components equals the number of classes. A Multiple Poisson distribution is a mixture of Poisson distributions with different means (λ_j).

Thus, the distribution of a certain text term i in texts within the whole collection is governed by a weighted sum of Poisson distributions, one for each class of topic coverage. Each summand is an independent single Poisson distribution that describes the frequency of occurrence within the subset of texts that share the same level of topic coverage related to the text term. The probability that a randomly chosen document text D_m contains k occurrences of a certain term i is given by:

$$P(k) = \sum_{j} \pi_j \, \frac{e^{-\lambda_j} \lambda_j^{k}}{k!} \qquad (12)$$

where

j = class of topic coverage related to the term i

λ_j = average extent of topic coverage related to the term i within the class C_j

π_j = probability that the text belongs to the class C_j, with Σ_j π_j = 1.
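The mixture in (12) can be evaluated directly once the class probabilities and class means are known. The following is a minimal sketch under that assumption; the function names and the two-class parameter values are hypothetical and chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k occurrences under a single Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def multiple_poisson_pmf(k, priors, means):
    """Mixture probability of k occurrences: sum over classes of pi_j * Poisson(k; lambda_j)."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    return sum(pi * poisson_pmf(k, lam) for pi, lam in zip(priors, means))

# Hypothetical two-class setting: most texts barely touch the topic of the
# term (mean 0.2), a minority treat it in depth (mean 3.0).
priors = [0.8, 0.2]
means = [0.2, 3.0]

# Probability of observing 0, 1, ..., 5 occurrences in a randomly chosen text.
p = [multiple_poisson_pmf(k, priors, means) for k in range(6)]
```

With more classes, only the lengths of `priors` and `means` change; the sum itself stays the same.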

The validity of the Multiple Poisson (nP) model has been tested for single words (Bookstein & Swanson, 1974; Harter, 1975a, 1975b; Losee, 1988; Srinivasan, 1990). The study of Margulis (1992) indicates that over 70% of frequently occurring words and word stems indeed behave according to the Multiple Poisson model. The proportion of words that are Multiple Poisson distributed depends on the collection size, text length, and the frequency of individual words. Most of the words are distributed according to the mixture of relatively few single Poisson distributions (two, three or four).

For indexing purposes, it is important to compute for each term the extent of topic coverage in order to select the term as an index term or to weight the term appropriately. The ultimate aim of the Multiple Poisson (nP) model of word distribution is therefore that the division of texts into classes gives insight into the content of the texts based on the number of word occurrences. Assuming that terms in a body of text are generated by a Poisson process makes it possible to measure the probability that a text has a given number of occurrences, given the average frequency of occurrence of the term in a class about the topic in a reference or example collection. The probability that a text with k occurrences of the index term i belongs to a certain class of topic coverage (C_x) with a mean of λ_x regarding the use of index term i can be computed by (cf. Equation 12; van Rijsbergen, 1979, p. 28 ff.):

$$P(C_x \mid k) = \frac{\pi_x \, e^{-\lambda_x} \lambda_x^{k}}{\sum_{j} \pi_j \, e^{-\lambda_j} \lambda_j^{k}} \qquad (13)$$

For each class of topic coverage regarding index term i, this probability can be computed and used as a criterion for class membership (and consequently as a criterion for selection of the index term) or used as a probabilistic term weight. The difficulty in using this approach lies in the estimation of the parameters, especially in estimating the means of each Poisson distribution. A common technique estimates the parameters of a two-Poisson distribution for each term directly from the distribution of within-text frequencies in the class of example texts that is about the topic term and in the class of example texts that does not bear upon the topic term (Robertson, van Rijsbergen, & Porter, 1981). Estimation of the parameters needs further research (cf. Losee, 1988; Robertson & Walker, 1994).
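The class-membership probability in (13) and a direct two-Poisson estimate from example texts can be sketched together. This is an illustration only: it assumes labeled example texts are available, estimates each mean as a plain class average, and uses hypothetical function names and counts. It is not claimed to be the estimation procedure of the cited authors.

```python
import math

def class_posteriors(k, priors, means):
    """Probability of each class given k occurrences (Bayes' rule; k! cancels)."""
    joint = [pi * math.exp(-lam) * lam ** k for pi, lam in zip(priors, means)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("k occurrences are impossible under every class")
    return [value / total for value in joint]

def fit_two_poisson(counts_about, counts_not_about):
    """Estimate class probabilities and means from two sets of example texts.

    Class 0 holds the texts that do not bear upon the topic term,
    class 1 the texts that are about it.
    """
    n_about, n_not = len(counts_about), len(counts_not_about)
    total = n_about + n_not
    priors = [n_not / total, n_about / total]
    means = [sum(counts_not_about) / n_not, sum(counts_about) / n_about]
    return priors, means

# Hypothetical within-text frequencies of one term in the example texts.
counts_about = [3, 2, 4, 3]
counts_not_about = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

priors, means = fit_two_poisson(counts_about, counts_not_about)
post_k3 = class_posteriors(3, priors, means)  # text with 3 occurrences
post_k0 = class_posteriors(0, priors, means)  # text with no occurrence
```

The second entry of each posterior list can serve either as a selection criterion, by comparing it with a threshold, or directly as a probabilistic term weight.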

8.2 The Role of Discourse Structure

Knowledge about discourse structures and their signaling linguistic phenomena can help in selecting terms from a text that are reflective of its content (Hahn, 1989; Lewis & Sparck Jones, 1996). The idea can be traced back to Luhn (1957). There have been tentative attempts to incorporate knowledge about discourse structures into text indexing. Dennis (1967) determines the importance of a word based upon its frequency of occurrence within a text paragraph and across preceding and succeeding paragraphs. The tendency of occurrences of a word to clump is still considered useful in selecting terms (Bookstein, Klein, & Raita, 1998). Index term selection and weighting can be determined by the structural position of the term in the text (e.g., within title, within summary, in a first paragraph) (Bernstein & Williamson, 1984; Jonák, 1984; Wade, Willett, & Bawden, 1989; Liddy & Myaeng, 1993; Wilkinson, 1994; Burnett, Fisher, & Jones, 1996; Burger, Aberdeen, & Palmer, 1997; Fitzpatrick, Dent, & Promhouse, 1997). There is also much research into structural decomposition of texts according to different themes (Salton & Buckley, 1991; Hearst & Plaunt, 1993; Salton, Allan, Buckley, & Singhal, 1994; Salton, Singhal, Mitra, & Buckley, 1997), which might be useful for identifying important topic terms in texts.
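Weighting by structural position can be illustrated with a small sketch. The zone names, the zone weights, and the scoring rule (zone weight times within-zone frequency, summed over zones) are all assumptions made for this example; none of the cited studies is claimed to use these values.

```python
import re
from collections import Counter

# Hypothetical weights: terms in prominent zones count more than body terms.
ZONE_WEIGHTS = {"title": 3.0, "summary": 2.0, "first_paragraph": 1.5, "body": 1.0}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def position_weighted_terms(zones, zone_weights=ZONE_WEIGHTS):
    """Score each term by its frequency in every zone times that zone's weight."""
    scores = Counter()
    for zone, text in zones.items():
        weight = zone_weights.get(zone, 1.0)  # unknown zones are treated as body text
        for term, freq in Counter(tokenize(text)).items():
            scores[term] += weight * freq
    return scores

# Hypothetical document split into structural zones.
doc = {
    "title": "Poisson models for indexing",
    "body": "Indexing selects terms. Poisson models describe term frequencies.",
}
scores = position_weighted_terms(doc)
```

Terms that appear in the title as well as the body end up with higher scores than terms found only in the body, which is the effect the positional approaches aim for.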

9. SELECTION OF NATURAL LANGUAGE INDEX