


THE SELECTION OF NATURAL LANGUAGE INDEX TERMS

7. INDEX TERM WEIGHTING

7.2 Classical Weighting Functions

The law of Zipf

It was Luhn (1957) who discovered that the distribution patterns of words could give significant information about the property of being content bearing. He noted that high-frequency words tended to be common, non-content-bearing words. He also recognized that one or two occurrences of a word in a relatively long text could not be taken as significant in defining the subject matter. Earlier, Zipf (1949) had plotted the logarithm of the frequency of a term in a body of texts against rank (the highest frequency term has rank 1, the second highest frequency term has rank 2, etc.). For a large body of "well-written English", the resulting curve is nearly a straight line. Thus, the constant rank-frequency law of Zipf describes the occurrence characteristics of the vocabulary, when the distinct words are arranged in decreasing order of their log frequency of occurrence:

log(frequency) × rank = constant    (1)

This law expresses that the product of the logarithm of the frequency of each term and its rank is approximately constant. Other languages or other writing styles may be described by other, non-linear functions. Still, there is a relationship between the Zipfian curve and Luhn's notion of where the significant words lie: words with low significance are found at both tails of the distribution. Therefore, Luhn suggested using the words in the middle of the frequency range. These findings are the basis of a number of classical weighting functions.
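As an illustration (not part of the original text), the following minimal Python sketch computes the rank-frequency products of formula (1) for a pre-tokenized text; on a toy token list the products are rough, but on a large, well-written English corpus the law predicts them to be approximately constant for mid-ranked words.

```python
# A minimal sketch, assuming the text is already tokenized into a word list.
from collections import Counter
from math import log10

def zipf_products(tokens):
    counts = Counter(tokens)
    ranked = counts.most_common()          # highest frequency gets rank 1
    return [(word, rank, log10(freq) * rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# toy usage; a serious check needs a large body of text
tokens = "the cat sat on the mat while the cat slept".split()
for word, rank, product in zipf_products(tokens):
    print(word, rank, round(product, 2))
```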

Term frequency

It is assumed that the degree of treatment of a subject in a text is reflected by the frequency of occurrence in the text of terms naming that concept. A writer normally repeats certain words as he or she advances or varies the arguments and elaborates on an aspect of the subject. This means of emphasis is taken as an indicator of significance. A content term that occurs frequently in a text is more important in the text than an infrequent term. The frequency of occurrence of a content word is used to indicate term importance for content representation (Luhn, 1957; Baxendale, 1958; Salton, 1975a, p. 4 ff.; Salton & McGill, 1983, p. 59 ff.; Salton, 1989, p. 279).

The term frequency (tf) measures the frequency of occurrence of an index term in the document text (Salton & Buckley, 1988):

tf_i = frequency of occurrence of index term i in the text.    (2)

The occurrence of a rare term in a short text is more significant than its occurrence in a long text. The logarithmic term frequency reduces the importance of the raw term frequency in collections with widely varying text lengths (cf. length normalization below) (Sparck Jones, 1973; Salton & Buckley, 1988; Lee, 1995):

log(tf_i) = common logarithm of the frequency of occurrence of index term i in the text    (3)

ln(tf_i) = natural logarithm of the frequency of occurrence of index term i in the text.    (4)
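The following minimal Python sketch (an illustration under assumed tokenization, not the book's code) computes the raw and logarithmic term frequencies of formulas (2)-(4):

```python
# Raw term frequency (2), common-log tf (3), and natural-log tf (4).
from collections import Counter
from math import log10, log

def term_frequencies(tokens):
    tf = Counter(tokens)                              # formula (2): raw tf_i
    log_tf = {t: log10(f) for t, f in tf.items()}     # formula (3): common logarithm
    ln_tf = {t: log(f) for t, f in tf.items()}        # formula (4): natural logarithm
    return tf, log_tf, ln_tf

tf, log_tf, ln_tf = term_frequencies("text analysis of text data".split())
print(tf["text"], log_tf["text"], ln_tf["text"])      # 2, ~0.301, ~0.693
```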

Index terms with a high term frequency are good at representing text content, especially in long texts and in texts containing many significant or technical terms. For short texts, term frequency information is negligible (most of the terms occur once or twice) or even misleading. Anaphoric constructs and synonyms in the text hide the true term frequency (Bonzi & Liddy, 1989; Smeaton, 1992). It is assumed that high-frequency content-bearing terms represent the main topics of the text. When an index term occurs with a frequency higher than one would expect in a certain passage of the text, it possibly represents a subtopic of the text (Hearst & Plaunt, 1993).

Inverse document frequency

After elimination of stopwords, a text still contains many common words that are poor indicators of its content. Common words tend to occur in numerous texts of a collection and often seem randomly distributed over all texts. The more texts a term occurs in, the less important it may be. For instance, the term "computer" is not a good index term for a document collection in computing, no matter what its frequency of occurrence in a text of the collection. The more rarely a term occurs in individual texts, the more discriminating that term is. Therefore, the weight of a term should be inversely related to the number of document texts in which the term occurs, i.e., to the document frequency of the term (Sparck Jones, 1972; Salton & Yang, 1973; Salton, 1975a, p. 4 ff.; Salton & McGill, 1983, p. 63; Salton, 1989, p. 279 ff.; Greiff, 1998). An inverse document frequency factor (idf factor) is commonly used to incorporate this effect. The logarithm dampens the effect of the inverse document frequency factor. The inverse document frequency (idf) weight is commonly computed as (Sparck Jones, 1973; Salton & Buckley, 1988; Lee, 1995):

idf_i = log(N / n_i)    (5)

where
log = common logarithm (an alternative is ln = natural logarithm)
N = number of documents in the reference collection
n_i = number of documents in the reference collection having index term i.
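A minimal Python sketch of formula (5), assuming the reference collection is available as a list of term sets (the data and names are illustrative only):

```python
# Inverse document frequency: rarer terms receive higher weights.
from math import log10

def idf(term, collection):
    N = len(collection)                                  # documents in the reference collection
    n_i = sum(1 for doc in collection if term in doc)    # documents containing the term
    return log10(N / n_i) if n_i else 0.0                # guard: term absent from the collection

collection = [{"computer", "index"}, {"computer", "retrieval"}, {"zipf", "law"}]
print(idf("computer", collection))   # log10(3/2) ~ 0.18: common term, low weight
print(idf("zipf", collection))       # log10(3/1) ~ 0.48: rare term, higher weight
```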

An inverse document frequency weight is collection dependent. It is usually obtained from a collection analysis prior to the actual indexing of the documents and is based on the distribution of the term in a reference collection. The reference collection is customarily the complete text corpus to be indexed. It may also be a general corpus that reflects a broad range of texts (e.g., the Brown corpus in English) (cf. Evans et al., 1991). When the reference collection changes over time, the weight of an index term should be recomputed each time a document is added to or deleted from the collection. This is not only impractical, but also results in an unstable text representation. So, the use of an inverse document frequency factor based on a changing reference collection is discouraged (Salton & Buckley, 1988).

Other types of reference collections are possible. For instance, Hearst and Plaunt (1993) consider the complete text of a document as the reference frame for computing the weight of index terms of small text segments (3-5 lines) in order to discriminate the subtopics of these segments.

The inverse document frequency factor is important in identifying content-bearing index terms in texts (Sparck Jones, 1973). Sometimes, index terms with a low inverse document frequency value are eliminated as stopwords (e.g., Smeaton, O'Donnell, & Kelledy, 1995).

Product of the term frequency and the inverse document frequency

In judging the value of a term for purposes of content representation, two different statistical criteria come into consideration. A term appearing often in the text is assumed to carry more importance for content representation than a rarely occurring term. On the other hand, if that same term also occurs in many other documents of the collection, the term is possibly not as valuable as other terms that occur rarely in the remaining documents. This suggests that the specificity of a given term as applied to a given text can be measured by a combination of its frequency of occurrence inside that text (the term frequency or tf) and an inverse function of the number of documents in the collection to which it is assigned (the inverse document frequency or idf). The best terms are those occurring frequently inside the text, but rarely in the other texts of the document collection. These findings are the basis of a very popular term weighting function that computes the product of the term frequency and the inverse document frequency (tf × idf) of the index term (Sparck Jones, 1973; Salton, 1975a, p. 26 ff.; Salton & Buckley, 1988; Salton, 1989, p. 280 ff.; Harman, 1986 cited in Harman, 1992a). Usually, the product of the raw term frequency (2) and the common logarithm of the inverse document frequency (5) is computed:

tf_i × idf_i = tf_i × log(N / n_i)    (6)
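The sketch below (illustrative helper names, not the authors' implementation) combines the raw term frequency (2) and the idf of (5) into the tf × idf product of formula (6):

```python
# tf x idf: frequent in the text, rare in the rest of the collection.
from collections import Counter
from math import log10

def tf_idf(doc_tokens, collection_docs):
    tf = Counter(doc_tokens)
    N = len(collection_docs)
    weights = {}
    for term, freq in tf.items():
        # the indexed document is assumed to be part of the collection, so n_i >= 1
        n_i = sum(1 for d in collection_docs if term in d)
        weights[term] = freq * log10(N / n_i)   # formula (6)
    return weights

docs = [["retrieval", "model"], ["index", "term", "index"], ["zipf", "law"]]
print(tf_idf(["index", "term", "index"], docs))
```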

Length normalization

Document texts have different sizes. Long and verbose texts usually use the same terms repeatedly. As a result, the term frequency factors are large for long texts and small for short ones, obscuring the real term importance.

Also, long texts have numerous different terms. This increases the number of word matches between a query and a long text, increasing its chances of retrieval over shorter texts. To compensate for these effects, variations in length can be normalized. Length normalization is usually incorporated in weighting functions and mostly normalizes the term frequency factor in a weighting function. The following describes the most important length normalization functions.

The term frequency of an index term i is sometimes normalized by dividing the term frequency (2) by the maximum frequency with which any term occurs in the text:

tf_i / max_j(tf_j)    (7)

where
tf_j = term frequency of index term j in the text
j = 1 .. n (n = number of distinct index terms in the text).

The result of the above normalization is a term frequency weight that lies between 0 and 1. In a popular variant the normalized term frequency of (7) is weighted by 0.5 to decrease the difference in weights between terms that occur infrequently and terms that occur frequently. The weighted term frequency is further altered to lie between 0.5 and 1 (addition of 0.5). This variant is called the augmented normalized term frequency (Salton & Buckley, 1988; Lee, 1995):

0.5 + 0.5 × (tf_i / max_j(tf_j))    (8)
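A minimal Python sketch (illustrative only) of the maximum-tf normalization (7) and the augmented normalized term frequency (8):

```python
# Maximum-tf normalization and its augmented variant.
from collections import Counter

def normalized_tf(tokens):
    tf = Counter(tokens)
    max_tf = max(tf.values())
    ntf = {t: f / max_tf for t, f in tf.items()}        # formula (7): values in (0, 1]
    atf = {t: 0.5 + 0.5 * v for t, v in ntf.items()}    # formula (8): values in (0.5, 1]
    return ntf, atf

ntf, atf = normalized_tf("index term weighting of index terms".split())
print(ntf["index"], atf["term"])   # 1.0 for the most frequent term; 0.75 for a term occurring once
```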

A common way of length normalization is the cosine normalization, where each term weight is divided by a factor representing the Euclidean vector length (Salton & Buckley, 1988). The length of the vector is computed with all distinct indexable words. When the weight of index term i is computed with the term frequency (tf) (2), the normalized term weight of index term i is:

tf_i / √(Σ_{j=1..n} tf_j²)    (9)

where
tf_j = term frequency of index term j
j = 1 .. n (n = number of distinct index terms in the text).

Cosine length normalization can be applied to other weighting functions, such as the product of the term frequency and the inverse document frequency (tf × idf) (6), which yields the normalized term weight for index term i (cf. 9):

(tf_i × idf_i) / √(Σ_{j=1..n} (tf_j × idf_j)²)    (10)
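The following minimal Python sketch (the weight dictionaries are assumed to have been computed as in the earlier sketches) applies cosine length normalization to raw term frequencies as in (9) and to tf × idf weights as in (10):

```python
# Cosine normalization: divide each weight by the Euclidean vector length.
from math import sqrt

def cosine_normalize(weights):
    length = sqrt(sum(w * w for w in weights.values()))   # Euclidean vector length
    return {t: w / length for t, w in weights.items()}

raw_tf = {"index": 2, "term": 1, "weighting": 1}
print(cosine_normalize(raw_tf))       # formula (9) applied to raw tf
tf_idf_w = {"index": 0.95, "term": 0.48, "weighting": 0.48}
print(cosine_normalize(tf_idf_w))     # formula (10) applied to tf x idf weights
```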

Length normalization is beneficial for certain texts. It has been proven successful for indexing a document collection with texts of varying length (Sparck Jones, 1973; Salton & Buckley, 1988), especially when long texts are the result of verbosity.

Long texts have other causes than verbosity alone. One of them is the presence of multiple topics. In this case, the cosine normalization (9 and 10) causes the weight of a topic term to be decreased by the weight of non-relevant terms, i.e., terms that discuss the other topics. In a retrieval environment, this situation decreases the chances of retrieving documents that deal with multiple topics when only one of the topics is specified in the query (Lee, 1995). The augmented normalized term frequency (8) alleviates this effect, because the normalizing factor of this method, namely the maximum frequency of term occurrence in the text, usually has a modest value when the text deals with multiple topics. Another reason for long texts is that they contain much information about a specific topic. In a retrieval environment long documents are sometimes preferred over shorter ones that treat the same topic (Singhal, Salton, Mitra, & Buckley, 1996). Length normalization is a way of penalizing the term weights of longer documents, thereby reducing, if not removing completely, the advantage of long documents in retrieval (Strzalkowski, 1994). Pivoted length normalization increases or decreases the impact of a length normalization factor (Singhal, Buckley, & Mitra, 1996). Initial training queries retrieve an initial set of documents, and the probabilities of relevance and of retrieval are plotted against text length. Pivoted normalization then makes the normalization function weaker or stronger by reducing the deviation of the retrieval probabilities from the likelihood of relevance.
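A minimal sketch of pivoted length normalization in the spirit of Singhal, Buckley, & Mitra (1996): the ordinary normalization factor is rotated around a pivot so that its effect is weakened or strengthened. The slope and the example factors below are illustrative assumptions, not values from the source.

```python
# Pivoted length normalization: old_norm is any existing normalization factor
# (e.g., the Euclidean vector length); the pivot is typically its average value
# over the collection.
def pivoted_factor(old_norm, pivot, slope=0.75):
    # With slope < 1, factors above the pivot are reduced (long texts are
    # penalized less) and factors below the pivot are increased.
    return (1.0 - slope) * pivot + slope * old_norm

lengths = [20.0, 60.0, 180.0]            # assumed unpivoted normalization factors
pivot = sum(lengths) / len(lengths)      # pivot: the average factor
for n in lengths:
    print(n, pivoted_factor(n, pivot))
```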

Term discrimination value

The term discrimination model (Salton, Yang, & Yu, 1975; Salton & McGill, 1983, p. 66 ff.; Salton, 1989, p. 281 ff.) assumes that the most useful terms for content identification of natural language texts are those capable of distinguishing the documents of a collection from each other. The term discrimination value measures the degree to which the use of the term helps to distinguish the documents from each other. For this purpose, the concept of connectivity is used. Bad index terms are the ones that increase the degree of connectivity between texts, while good index terms decrease it.

The term discrimination value of an index term is computed as the difference in connectivity between the texts, before and after adding the index term.

The simplest way to compute the degree of connectivity is by taking the average of all mutual similarities between the text pairs in the collection.

Similarities between the texts are obtained with similarity functions applied to their term vectors (cf. Jones & Furnas, 1987).
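As an illustration (assumed data and helper names, not the model's original implementation), the sketch below computes a term's discrimination value as the difference between the average pairwise cosine similarity of the collection without the term and with it; good discriminators keep the documents apart, so removing them raises the average similarity.

```python
# Term discrimination value via average pairwise cosine similarity.
from math import sqrt

def cosine(u, v):
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in terms)
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def average_similarity(vectors):
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def discrimination_value(term, vectors):
    without = [{t: w for t, w in v.items() if t != term} for v in vectors]
    return average_similarity(without) - average_similarity(vectors)

docs = [{"computer": 2, "index": 1}, {"computer": 1, "retrieval": 2}, {"computer": 1, "zipf": 1}]
print(discrimination_value("computer", docs))   # negative: a poor discriminator
print(discrimination_value("zipf", docs))       # positive: a better discriminator
```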

The term discrimination value is collection dependent. The value is comparable with the inverse document frequency weight and may replace the latter in a tf × idf weighting function (6) (Salton, Yang, & Yu, 1975).

However, while very frequent terms tend to have low weights under either function, discrimination values for medium frequency terms tend to be higher than for low frequency terms (Sparck Jones, 1973). The term discrimination model has been criticized because it especially discriminates a document from all other documents of the collection (Salton, 1989, p. 284). It is possible that many other relevant documents regarding the topic expressed by an index term are present in the collection.

Term relevance weights

A term relevance weight of an index term is learned from its probability of occurrence in relevant and non-relevant documents (Maron & Kuhns, 1960; Salton, 1989, p. 284 ff.). The relevant and non-relevant sets are assumed to be representative of the complete corpus. Commonly, term relevance weights are computed on the basis of relevance information from a number of queries formulated with the index term. The term relevance weights are based on term occurrence characteristics in the relevant and non-relevant texts. For example, terms occurring mostly in texts identified as relevant to the query receive higher weights than terms occurring in the non-relevant texts. A number of different relevance weighting functions have been formulated (Bookstein & Swanson, 1975; Robertson & Sparck Jones, 1976; Sparck Jones, 1979; Salton, 1989, p. 284 ff.; Fuhr & Buckley, 1991).

A preferred function for the weight of an index term i is (Robertson & Sparck Jones, 1976; Sparck Jones, 1979):

log[ (r_i / (R − r_i)) / ((n_i − r_i) / (N − n_i − R + r_i)) ]    (11)

where
N = the number of texts in the training set
R = the number of relevant texts for the query
n_i = the number of texts having index term i
r_i = the number of relevant texts having index term i.
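A minimal Python sketch of formula (11), with variable names following the definitions above; the counts are illustrative, and the usual 0.5 corrections for zero counts are omitted for clarity.

```python
# Relevance weight in the form usually attributed to Robertson & Sparck Jones (1976).
from math import log10

def relevance_weight(N, R, n_i, r_i):
    # ratio of the term's odds of occurring in relevant versus non-relevant texts
    relevant_odds = r_i / (R - r_i)
    nonrelevant_odds = (n_i - r_i) / (N - n_i - R + r_i)
    return log10(relevant_odds / nonrelevant_odds)

# assumed training data: 100 texts, 10 relevant to the query; the term occurs
# in 20 texts, 8 of which are relevant
print(relevance_weight(N=100, R=10, n_i=20, r_i=8))   # ~1.41: a strongly indicative term
```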

In real applications, it is difficult to have enough relevance information for each index term available in order to estimate the required probabilities (cf. Croft & Harper, 1979; Robertson & Walker, 1997).

Phrase weighting

It has been shown that phrases potentially give better coverage of text content than single-word terms. When selecting phrases from a text, not all phrases equally define its content. A text can contain very specific concepts that are of no importance to include in its representation. Phrase weighting (including proper name phrase weighting) helps in deciding which phrases to include in the representation. Phrase weighting also contributes to a better discrimination of phrasal terms when matching query and text representations in a retrieval process.

Because of their lower frequency and different distribution characteristics, the weighting of phrases can differ from single-word weighting (Fuhr, 1992; Lewis & Sparck Jones, 1996). However, the methods currently in use employ, in one way or another, classical weighting functions for single words. It is generally agreed that phrase weighting needs further investigation (Fagan, 1989; Croft et al., 1991; Buckley, 1993; Strzalkowski et al., 1997).

When computing a phrase weight, a phrase can be considered as a separate concept or as a set of words (Croft et al., 1991).

1. When the phrase is considered as a separate concept, its weight is independent of the weights of its composing components. This weight can be proportional to the number of times the phrase occurs in the text (term frequency tf) (2) and/or inversely proportional to the number of document texts in which the phrase occurs (inverse document frequency idf) (5) (Dillon & Gray, 1983; Croft et al., 1991; Strzalkowski, 1994). In order to obtain accurate weights, such a strategy requires a correct normalization of the phrases to a standard form and a resolution of anaphors (Smeaton, 1986).

2. The weight of a phrase can be a combination of the weights of its composing single words. Then, the weight is computed as the average weight of the components (Salton, Yang, & Yu, 1975; Fagan, 1989; Croft et al., 1991; Evans et al., 1991), as the product of the component weights (Croft et al., 1991), or as the highest weight among the component weights (Croft et al., 1991); a sketch of these combinations follows the list below. The weight of a phrase component is usually computed as the product of the term frequency (tf) (2) and the inverse document frequency (idf) (5) of the individual word. Although the weight of a single component may influence the weight of the phrase, with this strategy the phrase weight usually does not differ strongly from the weights of its components (Fagan, 1989).

3. Jones, Gassie, and Radhakrishnan (1990) employ a combined approach and weight phrases in proportion to the frequency of occurrence of the complete phrase and to the frequencies of occurrence of its composing words.
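As announced in item 2 above, the following minimal Python sketch (illustrative only) shows the three component-based combinations: average, product, and maximum of the tf × idf weights of a phrase's single-word components.

```python
# Combining single-word component weights into a phrase weight.
def phrase_weight(component_weights, strategy="average"):
    if strategy == "average":
        return sum(component_weights) / len(component_weights)
    if strategy == "product":
        result = 1.0
        for w in component_weights:
            result *= w
        return result
    if strategy == "maximum":
        return max(component_weights)
    raise ValueError(strategy)

# assumed tf x idf weights of the words of a three-word phrase
weights = [0.4, 0.9, 0.6]
for s in ("average", "product", "maximum"):
    print(s, phrase_weight(weights, s))
```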

8. ALTERNATIVE PROCEDURES FOR
