
CHAPTER 3

Lexical Chains as Document Features

In this chapter, we describe our algorithm for computing lexical chains and explain how lexical chains can be used as document features. We also describe two different chain selection schemes.

Not all words in a document can be used to form lexical chains. At a bare minimum, stop words must be excluded, as they carry no meaning. We refer to the set of words in a document that are used to compute lexical chains as candidate words. As a preprocessing step, we first run the WSD algorithm on the input document, which marks up each word (other than stop words) with its part of speech and sense number from WordNet. The resulting words are filtered to obtain the set of candidate words used to form lexical chains. This filtering can be done in various ways to obtain different variants of lexical chains.

Silber and McCoy [11] suggested using only the nouns and compound nouns present in a document to compute lexical chains, since most concepts are referred to by nouns. Other researchers are silent on the issue but seem to prefer nouns alone. We wanted to test this hypothesis and created three categories of candidate words: N (nouns only), NV (nouns + verbs), and NVA (nouns + verbs + adjectives + adverbs). We later demonstrate empirically that nouns alone are adequate to represent the information contained in the documents.
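A minimal sketch of this filtering step, assuming each disambiguated word is a (term, pos, sense) triple and the single-letter POS tags follow WordNet conventions ('n', 'v', 'a', 'r'); the function and category names are ours:

```python
# Candidate-word filtering for the three categories described above.
# Assumes WSD preprocessing has already produced (term, pos, sense) triples.
CATEGORIES = {
    "N":   {"n"},                      # nouns only
    "NV":  {"n", "v"},                 # nouns + verbs
    "NVA": {"n", "v", "a", "r"},       # + adjectives and adverbs
}

def candidate_words(tagged_words, category="N"):
    """Keep only words whose part of speech falls in the chosen category."""
    allowed = CATEGORIES[category]
    return [w for w in tagged_words if w[1] in allowed]
```

Stop words never reach this stage, since the WSD step tags only content words.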

Lexical chains can be computed at various granularities: across sentences, paragraphs or documents. In general, to compute lexical chains, each candidate word in a unit of text is compared with each of the lexical chains identified so far. If a candidate word has a ‘cohesive relation’ with the words in a chain, it is added to that chain. If, on the other hand, a candidate word is not related to any of the chains, a new chain is created and the candidate word is added to it.

Each lexical chain is identified by a unique numeric identifier and contains a set of semantically related words. Each word in a chain is represented as a 4-tuple <term, pos, sense, rel>, where ‘term’ is the actual word, ‘pos’ is its part of speech, ‘sense’ is the WordNet sense number and ‘rel’ is the relation of the word to the chain. In theory, any of the WordNet relations can be used to relate candidate words to chains.
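The 4-tuple and chain structures described above can be sketched as follows; the class and field names are ours, as the text does not prescribe an implementation:

```python
from typing import NamedTuple

class ChainWord(NamedTuple):
    """The <term, pos, sense, rel> 4-tuple described in the text."""
    term: str    # the actual word
    pos: str     # part of speech ('n', 'v', ...)
    sense: int   # WordNet sense number
    rel: str     # relation of this word to the chain, e.g. 'synonymy'

class LexicalChain:
    """A chain: a unique numeric id plus its semantically related words."""
    def __init__(self, chain_id: int):
        self.chain_id = chain_id
        self.words: list[ChainWord] = []

    def add(self, word: ChainWord) -> None:
        self.words.append(word)
```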

Hirst and St-Onge [12] suggested assigning weights to the various relations, with the identity and synonymy relations having the highest weights, immediate hypernymy and hyponymy having intermediate weights, and sibling hypernymy or hyponymy having the lowest weight. Under this scheme, a word is added to the chain to which it is related with the highest weight.
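The selection rule can be sketched as follows; the weight values and the `relation_of` helper are hypothetical (the original weights of Hirst and St-Onge are not reproduced here):

```python
# Hypothetical weights reflecting the ordering described above:
# identity/synonymy highest, immediate hypernymy/hyponymy intermediate,
# sibling relations lowest.
WEIGHTS = {"identity": 3, "synonymy": 3,
           "hypernymy": 2, "hyponymy": 2,
           "sibling": 1}

def best_chain(word, chains, relation_of):
    """Return the chain to which `word` is related with the highest weight,
    or None if it is related to no chain. `relation_of(word, chain)` is an
    assumed helper that returns a relation name or None."""
    scored = [(WEIGHTS[r], c) for c in chains
              if (r := relation_of(word, c)) is not None]
    return max(scored, key=lambda t: t[0])[1] if scored else None
```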

According to Morris and Hirst [6], using relations other than identity and synonymy implies a transitive relation between words and chains, i.e., if A is related to B and B is related to C, then A is related to C. Thus, including hypernym and hyponym relations in a chain results in chains representing composite topics, which reduces the granularity of the topic contained in each chain. For example, when hypernyms are used to compute lexical chains, the words ‘motor vehicle’, ‘car’ and ‘bike’ will end up in a single chain, since ‘motor vehicle’ is a hypernym of both ‘car’ and ‘bike’, when in reality they are distinct but related concepts.

We therefore expect that using only the identity and synonymy relations will result in crisp chains, each of which represents a single topic, and that these two relations alone are better suited to forming chains for feature vector generation. This hypothesis is tested and validated in the next chapter, where we compare against the performance of lexical chains computed using other WordNet relations.

3.1.1 Global Set

One of the distinguishing features of our algorithm is that we compute lexical chains across documents. This requires us to store the identified chains, as well as provide a mechanism to create new chains. We facilitate this by introducing a data structure which we refer to as the Global Set. The global set is a set of lexical chains, each of which appears in at least one document. The global set maintains the following information:

• a list of documents processed

• the chains which appear in at least one document

• the number of times a chain occurs in a document

This information is useful when computing feature vectors. We have also implemented a serialisation mechanism on the global set data structure, which lets us save and reuse a global set and thus run the lexical chaining algorithm incrementally over many sets of documents.
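One possible shape for such a global set, including a serialisation mechanism sketched with Python's pickle (all names are ours; the thesis does not specify the implementation):

```python
import pickle

class GlobalSet:
    """Tracks documents processed, the chains seen, and per-document
    chain occurrence counts, as listed above."""
    def __init__(self):
        self.documents = []   # list of documents processed
        self.chains = {}      # chain id -> chain contents
        self.counts = {}      # chain id -> {doc id -> occurrences}

    def record(self, chain_id, doc_id):
        """Count one occurrence of a chain in a document."""
        doc_counts = self.counts.setdefault(chain_id, {})
        doc_counts[doc_id] = doc_counts.get(doc_id, 0) + 1

    def save(self, path):
        """Serialise the global set so a later run can bootstrap from it."""
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)
```

Saving after each batch of documents and loading before the next is what makes the incremental processing described above possible.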

3.1.2 Algorithm

Our algorithm to compute lexical chains is shown as Algorithm 1. The algorithm maintains a global set, which can either be initialised to the empty set or bootstrapped with a previously computed global set. The documents are pre-processed using the Extended Lesk Algorithm, and the disambiguated words are then filtered to obtain the candidate words for each document.

Algorithm 1: Computing Lexical Chains

1: Maintain a global set of lexical chains
2: for each document do
3:     for each candidate word in the document do
4:         Identify lexical chains in the global set with which the word is related
5:         if no chains are identified then
6:             Create a new lexical chain for this word and insert it into the global set
7:         end if
8:         Add the candidate word to the identified/created lexical chain
9:     end for
10: end for

Each candidate word is compared with each of the chains in the global set. If the word is related to a chain, it is added to that chain. If no such chain is identified, a new chain is created, the word is added to it, and the new chain is inserted into the global set.

This algorithm exhaustively computes all possible lexical chains for each document. We believe it is better to perform chain selection on this global set of all possible chains than to do it locally at each document. We use only the identity and synonymy relations to compute the chains: a word has an identity or synonymy relation with another word only if both words occur in the same synset in WordNet.
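Restricted to identity and synonymy under this rule, the chaining step of Algorithm 1 reduces to grouping disambiguated words by their synset. A self-contained sketch, using a toy synset table in place of a real WordNet lookup (the table entries and function names are illustrative only):

```python
# Toy stand-in for a WordNet lookup: (term, pos, sense) -> synset id.
TOY_SYNSETS = {
    ("car", "n", 1):  "car.n.01",
    ("auto", "n", 1): "car.n.01",   # synonym of 'car': same synset
    ("bank", "n", 1): "bank.n.01",
}

def build_chains(documents, synset_of=TOY_SYNSETS.get):
    """Group candidate words across documents into chains, one per synset.

    `documents` maps doc id -> list of (term, pos, sense) candidate words.
    Returns {synset id: [(doc id, word), ...]} -- the global set of chains.
    """
    global_set = {}
    for doc_id, words in documents.items():
        for word in words:
            key = synset_of(word)
            if key is None:
                continue                       # no synset: cannot be chained
            global_set.setdefault(key, []).append((doc_id, word))
    return global_set
```

Because two words join the same chain only when they share a synset, each chain stays a single topic, as argued above; ‘car’ and ‘auto’ end up together while ‘bank’ forms its own chain.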
