THE ASSIGNMENT OF CONTROLLED LANGUAGE INDEX TERMS

3. THESAURUS TERMS

A first and common form of vocabulary control is the assignment of index terms as listed and described in a thesaurus (Harter, 1986, p. 42 ff.).

The thesaurus offers a precise vocabulary to describe a document text. The original terms of the text are transformed to more uniform naming or more general concepts. A thesaurus for automatic indexing has the form of a machine-readable dictionary (MRD).

A thesaurus provides a grouping or classification of the terms used in a given topic area into classes known as thesaurus classes. The terms of a thesaurus class have a certain semantic relatedness due to their inherent meanings (Salton, 1975b, p. 461 ff.). Each class has a representative term, called a thesaurus class term. The thesaurus is used to replace a term of the text with its thesaurus class term. Class membership can be weighted (McCune, Tong, Dean, & Shapiro, 1985).

A thesaurus portrays the semantic relationships that hold between terms when they refer to different aspects of a common concept or domain (Fox, 1980; Wang & Vandendorpe, 1985; Fagan, 1989). The main relationships are the ones that define synonyms or that broaden or narrow the meaning of a term. Other kinds of semantic relationships are possible.
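The broader and narrower relationships can be illustrated with a minimal sketch. It assumes a thesaurus stored as a simple mapping from each term to its immediate broader term; the entries in `BROADER` and the function names `broaden` and `narrower` are hypothetical and chosen only for illustration.

```python
# Hypothetical broader-term relation: term -> its immediate broader term.
# A real thesaurus would be loaded from a machine-readable dictionary.
BROADER = {
    "poodle": "dog",
    "dog": "mammal",
    "mammal": "animal",
}


def broaden(term, levels=1, broader=BROADER):
    """Walk up the hierarchy by at most `levels` steps.

    A term that is unknown to the thesaurus is returned unchanged.
    """
    current = term
    for _ in range(levels):
        if current not in broader:
            break
        current = broader[current]
    return current


def narrower(term, broader=BROADER):
    """Return the terms whose immediate broader term is `term`."""
    return sorted(t for t, b in broader.items() if b == term)


print(broaden("poodle"))      # one step up the hierarchy
print(broaden("poodle", 2))   # two steps up
print(narrower("dog"))        # immediate narrower terms
```

Limiting the number of levels is one possible design choice: it lets an indexer decide how generic the assigned index term should be.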

Thesaurus classes have a function similar to that of the ontologies used in natural language processing. Whereas philosophical work on ontology traditionally concerns questions about the nature of being and existence, in artificial intelligence communities ontologies refer to the general organizations of concepts and entities found in knowledge representations, which are sharable and reusable across knowledge bases (Bateman, 1995). In natural language processing, ontologies have been primarily used for modeling the semantics of lexical items (Dahlgren, 1995).

3.1 The Function of Thesaurus Terms

The main function of a thesaurus is to convert terms that have a related meaning but unrelated surface forms into more general and uniform index terms. More specifically, a thesaurus has the following functions (see also Miller, 1997).

1. A first important function is to control the synonym problem of natural language (Salton, 1975b, p. 461). Synonym words (e.g., “pests” and “vermin”) can be handled by word substitution. The thesaurus puts words that are synonyms and are intersubstitutable into equivalence classes. If natural language contains several terms that might be used to represent the same or nearly the same concept, the thesaurus usually guides the choice of vocabulary toward a single valid term. Even in restricted subject domains, a synonym list can become quite large. Substitution with true synonyms can be handled effectively, but there is the problem of near synonyms. A thesaurus can also be used to generalize morphological and syntactical variants of index terms when stemming terms or normalizing phrases (see chapter 4).

2. If the thesaurus offers a hierarchical relationship between the words that it contains, it can be employed to broaden terms (Salton, 1975b, p. 461). A term extracted from the text is then replaced by a broader thesaurus class term. Such broad index terms are useful for generic searches and routing tasks. Occasionally, a thesaurus can be used to narrow terms.

3. In natural language many words have more than one meaning or sense. A thesaurus may therefore contain word senses from which the meaning of a polysemous or homonymous word may be chosen (Voorhees, 1994). For indexing text, the use of such a thesaurus presupposes a procedure for identifying the meaning. Techniques for word sense disambiguation (see Krovetz & Croft, 1992; Guthrie, Pustejovsky, Wilks, & Slator, 1996) include the application of knowledge of the syntactic class of the word to be indexed (e.g., noun) and of domain knowledge that relates a word class to a word meaning. When a word sense is not, or not solely, determined by its syntactic class, selecting the correct or the most probable word sense is only feasible by considering the context in which the term occurs. A word's context ranges from the local context (e.g., words in the same sentence or surrounding sentences), through the complete text in which the word occurs, to the complete corpus (e.g., to disambiguate word senses in short texts). How best to characterize the contexts associated with word senses for automated word sense disambiguation remains an open question. When people disambiguate word senses in reading, they seem to make more use of local context: the exact sequence of words immediately preceding and following the polysemous word (Miller, 1995). Machine-readable dictionaries employed in word sense disambiguation contain a short textual description for each sense of each word. This description can be used in disambiguation, for instance, by searching for occurrences of words from the description in the document (Lesk, 1986 cited in Krovetz & Croft, 1992). Alternatively, categories can be defined that represent the different senses of a word (Voorhees, 1994). Then, the number of words in the text that have senses belonging to a given category is counted. The senses that correspond to the category with the largest count are selected as the intended senses of the ambiguous words. In restricted subject domains contextual rules can be implemented to disambiguate word senses (Krovetz & Croft, 1992). We refer to the special issue of Computational Linguistics on word sense disambiguation (24 (1), 1998).
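The dictionary-description approach to disambiguation mentioned under point 3 can be sketched as follows. This is a simplified illustration in the spirit of that idea, not a faithful reproduction of any published algorithm: the sense inventory `SENSES`, the stopword list, and the function names are hypothetical, and the score is a plain count of words shared between a sense description and the context.

```python
import re

# Hypothetical sense inventory: word -> {sense label: short textual description}.
SENSES = {
    "bank": {
        "bank/finance": "an institution that accepts deposits and lends money",
        "bank/river": "the sloping land beside a river or stream",
    }
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "or", "the", "that", "of", "in", "to", "beside"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(word, context, senses=SENSES):
    """Pick the sense whose description shares the most words with the context.

    Returns the chosen sense label and the overlap score of every sense.
    On a tie, the sense listed first in the inventory wins.
    """
    context_words = tokens(context) - {word}
    scores = {
        label: len(tokens(description) & context_words)
        for label, description in senses[word].items()
    }
    return max(scores, key=scores.get), scores


sense, scores = disambiguate("bank", "She deposits money at the bank every month")
print(sense, scores)
```

Here the context is a single sentence, that is, the local context. Passing the complete document text as `context` would correspond to the wider notion of context discussed above.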

Thesaurus class terms have been effective for indexing document texts. They can replace the natural language index terms extracted from a text. Alternatively, they can complement the natural language index terms of a text representation. This is analogous to the use of a thesaurus to expand the terms of a query with related terms in a retrieval system (van Rijsbergen, 1979, p. 31 ff.; Salton & Lesk, 1971; Fox, 1980; Gauch & Smith, 1991).
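The difference between replacing and complementing index terms can be made concrete with a short sketch. The mapping `CLASS_TERM` and the function `represent` are hypothetical names; the sketch assumes each term belongs to at most one thesaurus class.

```python
# Hypothetical thesaurus: term -> thesaurus class term.
CLASS_TERM = {"pests": "vermin", "vermin": "vermin"}


def represent(terms, mode="replace", class_term=CLASS_TERM):
    """Build a text representation from extracted index terms.

    mode="replace":    each term is replaced by its thesaurus class term.
    mode="complement": the class term is added next to the original term.
    Terms unknown to the thesaurus are kept as they are; duplicates are dropped.
    """
    if mode not in ("replace", "complement"):
        raise ValueError("mode must be 'replace' or 'complement'")
    result = []
    for term in terms:
        cls = class_term.get(term)
        if cls is None:
            candidates = [term]
        elif mode == "replace":
            candidates = [cls]
        else:
            candidates = [term, cls]
        for candidate in candidates:
            if candidate not in result:
                result.append(candidate)
    return result


print(represent(["pests", "garden"], mode="replace"))
print(represent(["pests", "garden"], mode="complement"))
```

The same function could be applied to the terms of a query, which corresponds to the query expansion use mentioned above.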

Thesaurus class terms enhance the recall of a retrieval operation. A thesaurus is also useful for indexing a text by word senses, which increases the precision of a retrieval operation (Krovetz & Croft, 1992). Thesaurus class terms are very useful index terms especially in restricted subject domains, where the community of scholars and scientists working in the discipline shares word meanings. However, for heterogeneous text collections, more must become known about the desired form and content of thesauri and about the processes of word sense disambiguation that can be automated (Smeaton, 1992; Schütze & Pedersen, 1994).

3.2 Thesaurus Construction and Maintenance

An important problem with the use of thesauri is their construction and maintenance. Thesauri are usually manually constructed. Sometimes, on-line versions of existing published dictionaries are available. Additionally, there are efforts to automatically or semi-automatically build thesauri.

Building a thesaurus manually or intellectually is a time-consuming and costly task. It is usually constructed by a committee of experts who review the subject matter and propose reasonable class arrangements (Salton, 1989, p. 301). The thesaurus classes cover restricted topics of specified scope, and collectively they cover the complete subject area evenly. Hand-built thesauri are often confined to restricted subject domains and are usually not employable outside the collections for which they were built.

However, in the past many dictionaries have been built manually. Dictionary entries evolved for the convenience of human readers, and not for use by machines. But this is changing. The thesaurus becomes an on-line version of a semantically coded dictionary (see Guthrie et al., 1996 for an overview). Roget already in 1946 used a procedure for compiling a thesaurus of English words (cited in Luhn, 1957). He created categories of words that had a family resemblance on a conceptual level and arrived at approximately 1000 of these categories. Also, in the Longman Dictionary of Contemporary English (LDOCE) (published in 1981) lexicographers supplemented the machine-readable version with codes that give the semantic category of a word. LDOCE can be used to disambiguate word senses. Parsers have been developed that analyze the definition texts of LDOCE (see Boguraev & Briscoe, 1989). Networks of noun senses for both the LDOCE and the Dutch Van Dale Dictionary have been created using a technique for disambiguation that combines information from both dictionaries with information from the Van Dale bilingual Dutch-English dictionary (Guthrie et al., 1996). Another example of an on-line dictionary is WordNet, a lexical database for English developed at Princeton University, NJ (Miller, 1990, 1995). It contains words, word senses, syntactic word classes, and important semantic relations between words. A current goal of WordNet is developing tools for determining a word sense based on the context in which a word is used. An important on-line lexical database for Dutch is CELEX, created by the Centre for Lexical Information at the Katholieke Universiteit Nijmegen. The availability of large on-line thesauri increases the applicability of assigning thesaurus class terms when indexing (Fox, Nutter, Ahlswede, Evens, & Markowitz, 1988; Liddy & Myaeng, 1993; Liddy & Paik, 1993; Liddy, Paik, & Yu, 1994). A generic on-line published thesaurus is often restricted to common usage of words. When used for technical domains, which have their own terminology, it will have serious coverage gaps. Specialized dictionaries that cover the important terms and concepts of their disciplines may expand the coverage of a standard dictionary.

One major disadvantage inherent in the use of any thesaurus is the necessity to maintain it. New thesaurus classes of interest emerge, and the thesaurus needs to accommodate collection growth. Maintenance of the thesaurus is especially important in disciplines where the vocabulary changes rapidly (e.g., computer science). The cost of implementing and maintaining an on-line thesaurus, as well as the need for collection-specific thesauri, motivates research into building thesauri automatically or semi-automatically. This research focuses on discovering related words directly from the contents of a textual database. It dates back to Dennis (1967), to Sparck Jones' work on term classification (1970, 1971), to Salton's work on automatic thesaurus construction and query expansion (1968, 1980), and to van Rijsbergen's work on term co-occurrence (van Rijsbergen, Harper, & Porter, 1981). Generally, automatically generated thesauri attempt to identify semantic relationships between words based on statistical and syntactic patterns.

3.2.1 Statistical methods

The statistical methods are based on patterns of word co-occurrence in texts of a sample collection (Jing & Croft, 1994). The methods assume that words that are contextually related, i.e., that often appear in the same sentence, paragraph, or document, are semantically related and hence should be classified in the same class. The more specific the context in which the words occur, the more precise the classification will be. A common procedure is to compute the similarity between a pair of terms based on coincidences of the terms in texts. When pair-wise similarities are available between all useful term pairs, an automatic term-classification process can collect all terms with sufficiently large pair-wise similarities into common classes (Sparck Jones, 1971, p. 45 ff.). Among these term-classification strategies are the single-link and complete-link class-construction methods (Salton, 1989, p. 302). In a single-link classification system, each term must have a similarity exceeding a stated threshold value with at least one term in the same class. In the complete-link or clique classification, each term has a similarity to all other terms in the same class that exceeds the threshold value. Alternatively, term classifications can be automatically constructed by adapting an existing document classification and by assuming that those terms that occur jointly in the document classes could be used to form the desired term classes (cf. the learning of text classifiers, below). Peat and Willett (1991) argue against the utility of co-occurrence information in thesaurus construction. They observe that because synonyms often do not occur together in the same context, a co-occurrence based approach may have difficulty identifying synonymy relations. Although synonyms frequently do not co-occur, they tend to share neighbors that occur with both. Schütze and Pedersen (1994) define semantic closeness between terms as having the property of sharing common neighbors.
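The contrast between single-link and clique classification can be sketched on a toy collection. Several choices here are assumptions made only for illustration: the documents in `DOCS` are invented, term similarity is measured with the Dice coefficient over document-level co-occurrence, single-link classes are computed as connected components of the thresholded similarity graph, and clique classes are computed as maximal cliques (so a term may belong to more than one class).

```python
from itertools import combinations

# Hypothetical toy collection: each document is the set of terms it contains.
DOCS = [
    {"thesaurus", "index", "term"},
    {"thesaurus", "index"},
    {"index", "term", "query"},
    {"query", "retrieval"},
    {"retrieval", "query", "term"},
]


def dice(a, b, docs):
    """Dice coefficient over the sets of documents in which each term occurs."""
    docs_a = {i for i, d in enumerate(docs) if a in d}
    docs_b = {i for i, d in enumerate(docs) if b in d}
    total = len(docs_a) + len(docs_b)
    return 2 * len(docs_a & docs_b) / total if total else 0.0


def threshold_graph(docs, threshold):
    """Link every pair of terms whose similarity exceeds the threshold."""
    terms = sorted(set().union(*docs))
    adjacency = {t: set() for t in terms}
    for a, b in combinations(terms, 2):
        if dice(a, b, docs) > threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def single_link(adjacency):
    """Single-link classes: connected components of the threshold graph."""
    seen, classes = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            term = stack.pop()
            if term in component:
                continue
            component.add(term)
            stack.extend(adjacency[term] - component)
        seen |= component
        classes.append(sorted(component))
    return classes


def clique_classes(adjacency):
    """Clique classes: maximal cliques, found with the Bron-Kerbosch algorithm."""
    found = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            found.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return sorted(found)


graph = threshold_graph(DOCS, threshold=0.5)
print(single_link(graph))
print(clique_classes(graph))
```

On this toy data the single-link method chains all five terms into one class, because each term is similar to at least one other member, whereas the clique method yields only small, tightly related classes. This illustrates why the single-link criterion tends to produce broader classes than the clique criterion.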

Statistically based thesaurus construction can yield acceptable results when learned from a large corpus of texts with a specialized vocabulary, but the technique is questionable with heterogeneous text databases (Salton & McGill, 1983, p. 228; Jing & Croft, 1994). Moreover, the technique simply detects associations between terms (e.g., synonyms and near synonyms, broader and narrower terms). Detecting the specific nature of these associations is usually beyond its scope.

3.2.2 Syntactic methods

The syntactic methods employ syntactic relations to determine the semantic closeness of terms. A typical approach is to construct a hierarchical thesaurus from a list of complex noun phrases of a text corpus by exploiting the head-modifier relationship of the noun phrases (Evans, Ginther-Webster, Hart, Lefferts, & Monarch, 1991). Here, the head is considered the more general term, which subsumes the more specific concept expressed by the phrase (e.g., “intelligence” subsumes “artificial intelligence”). Heads and modifiers are the smallest possible contexts of terms. Another example of constructing a thesaurus with syntactic information is to base a classification of nouns upon their being the subject of a certain class of verbs (Tokunaga, Iwayama, & Tanaka, 1995). A better selection of syntactically associated terms can be obtained by combining the syntactic approach with statistical characteristics, such as the frequency of the associations (Ruge, 1991).
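The head-modifier idea can be sketched in a few lines. The sketch makes a strong simplifying assumption, namely that the phrases are English noun phrases whose head is the last word, so no parser is involved; the phrase list and the function name `build_hierarchy` are hypothetical.

```python
from collections import defaultdict

# Hypothetical noun phrases, assumed to have been extracted from a corpus.
PHRASES = [
    "artificial intelligence",
    "machine intelligence",
    "information retrieval",
    "text retrieval",
    "intelligence",
]


def build_hierarchy(phrases):
    """Map each head noun to the more specific phrases it subsumes.

    Assumes the head of a phrase is its last word; single-word phrases
    have no modifier and therefore add no narrower term.
    """
    hierarchy = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) < 2:
            continue
        head = words[-1]
        hierarchy[head].add(" ".join(words))
    return {head: sorted(narrower) for head, narrower in hierarchy.items()}


for head, narrower_terms in build_hierarchy(PHRASES).items():
    print(head, "->", narrower_terms)
```

A frequency count per head-modifier pair could be added to this structure to filter out rare associations, in line with the combination of syntactic and statistical information mentioned above.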
