THE SELECTION OF NATURAL LANGUAGE INDEX TERMS
6. THE SELECTION OF PHRASES
(Salton et al., 1990; Croft et al., 1991). The proximity of phrase components can be defined by the number of intervening words or by their occurrence in the same sentence, paragraph, or whole text (Salton & McGill, 1983, p. 84 ff.). When, for a given candidate phrase, the values of the above parameters fall within threshold values (set after experiments with the text collection), the phrase is selected as an index term.
Occurrence frequency and proximity parameters do not always yield correct and meaningful phrases. Two or more words may co-occur for reasons other than being part of the same phrasal concept. It is therefore not surprising that Fagan (1989) found that the use of statistical phrases did not significantly increase retrieval performance.
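The statistical selection described above can be sketched in a few lines. The following is a toy illustration; the proximity window and the frequency thresholds are illustrative values, not values taken from the cited studies.

```python
from collections import Counter

def statistical_phrases(tokens, window=3, min_pair_freq=2, min_word_freq=2):
    """Select candidate two-word phrases by the occurrence frequency of the
    components and their proximity (number of intervening words).
    Thresholds are illustrative and would be set experimentally."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, word in enumerate(tokens):
        # Count word pairs whose components lie within the proximity window.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair_freq[(word, tokens[j])] += 1
    return [pair for pair, f in pair_freq.items()
            if f >= min_pair_freq
            and word_freq[pair[0]] >= min_word_freq
            and word_freq[pair[1]] >= min_word_freq]
```

As the text notes, such a procedure will also select word pairs that co-occur for reasons other than forming a phrasal concept.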
6.2 Syntactic Phrases
A syntactic phrase may be selected by its occurrence frequency, the co-occurrence of its components, and/or the proximity of its components in the text, but there is always a syntactic relationship between the phrase components (Salton & McGill, 1983, p. 90 ff.; Croft et al., 1991;
Strzalkowski, 1994; Strzalkowski et al., 1997). A syntactic phrase is a grammatical part of a sentence and is, at least in part, identified based upon linguistic criteria. The use of syntactic phrases rests on the assumption that words of a text that have a syntactic relationship often have a correlated semantic relationship (Smeaton & Sheridan, 1991). Syntactic phrase recognition has been popular for decades (for an overview see Schwarz, 1990).
In the following we describe the main recognition methods.
The simplest method uses a machine-readable dictionary or thesaurus that contains pre-coded phrasal terms according to various syntactic formats (cf. Evans, Ginther-Webster, Hart, Lefferts, & Monarch, 1991). Such dictionaries should encompass the many ways in which individual words can be combined to express the same concept, so their use is only practical in restricted subject domains.
A more realistic, but language-dependent, method is based on the idea that content-bearing phrases belong to certain grammatical classes or combinations of classes. The method has two steps: identification of the classes (parts of speech) of the words of the text and recognition of combinations of word classes in the text.
Word classes are defined by using a machine-readable dictionary of words with their classes or by using a stochastic tagger. A stochastic tagger (Dermatas & Kokkinakis, 1995) assigns part-of-speech tags to the words of a text based on the probability that a tag should be assigned to a word. This probability is computed by taking into account the probability of a part-of-speech tag for the specific word and the probability that a specific tag is appropriate in the particular context. The lexical and contextual probabilities are obtained by observing statistical regularities in example texts that are manually tagged with part-of-speech mark-ups.
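The combination of lexical and contextual probabilities can be made concrete by scoring one candidate tag assignment. The sketch below uses hypothetical probability tables; in a real tagger these would be estimated from the manually tagged example texts, and the best assignment would be searched efficiently (e.g., with the Viterbi algorithm).

```python
def tag_sequence_probability(words, tags, lexical_p, transition_p):
    """Score a candidate tag assignment as the product of lexical
    probabilities P(word | tag) and contextual bigram probabilities
    P(tag | previous tag), as in a simple stochastic tagger."""
    p = 1.0
    prev = "<s>"  # sentence-start marker
    for word, tag in zip(words, tags):
        p *= lexical_p.get((word, tag), 0.0) * transition_p.get((prev, tag), 0.0)
        prev = tag
    return p

# Hypothetical tables: "time flies" is more plausibly NOUN VERB than NOUN NOUN.
lex = {("time", "NOUN"): 0.9, ("time", "VERB"): 0.1,
       ("flies", "VERB"): 0.6, ("flies", "NOUN"): 0.4}
trans = {("<s>", "NOUN"): 0.6, ("<s>", "VERB"): 0.2,
         ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.3}
```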
There are two major ways to identify combinations of word classes in texts: the use of syntactic templates and parsing based on a context-free grammar.
The former refers to matching patterns of adjacent classes against a library of syntactic templates (an example of a template: an adjective followed by a noun) (Dillon & Gray, 1983; Fuhr & Knorz, 1984).
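Template matching can be sketched as follows; the tag set and the two templates below are illustrative.

```python
def template_phrases(tagged, templates=(("ADJ", "NOUN"), ("NOUN", "NOUN"))):
    """Match patterns of adjacent word classes against a small library of
    syntactic templates (e.g., an adjective followed by a noun).
    `tagged` is a list of (word, part-of-speech tag) pairs."""
    phrases = []
    for i in range(len(tagged)):
        for template in templates:
            if i + len(template) <= len(tagged):
                window = tagged[i:i + len(template)]
                if tuple(tag for _, tag in window) == template:
                    phrases.append(" ".join(word for word, _ in window))
    return phrases
```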
In the latter, a context-free grammar, which contains the rules of the allowable syntax of the sentences, is used to obtain for each sentence a parse that shows its syntactic structure (see chapter 6) (Salton, 1968, p. 151 ff.;
Metzler & Haas, 1989; Salton et al., 1990; Schwarz, 1990; Smeaton &
Sheridan, 1991). The result of the parsing is captured in the formalism of a dependency tree, which reflects the logical predicate-argument structure of a sentence. The tree indicates dependencies between the phrase components of the sentence (e.g., head and modifier of a phrase). In this way, differences in meaning between phrases, such as “college junior” and “junior college”, are detected. Simple phrase structure grammars can be used to recognize many types of noun phrases and prepositional phrases that might constitute useful text identifiers. These simple grammars cannot account for all phrase structures and must be complemented with semantic knowledge in case of ambiguous syntactic structures (e.g., in the phrase “increasingly dangerous misadventures and accidents”, the “accidents” may or may not be “increasingly dangerous”) (Lewis, Croft, & Bhandaru, 1989). However, these problems do not prevent noun phrase recognition algorithms from currently operating with low error rates.
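A simple phrase structure grammar of the kind mentioned above can be illustrated with a single noun phrase rule, NP → (DET)? (ADJ)* NOUN+, applied over a tagged sentence. This toy chunker does not build the dependency tree (head/modifier) distinctions that a full parse provides; the tag set is illustrative.

```python
def noun_phrase_chunks(tagged):
    """Recognize simple noun phrases with the rule NP -> (DET)? (ADJ)* NOUN+
    over a list of (word, tag) pairs; a sketch of a simple phrase
    structure grammar, not a full parser."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DET":
            j += 1
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun: a complete noun phrase
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks
```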
Usually, a number of phrases are selected based upon their combination of grammatical classes, phrase frequency, and phrase weight (see below) (cf.
Salton et al., 1990).
It must be noted that a compound noun in Dutch generally concatenates two (or more) words into a single orthographic word. In the case of compound nouns that were not split during a stemming procedure (see above), single Dutch words sometimes express very specific indexing concepts (e.g., “onroerendgoedmarkt” (“market of real estate”)).
Compared to single-term indexing, Fagan (1989) found that syntactic phrase recognition only very slightly improved retrieval performance (cf.
Strzalkowski, Ling, & Perez-Carballo, 1998). A disadvantage of syntactic methods is their high demand for computing power, storage space, and program availability.
Part of the discouraging effect of the use of phrases in text retrieval stems from the fact that they must be normalized to a standard form and effectively selected. Normalization is discussed in the next section. The weighting of phrases for content representation is discussed further in this chapter. The solutions proposed primarily relate to noun phrases, because noun phrases are the phrases most often selected from a text.
6.3 Normalization of Phrases
Indexing the text by considering phrases assumes that phrases refer to meaningful concepts. When in a retrieval environment a phrase appears in both query and document text, the two may refer to the same concept. This approach is limited by the fact that the phrase must appear in the same form in the document text and the query in order for the concept to be matched (Lewis et al., 1989; Smeaton, 1992). However, this is rarely the case with phrasal terms. The same concept can be expressed using different syntactic structures (e.g., “a garden party” and “a party in the garden”), possibly combined with lexical variations in word use (e.g., “prenatal ultrasonic diagnosis” and “in utero sonographic diagnosis of the fetus”) or with morphological variants (e.g., “vibrating over wavelets” and “wavelet vibrations”). Phrases may also contain anaphors and ellipses. Correct mapping to a standard single phrase must take into account lexical, syntactic, and morphological variations and resolve anaphors and ellipses. In a retrieval environment, phrase normalization enhances the recall of a retrieval operation (Salton, 1986).
The following are important methods of phrase normalization.
1. A simple method is to use a machine-readable dictionary of phrase variants (e.g., Evans et al., 1991). Currently, such a dictionary is hand-built, which limits the method to restricted subject domains.
2. The omission of function words (e.g., prepositions, determiners, pronouns), possibly combined with neglecting the order of the remaining content words, forms another easy, but not always reliable, phrase normalization method (Dillon & Gray, 1983; Fagan, 1989).
3. A more secure method for recognizing syntactic variants is based on syntactic phrase recognition. It uses the output of a syntactic parse of a sentence and defines (meta)rules for equivalent phrases (Jacquemin &
Royauté 1994; Strzalkowski et al., 1997; Tzoukermann, Klavans, &
Jacquemin, 1997; cf. Sparck Jones & Tait, 1984). This approach may be combined with anaphoric resolution (see Grishman, 1986, p. 124 ff. and Lappin & Leass, 1994) and word stemming.
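Methods 2 and 3 above can be sketched as follows. The stoplist is illustrative, and the single (meta)rule shown stands in for the rule sets that real systems derive from full syntactic parses.

```python
import re

FUNCTION_WORDS = {"a", "an", "the", "in", "of", "on", "for"}  # illustrative stoplist

def normalize_simple(phrase):
    """Method 2: drop function words and neglect the order of the
    remaining content words (here: sort them), so that syntactic
    variants map to the same normalized key."""
    content = [w for w in phrase.lower().split() if w not in FUNCTION_WORDS]
    return " ".join(sorted(content))

def apply_metarule(phrase):
    """Method 3, reduced to one illustrative (meta)rule: rewrite
    'NOUN1 <prep> (the)? NOUN2' as the equivalent compound 'NOUN2 NOUN1'
    (e.g., 'party in the garden' -> 'garden party')."""
    m = re.fullmatch(r"(\w+) (?:in|of|for) (?:the )?(\w+)", phrase.lower())
    return f"{m.group(2)} {m.group(1)}" if m else phrase
```

As the text warns, the simple method conflates phrases that merely share content words, which is why the parse-based variant is considered more secure.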
6.4 Recognition of Proper Names
A special case of phrase recognition in texts is the selection of proper names or proper nouns (Rau, 1992; Jacobs, 1993; Mani & MacMillan, 1996;
Paik, Liddy, Yu, & McKenna, 1993; Strzalkowski et al., 1997). Indexing with important proper names is useful in many retrieval applications. Proper names include names of persons, companies, institutions, product brands, locations, and currencies. There are two major ways of recognizing them.
1. The application of a lexicon or machine-readable dictionary of names requires an existing database of names, provided externally (e.g., Hayes, 1994). Composing the database of names manually is only possible for applications with a narrow scope. The lexicon may provide name variants.
2. Because many proper names (e.g., companies) appear, disappear, or change, accurate identification requires recognizing new names. They are recognized by special rules that express the typical features of proper name phrases (e.g., capitalization) or the linguistic context (e.g., indicator words) in which the names ought to be found (Jacobs, 1993; Hayes,
1994; Cowie & Lehnert, 1996). Recognition is sometimes problematic (e.g., “van Otterloo & Coo”).
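A rule of the kind described in method 2 can be sketched with a single typical feature, capitalization. This is a toy illustration; real rule sets combine several features with indicator words from the linguistic context, and still struggle with cases like “van Otterloo & Coo”.

```python
def candidate_names(tokens):
    """Recognize candidate proper names as maximal runs of two or more
    capitalized tokens; a sketch of feature-based name recognition."""
    names, run = [], []
    for tok in tokens + [""]:  # empty sentinel flushes the final run
        if tok[:1].isupper():
            run.append(tok)
        else:
            if len(run) > 1:
                names.append(" ".join(run))
            run = []
    return names
```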
Proper name recognition tools must cope with the many variants that occur. Variation in names concerns: suffix words (e.g., “Inc”, “N.V.”), prefix words (e.g., personal titles), other optional words (e.g., “van”), alternate words (e.g., “Intl Business Machines” and “International Business Machines”), alternate names (e.g., “IBM” and “Big Blue”), forenames (e.g.,
“Gerald Thijs”, “G. Thijs”, and “Thijs”), punctuation (e.g., “Sensotec N.V.”
and “Sensotec NV”), case sensitivity (e.g., “SigmaDelta” and “Sigmadelta”), and hyphenation (e.g., “Sigma Delta”, “Sigma-Delta”, and “SigmaDelta”).
One way to resolve variants is by defining similarities between names based on shared letter sequences (n-grams) (cf. Pfeifer, Poersch, & Fuhr, 1996).
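An n-gram similarity of this kind can be sketched with the Dice coefficient over shared letter trigrams; the normalization steps (removing punctuation, hyphens, spaces, and case) address the variant types listed above, and the similarity threshold would be set per application.

```python
def letter_ngrams(name, n=3):
    """Letter n-grams of a name after removing punctuation, hyphens,
    spaces, and case differences."""
    s = name.lower().replace(".", "").replace("-", "").replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def name_similarity(a, b, n=3):
    """Dice coefficient over shared letter n-grams: one way to score
    whether two name strings are variants of each other."""
    ga, gb = letter_ngrams(a, n), letter_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```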
Another challenging problem is recognition of the semantic category of the proper names (e.g., identifying personal names, company names) (McDonald, 1996; Paik et al., 1993; Paik, Liddy, Yu, & McKenna, 1996).
The category of a proper name can be extracted from the machine-readable dictionary, if available. Alternatively, the category can be detected by applying context heuristics that are developed from analysis of contexts in an example corpus.
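The context-heuristic alternative can be sketched by counting indicator words around a name occurrence. The cue lists and the company name below are hypothetical; in a real system the cues would be developed from analysis of contexts in an example corpus.

```python
CATEGORY_CUES = {  # hypothetical indicator words per semantic category
    "person": {"mr", "mrs", "dr", "said"},
    "company": {"inc", "nv", "shares", "ceo"},
    "location": {"in", "near", "based"},
}

def name_category(context_words):
    """Assign a semantic category to a proper name by counting indicator
    words in its surrounding context; a sketch of context heuristics."""
    context = {w.lower().strip(".,") for w in context_words}
    scores = {cat: len(cues & context) for cat, cues in CATEGORY_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```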