THE ASSIGNMENT OF CONTROLLED LANGUAGE INDEX TERMS
4. SUBJECT AND CLASSIFICATION CODES
3.2.2 Syntactic methods
The syntactic methods employ syntactic relations to determine the semantic closeness of terms. A typical approach is to construct a hierarchical thesaurus from a list of complex noun phrases of a text corpus, exploiting the head-modifier relationship of the noun phrases (Evans, Ginther-Webster, Hart, Lefferts, & Monarch, 1991). Here, the head is considered the more general term, which subsumes the more specific concept expressed by the phrase (e.g., “intelligence” subsumes “artificial intelligence”). Heads and modifiers are the smallest possible contexts of terms. Another example of constructing a thesaurus with syntactic information is to base a classification of nouns upon their being the subject of a certain class of verbs (Tokunaga, Iwayama, & Tanaka, 1995). A better selection of syntactically associated terms can be obtained by combining the syntactic approach with statistical characteristics, such as the frequency of the associations (Ruge, 1991).
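The head-modifier construction might be sketched as follows, under the simplifying (and not always correct) assumption that the last word of an English noun phrase is its head; the sample phrases are invented for illustration:

```python
# Sketch of building a simple hierarchical thesaurus from complex noun
# phrases via the head-modifier relation: the head (taken here as the
# last word of the phrase) subsumes the more specific full phrase.
from collections import defaultdict

def build_hierarchy(noun_phrases):
    """Map each head term to the more specific phrases it subsumes."""
    hierarchy = defaultdict(set)
    for phrase in noun_phrases:
        words = phrase.lower().split()
        head = words[-1]                         # e.g. "intelligence"
        if len(words) > 1:
            hierarchy[head].add(" ".join(words)) # e.g. "artificial intelligence"
    return hierarchy

phrases = ["artificial intelligence", "machine intelligence", "intelligence"]
h = build_hierarchy(phrases)
# h["intelligence"] == {"artificial intelligence", "machine intelligence"}
```

A statistical refinement in the spirit of Ruge (1991) would additionally weight each head-phrase link by the frequency of the association in the corpus.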
4.2 Text Classifiers with Manually Implemented Classification Patterns
A knowledge base is an abstract representation of a topic area, or a particular environment, including the main concepts of interest in that area and the various relationships between the entities. The construction of the knowledge base containing the patterns, concepts, and categorization rules is done by a knowledge engineer after careful analysis of texts in an example text base that is manually classified by experts (Sparck Jones, 1991). The classification patterns are assumed to carry over to new texts. A knowledge representation language or formalism is required that allows describing the domain of interest, expressing entities, properties, and relations (Edwards, 1991, p. 60 ff.).
1. The most common form for representing the text patterns and their relationships with the subject and classification concepts is by using production or decision rules. The condition usually involves single cue words, word stems, or phrases that are logically combined using propositional or first-order logic. A rule has the form:
IF <condition is true>
THEN <assign category>
2. Occasionally, frames are used to represent the attributes of a particular object or concept in a more richly descriptive way than is possible using rules. The frame typically consists of a number of slots, each of which contains a value (or is left blank). The number and type of slots will be chosen according to the particular knowledge to be represented. A slot may contain a reference to another frame. Frames offer further advantages, including the provision of a default value for a particular slot in all frames of a certain type, and the use of more complex methods for “inheriting” values and properties between frames. When frames have mutual relationships, a semantic net of frames can represent them.
Frames allow combining sets of related words with simple syntactic templates or with specifications that certain words occur within the same sentence, paragraph, or other context. They also allow representing semantic structures such as verbs describing classes of events.
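Both representations can be sketched in a few lines; the cue words, categories, and slot names below are invented for illustration:

```python
# Sketch of the two representations above.

# 1. Production rules: IF <condition is true> THEN <assign category>.
#    Conditions are boolean functions over the text's cue words.
rules = [
    (lambda words: "interest" in words and "rate" in words, "MONEY-MARKETS"),
    (lambda words: "wheat" in words or "grain" in words, "COMMODITIES"),
]

def categorize(text):
    words = set(text.lower().split())
    return [cat for cond, cat in rules if cond(words)]

# 2. Frames: named slots with default values and inheritance between frames.
class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent, self.slots = name, parent, slots

    def get(self, slot):
        """Look up a slot value, falling back on the parent frame."""
        if slot in self.slots:
            return self.slots[slot]
        return self.parent.get(slot) if self.parent else None

event = Frame("event", location="unspecified")            # default slot value
takeover = Frame("takeover-event", parent=event, agent="company")

categorize("The central bank raised the interest rate")   # -> ["MONEY-MARKETS"]
takeover.get("location")                                  # -> "unspecified"
```

The parent link between the two frames is a minimal instance of the semantic net of frames mentioned above.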
The actual classification process simulates text skimming for the cue patterns defined in the rule or frame base, possibly accompanied by an assessment of their attribute values and followed by an evaluation of the logical constraints imposed on them. The document text is only partially parsed in order to detect the patterns, and the parsing is often restricted to a pattern-matching procedure.
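Such a pattern-matching skim might be sketched as follows; the regular-expression cues, the sentence-level co-occurrence constraint, and the category name are all invented for illustration:

```python
# Sketch of partial parsing as pattern matching: the text is skimmed
# for cue patterns (here regular expressions) that must co-occur in
# the same sentence; no full parse of the document is built.
import re

pattern = {"cues": [r"\bmerg\w*", r"\bcompan(y|ies)\b"], "category": "MERGERS"}

def skim(text, pattern):
    """Assign the category if every cue matches within one sentence."""
    for sentence in re.split(r"[.!?]", text):
        if all(re.search(c, sentence, re.I) for c in pattern["cues"]):
            return pattern["category"]
    return None

skim("The two companies announced a merger. Profits rose.", pattern)
# -> "MERGERS"
```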
Knowledge bases have proven successful for classifying documents in office environments (Chang & Leung, 1987; Eirund & Kreplin, 1988; Pozzi & Celentano, 1993; Hoch, 1994). The approach has also performed well in broader subject domains, such as the categorization of news stories (McCune, Tong, Dean, & Shapiro, 1985; Young & Hayes, 1985; Riloff & Lehnert, 1994; Jacobs, 1993; Gilardoni, Prunotto, & Rocca, 1994). The well-known CONSTRUE/TIS system (Hayes & Weinstein, 1991; Hayes, 1992) classifies a stream of Reuters economic and financial news stories into about 674 categories, with precision and recall rates of the assigned subject codes, compared to expert assignments, in the 90% range.
Knowledge bases that describe the classification patterns and their relation with a subject or classification code have been successfully applied in text categorization. The results approximate human indexing, showing that surface text features can be identified that successfully discriminate the subject and classification codes linked to a text. The knowledge representation is primarily controlled by semantic knowledge that often characterizes only a particular domain of discourse. When the number of patterns necessary to correctly categorize the texts of a document corpus is restricted, the construction and maintenance of a handcrafted knowledge base is a realistic task. In other circumstances, the machine learning methods discussed in the next sections provide an interesting alternative.
4.3 Text Classifiers that Learn Classification Patterns
Training a text classifier involves the construction of a classification procedure from a set of example texts for which the true classes are known.
This form of learning is called pattern recognition, discrimination, or supervised learning (in order to distinguish it from unsupervised learning, in which the classes are inferred from the data). The general approach is as follows. An expert, teacher, or supervisor assigns subject or classification codes to the example texts of the training set, which is also called the learning set or design set. It is assumed that this assignment is correct. Then, a classifier is constructed based on the training set. The aim is to detect general but accurate classification patterns and rules in the training set that reliably predict the correct classes of new, previously unseen texts. The set of new texts is called the test set. Because text classes are not mutually exclusive, it is convenient to learn a binary classifier for each class, rather than to formulate the problem as a single multi-class learning problem.
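The one-binary-classifier-per-class setup might be sketched as follows; the toy training texts, class labels, and the deliberately simple cue-word learner are invented for illustration:

```python
# Sketch of learning one binary classifier per class: for each class,
# the training texts are relabeled positive/negative, and a separate
# (trivially simple) classifier is trained from the relabeled set.

def train_binary(examples, target):
    """Collect cue words that occur only in the positive examples."""
    pos = set().union(*(set(t.split()) for t, c in examples if target in c))
    neg = set().union(*(set(t.split()) for t, c in examples if target not in c))
    return pos - neg

def predict(cues, text):
    return bool(cues & set(text.split()))

training = [("wheat harvest up", {"grain"}),
            ("gold price falls", {"metals"})]

classifiers = {cls: train_binary(training, cls) for cls in ("grain", "metals")}
assigned = [c for c, cues in classifiers.items() if predict(cues, "wheat exports")]
# -> ["grain"]
```

Because the binary classifiers are independent, a new text can receive zero, one, or several class labels, matching the non-exclusive nature of text classes.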
More specifically, the archetypal supervised classification problem is described as follows (Bishop, 1995, p. 1 ff.; Hand, 1997, p. 5 ff.). Each object is defined in terms of a vector of features (often numerical, but possibly nominal, such as color or the presence or absence of a characteristic).
x = (x1, x2, ..., xn) (8)
where
xj = the value that feature j takes for object x, j = 1..n (n = number of features measured).
The features together span a multivariate space termed the measurement space or feature space. For each object of the training set, we know both the feature vector and the true classes. The features of texts are commonly the words and phrases. Their number is very large, creating the need for effective feature selection and extraction when training text classifiers. A text classifier learns from a set of positive examples of the text class (texts relevant to the class) and possibly from a set of negative examples of the class (texts not relevant to the class). From the feature vectors of the examples, the classifier typically learns a classification function, a category weight vector, or a set of rules that correctly classifies the positive examples (and the negative examples) of the class. Each new text is likewise represented as a feature vector, to which the learned function, weight vector, or set of rules is applied to predict its class. Because there are usually many classes and only a few of them are assigned to a given example text, the number of negative examples in a training set exceeds the number of positive ones. Using negative relevance information is often a necessity when positive relevance data are lacking.
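As an illustration of equation (8) applied to text, a minimal bag-of-words feature extractor (corpus and vocabulary invented) might look like:

```python
# Sketch of representing texts as feature vectors over a word
# vocabulary: each component x_j is the frequency of word j in the text.

def build_vocabulary(texts):
    return sorted({w for t in texts for w in t.lower().split()})

def to_vector(text, vocab):
    words = text.lower().split()
    return [words.count(v) for v in vocab]   # x_j = frequency of feature j

corpus = ["stocks fell", "stocks rose sharply"]
vocab = build_vocabulary(corpus)             # ['fell', 'rose', 'sharply', 'stocks']
to_vector("stocks fell sharply", vocab)
# -> [1, 0, 1, 1]
```

Even this toy vocabulary hints at the dimensionality problem: over a realistic corpus the vocabulary runs to tens of thousands of features, hence the need for feature selection and extraction.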
Three broad groups of common training techniques can be distinguished for the pattern recognition problem (see Michie, Spiegelhalter, & Taylor,
1994): statistical approaches, learning of rules and trees, and neural networks. Another distinction can be made between parametric and non-parametric methods (Weiss & Kulikowski, 1991, p. 12 ff.). In the parametric training methods, the parameters are estimated from the training set by making an assumption about the mathematical functional form of the underlying population density distribution, such as a normal distribution.
Then, the pattern discriminant functions that are used to classify new texts are based on these estimates. Non-parametric training methods make no such assumption about the underlying parameters. Here, the classification
functions initially have unspecified coefficients, which are adjusted or set in such a way that the discriminant functions perform adequately on the training set. In work on learning in the artificial intelligence community, pattern recognition is often treated as one of search (Mitchell, 1977). The program is viewed as considering candidate functions or patterns from a search or hypothesis space, evaluating them in some fashion, and choosing one that meets a certain criterion. Searching a hypothesis space becomes especially explicit in methods that learn trees and rules.
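The parametric versus non-parametric contrast might be sketched on a single invented numerical feature: a classifier that assumes each class is normally distributed and estimates its mean and variance, against a 1-nearest-neighbour rule that makes no distributional assumption. All data below are invented.

```python
# Parametric: assume a normal density per class, estimate mu and
# variance from the training set, classify by highest density.
# Non-parametric: 1-nearest-neighbour, no distributional assumption.
import math

def gaussian_classify(x, train):
    best, best_p = None, -1.0
    for cls, values in train.items():
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values) or 1e-9
        p = math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        if p > best_p:
            best, best_p = cls, p
    return best

def nn_classify(x, train):
    return min(((abs(x - v), cls) for cls, vs in train.items() for v in vs))[1]

train = {"short": [1.0, 2.0, 3.0], "long": [10.0, 11.0, 12.0]}
gaussian_classify(2.5, train)   # -> "short"
nn_classify(9.0, train)         # -> "long"
```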
Besides the many algorithms developed for pattern recognition, text