THE ASSIGNMENT OF CONTROLLED LANGUAGE INDEX TERMS
5. LEARNING APPROACHES TO TEXT CATEGORIZATION
5.1 Feature Selection and Extraction
5.1.1 Feature selection
Feature selection aims at eliminating low-quality features and at producing a lower-dimensional feature space. The real need for feature selection arises in problems with a large number of features and relatively few samples of each class to be learned (Weiss & Kulikowski, 1991, p. 72 ff.), which is the case in text categorization. Feature selection is done manually by human experts or with automated tools, the latter usually being applied in text categorization.
A limited number of features is advantageous in classification (Weiss & Kulikowski, 1991, p. 72 ff.). A limited feature set benefits efficiency and decreases computational complexity. It reduces the number of observations to be recorded and the number of hypotheses to test in order to find an accurate classifier. More importantly, a small feature set decreases the danger of overfitting or overtraining. Overfitting means that the learned classifier fits the training set perfectly, but does not perform well when applied to new, previously unseen cases. The classifier fails to generalize sufficiently from the training data and is too specific to classify the new cases. A large number of features aggravates this effect. So, when using many features we need a corresponding increase in the number of samples to ensure a correct mapping between the features and the classes. This problem of too many features is known as the dimensionality problem (Bishop, 1995, p. 7 ff.; Hand D.J., 1997, p. 3 ff.).
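As a rough illustration of this dimensionality problem, consider the following minimal sketch (not taken from the literature cited above; it assumes NumPy and scikit-learn are available). A classifier trained on purely random data with far more features than samples fits the training set perfectly while performing at chance level on new cases:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test, n_features = 30, 300, 1000   # few samples, many features

# Random data: any regularity "learned" from the training set is spurious.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.integers(0, 2, size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, 2, size=n_test)

clf = LogisticRegression(C=1e6, max_iter=5000).fit(X_train, y_train)
print("training accuracy:", clf.score(X_train, y_train))   # close to 1.0
print("test accuracy:", clf.score(X_test, y_test))          # close to 0.5 (chance)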
Feature selection removes redundant and noisy features. Noise is defined as erroneous features in the description of the example (Quinlan, 1986) or as features that are no more predictive than chance (Weiss & Kulikowski, 1991, p. 11). Noisy features lead to overfitting and to poor accuracy of the classifier on new instances.
Feature selection is done before training, during training, and after classification of new, previously unseen objects. When done before training, it is usually the quality of an individual feature that is evaluated, and the feature is removed from the feature set after a negative evaluation. During training, some algorithms incorporate a feature selection process. This is especially true for algorithms that induce decision trees or rules from the sample data. They often include stepwise procedures, which incrementally add features, discard features, or both, evaluating the subset of features that would be produced by each change. Feature selection is done after classification of new objects by measuring the error rate of this classification; features are then removed from or added to the feature set when doing so results in a lower error rate on the test set. The choice of a feature selection technique is usually application-specific, and domain knowledge is considered important in the feature selection process (Nilsson, 1990, p. 4).
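Such a stepwise procedure can be sketched as a greedy forward search, where a feature is added only when it lowers the error rate measured on held-out data. The following is a schematic example; train_classifier and error_rate stand for whatever learner and evaluation procedure are used, and are not part of the original text:

def forward_select(candidate_features, train_classifier, error_rate):
    # Greedy forward selection: repeatedly add the single feature whose
    # addition most lowers the error rate on a held-out set, and stop
    # when no further addition improves it.
    selected = []
    best_error = float("inf")
    improved = True
    while improved and candidate_features:
        improved = False
        best_feature = None
        for feature in list(candidate_features):
            classifier = train_classifier(selected + [feature])
            error = error_rate(classifier, selected + [feature])
            if error < best_error:
                best_error, best_feature = error, feature
                improved = True
        if improved:
            selected.append(best_feature)
            candidate_features.remove(best_feature)
    return selected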
5.1.2 Feature extraction
Feature extraction, also called re-parameterization, creates new features by applying a set of operators upon the current features (Hand D.J., 1997, p. 151 ff.). Although a single feature can be replaced by a new feature, it more often occurs that a set of features is replaced by one feature or another set of features. Logical operators such as conjunction and disjunction can be used.
Operators such as the arithmetic mean, multiplication, linear combination, and threshold functions can be sensibly applied to many numeric features.
When a set of original features is thought to consist of redundant manifestations of the same underlying feature, replacing them with a single feature corresponding to their sum, disjunction, mean, or some other cumulative operation is a good approach. Often, operators that produce a linear transformation of the original features are used (e.g., factor analysis).
Operators can be specific to a particular application, and domain knowledge is considered important in a feature extraction process (Bishop, 1995, p. 6).
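A minimal sketch of such operators applied to bag-of-words counts follows; the words, the grouping, and the weights are purely illustrative assumptions:

import numpy as np

# Hypothetical term counts for one document.
features = {"car": 3, "automobile": 1, "vehicle": 2, "engine": 0}
group = ["car", "automobile", "vehicle"]          # assumed redundant manifestations
values = np.array([features[w] for w in group])

features["car_sum"] = int(values.sum())           # cumulative sum
features["car_any"] = int((values > 0).any())     # disjunction (boolean OR)
features["car_mean"] = float(values.mean())       # arithmetic mean
features["car_thresh"] = int(values.sum() >= 2)   # threshold function

# A linear combination with assumed weights (e.g., loadings from factor analysis).
weights = np.array([0.6, 0.3, 0.1])
features["car_factor"] = float(weights @ values)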
Feature extraction can be done before training, when the original features of each example object are transformed into more appropriate features. Feature extraction can also be part of training, such as the computation of a feature vector for each class from the feature values of the individual examples.
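The computation of one feature vector per class can, for instance, be sketched as a class centroid over the example vectors (a minimal sketch; X is assumed to be a documents-by-terms matrix and y the corresponding class labels):

import numpy as np

def class_centroids(X, y):
    # One feature vector per class: the mean of the vectors of the
    # example texts belonging to that class.
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}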
5.1.3 Feature selection in text categorization
The salient features of a text in a classification task are its words and phrases. The training of a text classifier is often preceded by an initial elimination of a number of irrelevant features. Individual features are judged and possibly removed. A feature can be eliminated with respect to its overall relevance in determining a text’s content, or with respect to its value in determining a particular category.
As seen in chapter 4, the words and phrases of a text do not contribute equally to its content. So, similarly to the process of extracting content terms from texts, a number of text features can be selected from a text that are supposed to reflect its content. The techniques include the elimination of stopwords and the weighting of words and phrases according to their distribution characteristics, such as frequency of occurrence (e.g., Cohen, 1995), position on a Zipf curve (e.g., Sahami, Hearst, & Saund, 1996), or fit to a Poisson distribution (e.g., Ng, Loewenstern, Basu, Hirsh, & Kantor, 1997), followed by the removal of words with a low weight. Aggressive removal of words with a domain-specific stopword list is sometimes used (e.g., Yang, 1995; Yang & Wilbur, 1996). The opposite is using a list of valid feature terms from a domain-specific dictionary. Knowledge of the discourse structure and of the value of certain text positions or passages for feature selection is considered important, especially in long texts (e.g., Maron, 1961; Borko & Bernick, 1963; Fuhr, 1989; Jacobs, 1993; Apté, Damerau, & Weiss, 1994; Yang, Chute, Atkin, & Anda, 1995; Thompson, Turtle, Yang, & Flood, 1995; Brüninghaus & Ashley, 1997; Leung & Kan, 1997).
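An initial, class-independent selection step of this kind might look as follows (a minimal sketch; the stopword list and the frequency threshold are illustrative assumptions, not values from the studies cited above):

from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}   # assumed list
MIN_FREQ = 2                                                    # assumed threshold

def select_features(tokenized_texts):
    # Drop stopwords, then drop terms whose collection frequency is low.
    counts = Counter(token
                     for text in tokenized_texts
                     for token in text
                     if token not in STOPWORDS)
    return {term for term, freq in counts.items() if freq >= MIN_FREQ}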
Feature selection with respect to the relevance in determining a text's content involves a reduction of the dimensionality of the global feature space. Such an initial selection is often not sufficient as a feature selection technique, given the large number of text features. So, there are a number of useful techniques that select the features per class to be learned and that take into account the distribution of a feature in example texts that are relevant or non-relevant for the subject or classification code. In general, a good feature is a single word or phrase that has a statistical relationship with a class, i.e., a high proportion of occurrences within that particular class and a low proportion of occurrences in the other classes. Many feature selection techniques for training text classifiers are based upon this assumption. The techniques compute for each class the relevance score of a text feature, i.e., the strength of the association between the class concept and the feature, and eliminate features with a low score. Many of the scoring functions originated in relevance feedback research, but have been used in text categorization (e.g., Maron, 1961; Field, 1975; Hamill & Zamora, 1980; Voorhees & Harman, 1997). A feature is typically ranked by the difference in its relative occurrence in relevant and non-relevant texts for the subject or classification code (Allan et al., 1997) or by the difference in its mean weights in relevant and non-relevant texts for the subject or classification code (Brookes, 1968; Robertson, Walker, Beaulieu, Gatford, & Payne, 1996; cf. Rocchio algorithm below). Finally, there are techniques that assume a probability distribution of the feature in the example set and employ the deviations from this distribution in feature selection. Leung and Kan (1997) use the deviation of the value of the feature from its mean value in the example set, normalized by the standard deviation (z-score). The χ² (chi-square) test measures the fit between the observed frequencies of the features in texts of the example set and their expected frequencies (i.e., under the assumption that the terms occur with equal frequencies in texts that are relevant for the class and in texts that are non-relevant for the class) and identifies terms that are strongly related to the text class (Cooper, Chen, & Gey, 1995; Schütze, Pedersen, & Hearst, 1995; Schütze, Hull, & Pedersen, 1995; Hull et al., 1997). In still another technique, a binomial probability distribution is used to compute the probability that a text feature occurs in the texts relevant for the subject or classification code purely by chance, and to relate a low probability to a high descriptive power for the text class (Yochum, 1995; cf. Dunning, 1993).
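The chi-square score of a term for a class can be sketched from a 2 x 2 contingency table of term presence versus class membership (a minimal sketch of the standard formula, without continuity correction; it is not taken verbatim from the studies cited above):

def chi_square_score(n_term_rel, n_term_nonrel, n_noterm_rel, n_noterm_nonrel):
    # a: relevant texts containing the term, b: non-relevant texts containing it,
    # c: relevant texts without the term,   d: non-relevant texts without it.
    a, b, c, d = n_term_rel, n_term_nonrel, n_noterm_rel, n_noterm_nonrel
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

# Terms with a score below some chosen threshold would be removed from the
# feature set of that class.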
Although feature selection is an absolute necessity in text categorization, caution must be taken in removing text features. For some text classes, words that seem to have low overall content-bearing value can be an important category indicator, especially in combination with other terms (Riloff, 1995; Jacobs, 1993; cf. Hand D.J., 1997, p. 150).
5.1.4 Feature extraction in text categorization
There are a number of feature extraction techniques that can be employed in text categorization. The process of stemming (see chapter 4) reduces a number of text features to one single term or feature (e.g., Schütze, Hull, & Pedersen, 1995). The phrase formation process is sometimes seen as feature extraction. A phrase groups original single words of a text that are statistically and/or syntactically related (e.g., Finch, 1995). Weighting of the original text features aims at increasing the predictive value of the features. Weighting includes the traditional weighting schemes for content identification (e.g., term frequency (tf), inverse document frequency (idf), and tf × idf) (e.g., Yang & Chute, 1994) and the relevance scoring functions that determine the weight of a term for a text class (see above). The use of thesauri also transforms the original text features into more uniform and more general concepts (e.g., Blosseville, Hébrail, Monteil, & Penot, 1992), whereby groups of semantically related words can be built automatically (e.g., Baker & McCallum, 1998). Latent Semantic Indexing (LSI) replaces the text features (usually words) of a document set by their lower-dimensional linear combinations. This is done by singular value decomposition of the feature-by-document matrix (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; use of LSI in text categorization: Hull, 1994; Dumais, 1995; Schütze, Hull, & Pedersen, 1995).
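The LSI step can be sketched with a truncated singular value decomposition applied to the (transposed) feature-by-document matrix (a minimal sketch assuming scikit-learn; the toy documents and the number of latent dimensions are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the car engine failed",
        "automobile engines and repairs",
        "stock markets fell sharply",
        "shares and stock prices dropped"]

X = TfidfVectorizer().fit_transform(docs)           # documents x terms, tf-idf weighted
lsi = TruncatedSVD(n_components=2, random_state=0)  # keep 2 latent dimensions
X_reduced = lsi.fit_transform(X)                    # documents x 2 feature matrix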
5.1.5 A note about cross validation
Another technique to overcome the overfitting problem is cross validation (Henery, 1994), wherein the parameters of the model are updated based on the error detected with a validation set. A part of the training set (e.g., two thirds) is used for training the classifier, while the remainder is used as the validation set. During training, errors in classifying the validation set help in selecting features or in determining when overfitting has occurred. The latter refers to training procedures that iterate to find a good classification rule (e.g., training of a neural network). At each iteration, the parameters of the model are updated and the error is computed upon the validation set. Training continues until this error increases, which indicates that overfitting has set in (cf. Schütze, Hull, & Pedersen, 1995).
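This iterative procedure can be sketched as follows (a schematic example; update_model and validation_error are hypothetical stand-ins for one training iteration and for the error computed on the held-out validation set):

def train_with_validation(model, update_model, validation_error, max_iterations=1000):
    # Iterate while the error on the validation set keeps decreasing;
    # an increase signals that overfitting has set in.
    best_error = float("inf")
    for _ in range(max_iterations):
        update_model(model)                 # one training iteration (e.g., one epoch)
        error = validation_error(model)     # error on the held-out validation set
        if error > best_error:
            break
        best_error = error
    return model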