TEXT REPRESENTATIONS AND THEIR USE
7. CHARACTERISTICS OF GOOD TEXT REPRESENTATIONS
The ultimate aim of indexing and abstracting is to increase recall, the proportion of relevant documents that are viewed or retrieved in a browsing and retrieval system respectively, and to increase precision, the proportion of viewed or retrieved documents that are relevant (cf. Salton, 1989, p. 277 ff.).
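As a minimal illustration of these two measures, the following Python sketch computes recall and precision from a set of retrieved documents and a set of relevant documents. The function name and the document identifiers are hypothetical and serve only as an example.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for two collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    # Recall: share of the relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Precision: share of the retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: four documents retrieved, three documents relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
recall, precision = recall_precision(retrieved, relevant)
```

Here two of the three relevant documents are retrieved, and two of the four retrieved documents are relevant.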
A high recall in a question-answering system refers to a high proportion of correct answers given the available answers, while a high precision concerns a high proportion of correct answers among the answers returned (Chinchor, 1992; Chinchor, Hirschman, & Lewis, 1993). A text representation that is the result of indexing and abstracting must have a number of characteristics in order to increase recall and precision of selected documents or information.
Depending upon the application, each characteristic has a varying degree of importance. Some of these characteristics can be described solely by referring to the original source text. Others are defined in relation to the other text representations in the document collection. The following outlines some important characteristics, some of which represent conflicting demands.
1. A major characteristic of the text representation is the ability to represent the aboutness or the topics of a document text (Maron, 1977; Hutchins, 1985). Topic identification is highly valued in browsing, retrieval, and filtering systems, especially when these systems operate in general settings (e.g., public libraries, Internet). Beyond aboutness, a text representation should be able to represent the potential meanings that a text has for its users (Hutchins, 1977; Salton & McGill, 1983, p. 54; Hutchins, 1985; Lancaster, 1991, p. 8; Fidel, 1994). This might be realized by a more detailed indexing or abstracting resulting in a representation of the subtopics and of specific information of the source text. This "user orientation" in indexing and abstracting allows a fine-grained selection of topical content. This property is highly valued in information retrieval systems that are used by specialists and experts (e.g., research libraries, databases of medical documents) and in question-answering systems.
2. In contrast to the foregoing, a text representation is often a reduction of the content of the original text. This reduction can be the result of a generalization or of a selection of the content. This characteristic is important when retrieving or filtering information from large document collections (Sparck Jones, 1991). When indexing descriptions or abstracts are used as text previews in browsing or navigation systems, this reductive character is also fundamental.
3. It is not enough for a text representation to be a good description of the content of the source text. It should also allow its content to be differentiated from the contents of other text representations (Lewis & Sparck Jones, 1996). This characteristic is especially useful in browsing and retrieval systems, when the text representation has to discriminate relevant documents from the many non-relevant ones. If the text representation reduces the content, it naturally reduces the difference with other text representations. Again, being discriminative and being reductive do not always go hand in hand.
4. When browsing large document collections or retrieving information from them, it is important to consult all relevant documents. In these collections, when similar text representations are grouped, texts can be efficiently retrieved or consulted with a high degree of recall (cf. the clustering retrieval model) (Lewis & Sparck Jones, 1996). In this case, text representations must contain content elements that allow grouping. This characteristic also conflicts with the foregoing requirement of being discriminative.
5. Finally, a text representation normalizes lexical and conceptual variations of the source text (Hutchins, 1975, p. 37 ff.). This characteristic is advantageous in information retrieval and filtering systems, and especially important in question-answering systems.
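The normalization mentioned in the last item can be sketched as follows. This is only an illustration under strong assumptions: the variant table is hand-made and hypothetical, and a real system would rely on a thesaurus, stemmer, or similar resource that the text does not specify.

```python
# Hypothetical table mapping lexical variants to one preferred term.
PREFERRED_TERM = {
    "car": "automobile",
    "cars": "automobile",
    "automobiles": "automobile",
    "physician": "doctor",
    "physicians": "doctor",
    "doctors": "doctor",
}


def normalize_terms(terms):
    """Map each term to its preferred form; unknown terms are only lowercased."""
    normalized = set()
    for term in terms:
        key = term.lower()
        normalized.add(PREFERRED_TERM.get(key, key))
    return normalized


# Two differently worded descriptions end up with the same representation.
rep_a = normalize_terms(["Cars", "physician"])
rep_b = normalize_terms(["automobile", "Doctors"])
```

After normalization both descriptions reduce to the same set of terms, so a query phrased with one variant can match a text that uses another.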
Text representations themselves are judged by the criteria of exhaustivity, specificity, correctness, and consistency (Salton & McGill, 1983, p. 55; Lancaster & Warner, 1993, p. 81 ff.; Soergel, 1994).
1. Exhaustivity refers to the degree to which all the concepts and notions included in the text are recognized in its description, including the central topics and the ones treated only briefly.
2. Specificity refers to the degree of generalization of the representation.
3. Correctness is important. Indexing and abstracting are susceptible to two kinds of errors: errors of omission and errors of commission. The former refers to a content description that should be assigned, but is omitted. The latter refers to a content description that should not be assigned, but is nevertheless attributed. Omitting a correct description and assigning a broader, narrower, or related description is a special kind of error that is at once an error of omission and commission. Correctness compares the actual text representation with the ideal one.
4. Consistency compares representations that are made of the same source text in different contexts (e.g., generated by different techniques).
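The notions of correctness and consistency can be made concrete with a small Python sketch. It assumes that a representation is a set of descriptors and that an ideal representation is available for comparison. The overlap ratio used for consistency (shared descriptors divided by all descriptors) is one common choice and an assumption here, not a measure prescribed by the text. All names and descriptors are hypothetical.

```python
def indexing_errors(assigned, ideal):
    """Return (omissions, commissions) of an assigned representation."""
    assigned, ideal = set(assigned), set(ideal)
    omissions = ideal - assigned    # should have been assigned, but were not
    commissions = assigned - ideal  # were assigned, but should not have been
    return omissions, commissions


def consistency(rep_a, rep_b):
    """Overlap of two representations of the same text (assumed measure)."""
    a, b = set(rep_a), set(rep_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical descriptors for one source text.
ideal = {"information retrieval", "indexing", "abstracting"}
assigned = {"information retrieval", "indexing", "text processing"}
omissions, commissions = indexing_errors(assigned, ideal)

# A second representation of the same text, e.g., from another technique.
other = {"information retrieval", "indexing", "summarization"}
score = consistency(assigned, other)
```

In this example the broader descriptor "text processing" replaces "abstracting", so the substitution is counted once as an omission and once as a commission, as described in item 3.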
When evaluating automatic indexing and abstracting, exhaustivity and specificity are difficult to quantify. Current evaluation emphasizes correctness and consistency. Automatic text indexing and summarization are usually seen as natural language processing tasks. The criteria applied in performance evaluation of such tasks normally fall under two major heads, intrinsic and extrinsic (Sparck Jones & Galliers, 1996, p. 19 ff.). Intrinsic criteria are those relating to a system's objective; extrinsic criteria are those bearing upon its function, i.e., its role in relation to its setup's purpose. It often depends upon the type of text representation whether the evaluation is intrinsic or extrinsic. For instance, the value of extracted natural language index terms is usually measured by computing the recall and precision of the retrieval of texts based on representations that contain the terms, which is an extrinsic evaluation. On the other hand, controlled language subject and classification codes are judged by measuring the recall and precision of the assigned terms as compared to their manual assignment by experts, which is an intrinsic evaluation. When discussing the methods of automatic indexing and abstracting in the next part, evaluation will be briefly described with each major approach. It is generally agreed that evaluation of text indexing and abstracting needs further research (cf. Hersh & Molnar, 1995).
The idea of an exhaustive, multi-functional text representation for managing document texts is appealing. It allows producing multiple views of the same text and consequently selecting specific information conforming to different needs (cf. Soergel, 1994; Lucarella & Zanzi, 1996; Frants, Shapiro, & Voiskunskii, 1997, p. 139 ff.). Additionally, when the content attributes have weighted values that reflect content importance, it allows zooming in and out of the informational detail of a text's content (cf. Fidel, 1994). At different levels of informational detail, one might discriminate text representations from others in the collection, or, if needed, group representations. Such an exhaustive text representation can combine different types of content representations (e.g., natural language and controlled language index terms, extracted words, phrases, and other informational units) (cf. Strzalkowski et al., 1997). New forms of text representations will certainly be tested in the future.
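The idea of zooming by means of weighted content attributes might be sketched as follows. The sketch assumes that a representation is a mapping from content attributes to importance weights between 0 and 1 and that a simple weight threshold selects the level of detail. The function name, the weights, and the thresholds are hypothetical.

```python
def zoom(weighted_terms, min_weight):
    """Return the attributes with weight >= min_weight, most important first."""
    selected = [(term, w) for term, w in weighted_terms.items() if w >= min_weight]
    # Sort by descending weight; ties are broken alphabetically.
    selected.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in selected]


# Hypothetical weighted representation of one text.
representation = {
    "text retrieval": 0.9,
    "indexing": 0.7,
    "evaluation": 0.4,
    "clustering": 0.2,
}

overview = zoom(representation, 0.6)  # zoomed out: main topics only
detail = zoom(representation, 0.2)    # zoomed in: subtopics included
```

A high threshold yields a coarse view that is suited to grouping similar texts, while a low threshold exposes the details that help to tell texts apart.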