TEXT REPRESENTATIONS AND THEIR USE
4. INTELLECTUAL INDEXING AND ABSTRACTING
4.1 Relation with Text Indexing
Text indexing and abstracting are closely related (Lancaster, 1991, p. 5 ff.; Sparck Jones, 1993; Sparck Jones & Galliers, 1996, p. 28). The abstractor writes a narrative description of the content of a document text, while the indexer describes its content by using one or several index terms.
But the many forms of abstracts make this distinction more and more blurred. A brief summary may serve as a complex structured index description, which provides access to the text collection, while a list of key terms may serve as a simple form of abstract. Many forms of text representations are intermediate forms of indexing descriptions and abstracts. Abstracts are supposed to be more exhaustive in representing content than an indexing description (Cleveland & Cleveland, 1990, p. 105; Stadnyk & Kass, 1992).
Indexing or abstracting involves three major steps (Lancaster, 1991, p. 8 ff.) (Figure 1). First, there is the conceptual analysis of a source text and the identification of its content (content analysis). Indexing as well as abstracting always reduces the content to its essentials and often involves selection and generalization of information, which form the second step of the process. Thirdly, there is the translation of the selected and generalized content into the language of the text representation, i.e., a particular vocabulary of index terms or a summary text. Content identification and selection of information are not always distinct steps.
4.2 Intellectual Indexing
There are many guidelines for intellectual indexing (Borko & Bernier, 1978; Rowley, 1988; Cleveland & Cleveland, 1990; Lancaster, 1991).
Content analysis
When indexing with terms extracted from or assigned to a text, the indexer usually does not perform a complete reading of the document text. A combination of reading and skimming is advocated. The parts to be carefully read are those likely to tell the most about the contents in the shortest period of time (e.g., summary, conclusions, abstract, opening paragraphs of sections, opening and closing sentences of paragraphs, illustrations, diagrams, tables and their captions). These salient sections are often cued by the schematic structure of the text. The rest of the text is usually skimmed to ensure that the more condensed parts give an accurate picture of what the text is about.
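This combination of careful reading and skimming might be sketched as follows; the function, its inputs, and the naive sentence split are illustrative choices, not a procedure from the cited guidelines:

```python
# A sketch of collecting the condensed parts an indexer reads carefully
# (title, abstract, opening and closing sentences of paragraphs). The
# function name and inputs are hypothetical.

def salient_passages(title, abstract, paragraphs):
    passages = [title, abstract]
    for para in paragraphs:
        # Crude split on periods; a real system would use a sentence tokenizer.
        sentences = [s.strip() for s in para.split(".") if s.strip()]
        if sentences:
            passages.append(sentences[0])       # opening sentence
            if len(sentences) > 1:
                passages.append(sentences[-1])  # closing sentence
    return passages
```

The rest of the text would then be skimmed only to confirm that these passages give an accurate picture of the whole.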
An important aspect of content identification is identifying the subjects of the text. Indexers have guidelines for the analysis of the subject content (the topics or aboutness) (Hutchins, 1985). Indexers must especially be aware of the linguistic cues that signal the thematic structure of a text on a micro as well as on a macro level (cf. chapter 2). On a macro level, the notion of topic appears to be related to the text paragraph that has the most links to other paragraphs, and a topic often appears in the first sentence of a paragraph. On a micro level, it is suggested that the theme-rheme articulations of sentences provide clues to the global topics of a text. A topic is also signaled by a noun phrase that appears numerous times as the subject of a sentence. It is also suggested that indexers first scan texts for particular words or phrases (e.g., “were killed” in the domain of terrorism) (Hutchins, 1985; Riloff & Lehnert, 1994). Then, as a second step, the reader sometimes needs to evaluate the context of the expression in case of semantic ambiguity (e.g., the context “soldiers were killed” is no longer consistent with the terrorist domain, since victims of terrorists must be civilians).
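The two-step cue scan described above can be sketched as follows; the cue phrase, the blocked context, and the 30-character window size are illustrative choices, not from the cited sources:

```python
# A sketch of the two-step cue scan: find occurrences of a cue phrase, then
# reject those whose nearby context contradicts the target domain.

def cue_matches(text, cue, blocked_contexts):
    hits = []
    start = 0
    while True:
        i = text.find(cue, start)
        if i == -1:
            break
        # Step two: inspect the surrounding context for disqualifying phrases.
        window = text[max(0, i - 30): i + len(cue) + 30]
        if not any(b in window for b in blocked_contexts):
            hits.append(i)
        start = i + 1
    return hits
```

For the terrorism example, `cue_matches(text, "were killed", ["soldiers were killed"])` keeps occurrences about civilian victims and discards those about soldiers.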
Selection and generalization
Once the topics of the text are identified, specific topics or information can be selected. The topics can be replaced by more general concepts.
Translation of content into index terms
In a next step, the identified content of the text is translated into a set of index terms. These index terms are natural language terms extracted from the text or controlled language terms selected from a classification scheme.
Indexers identify natural language terms in the document text when they feel that these accurately reflect the identified content. Presumably, they are influenced by the frequency with which a content word or phrase appears in the text, by the location of its appearance (e.g., in the title, in the summary, in captions to illustrations), and by its context (Lancaster, 1991, p. 221). Indexers are usually comfortable with such a practice, which can be carried out rapidly, decreasing the cost of indexing. But the guidelines are often insufficiently precise to govern the indexer's choice of appropriate subject terms from the text, so that even trained indexers become inconsistent in their selection of terms (Blair & Maron, 1985).
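The influence of frequency and location might be approximated by a simple scoring scheme; the bonus weights and the word-list inputs below are invented for illustration:

```python
from collections import Counter

# A sketch of scoring candidate natural language index terms: raw frequency
# in the body, boosted when a term also appears in a salient location.
# The bonus values are arbitrary, not from the cited guidelines.

def score_terms(body_words, title_words, summary_words,
                title_bonus=2.0, summary_bonus=1.0):
    scores = Counter()
    for w in body_words:
        scores[w] += 1.0            # frequency of appearance
    for w in set(title_words):
        scores[w] += title_bonus    # location: title
    for w in set(summary_words):
        scores[w] += summary_bonus  # location: summary
    return scores
```

The highest-scoring words or phrases would then be offered as candidate index terms.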
More frequently, indexers assign controlled language index terms to document texts. Beghtol (1986) has described this cognitive process. It first requires the design of a classification system of index terms or category labels that will be imposed upon the documents. The actual indexing process is the mapping of natural language surface expressions of the text into the appropriate classificatory notations or index terms according to the indexer's perception of the text's content. The concept expressed by the natural language expression must be sufficiently important. So, the indexer would assign an index term to a combination of words or phrases that tend to occur frequently in the document text (Lancaster, 1991, p. 225). This sounds simple, but the concepts expressed by the controlled language index terms often occur in many variant combinations of words and phrases with variant co-occurrence frequencies. For instance, if “AIDS” occurs 20 times in a journal article, the index term “AIDS” should almost certainly be assigned.
Suppose on the other hand, that “AIDS” occurs only twice in the document, but “human immunodeficiency virus” occurs a few times and “viral infection” occurs rather frequently. Then, the term “AIDS” could also be
assigned. Another example illustrates the importance of co-occurrence frequencies. If the words “heat”, “lake”, and “pollution” all occur a few times in a document, this might be enough to cause the terms “thermal pollution” and “water pollution”, to be assigned. But, “heat” and “lake”
without the appearance of “pollution” would have to occur together in a document many times before “thermal pollution” would be a good bet for assignment. It is interesting to note that indexers sometimes reason by appealing to the similarity of new and old instances of texts. So, when assigning controlled language index terms, they look for textual patterns that occur in texts previously classified by these labels and assign the terms when sufficient similarity between the old and the new texts is present (Hayes-Roth & Hayes-Roth, 1977).
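The pollution example above amounts to evidence rules with co-occurrence thresholds; a minimal sketch, with all thresholds invented for illustration:

```python
# A sketch of assigning controlled language index terms from co-occurrence
# evidence. Each rule names a controlled term, the text words that evidence
# it, and a minimum total occurrence count; the numbers are hypothetical.

def assign_controlled_terms(freq, rules):
    """freq maps text words to occurrence counts; a term is assigned when
    some rule's evidence words all occur and their counts reach the
    rule's threshold."""
    assigned = set()
    for term, evidence, min_total in rules:
        present = all(freq.get(w, 0) > 0 for w in evidence)
        total = sum(freq.get(w, 0) for w in evidence)
        if present and total >= min_total:
            assigned.add(term)
    return assigned

# Rules mirroring the example: "heat" and "lake" alone must co-occur far
# more often than when "pollution" is also present.
RULES = [
    ("thermal pollution", ("heat", "pollution"), 4),
    ("thermal pollution", ("heat", "lake"), 20),
    ("water pollution", ("lake", "pollution"), 4),
]
```

With a few occurrences each of “heat”, “lake”, and “pollution”, both terms fire; without “pollution”, the much higher threshold on the second rule blocks assignment.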
Indexers may attribute a weight to the natural and controlled language index terms based upon their judgement of term importance.
4.3 Intellectual Abstracting
Because the ability to summarize information is a necessary part of text understanding and text production, the work of Kintsch and van Dijk regarding text comprehension and production is important to unravel the intellectual process of abstracting (Kintsch & van Dijk, 1978; van Dijk &
Kintsch, 1983). Many models and guidelines for intellectual abstracting exist (Borko & Bernier, 1975; Hutchins, 1987; Rowley, 1988; Lancaster, 1991;
Pinto Molina, 1995; Cremmins, 1996; Endres-Niggemeyer & Neugebauer, 1998). Some of them are based upon the findings of Kintsch and van Dijk.
Figure 1. Intellectual indexing and abstracting.
Content analysis
Content identification for abstracting is very similar to the intellectual process of indexing. The professional abstractor learns to skim a text to identify the salient points quickly, followed by a more detailed reading of some key sections. The schematic structure of a text hints at the salient sections.
The guidelines for making summaries often refer to specific text types and their superstructure. A content analysis for abstracting goes into more informational detail than indexing with terms. But this, of course, also depends upon the type of abstract that is to be realized.
Selection and generalization
The Kintsch and van Dijk model of text comprehension (Kintsch & van Dijk, 1978; van Dijk & Kintsch, 1983) emphasizes the significance of the thematic structure when selecting topical information, and stresses the importance of generalizing a text’s content. In this model the topics of a text are derived by applying different rules. The first regards deletion of unnecessary and irrelevant information (e.g., detailed descriptions, background information, redundant information, and common knowledge).
The second bears upon selection by extracting the necessary and relevant information (e.g., information in key sections, thematic sentence selection).
The selected topic segments then are stated in the form of propositions. The third rule of their model of summarization regards generalization and defines the construction of general propositions from the more specific ones. For instance, from the propositions that describe girls playing with dolls and boys playing with train sets, a description is derived of children playing with toys. A fourth rule, which is necessary in narrative texts, replaces sequences of propositions by single propositions expressing self-contained events.
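The generalization rule can be sketched with a small concept hierarchy; the PARENT table and the (agent, action, object) proposition format are hypothetical, chosen only to reproduce the dolls-and-train-sets example:

```python
# A sketch of the Kintsch and van Dijk generalization rule: propositions
# over specific concepts are replaced by a single proposition over their
# common parents in a concept hierarchy.

PARENT = {"girls": "children", "boys": "children",
          "dolls": "toys", "train sets": "toys"}

def generalize(propositions):
    seen, result = set(), []
    for agent, action, obj in propositions:
        general = (PARENT.get(agent, agent), action, PARENT.get(obj, obj))
        if general not in seen:       # merge propositions that generalize alike
            seen.add(general)
            result.append(general)
    return result
```

Here the two specific propositions collapse into the single general proposition that children play with toys.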
When summarizing the topics of a text, it is important to retain the topic emphases of the original and to make clear a distinction between the major and minor topics.
Summary production
Professional abstracting involves translating the selected and generalized content into a coherent and clear summary. This step is absent when the summary consists of phrases, sentences, or other textual units extracted from the original text.
The major concern is the brevity and readability of the summary (Rowley, 1988, p. 25 ff.; Lancaster, 1991, p. 97 ff.). Usually, abstractors make a draft that is revised and improved with the help of checklists. However, a complete reformulation of the selected information is not always desired, because of the danger of distorting the meaning of the original text (Endres-Niggemeyer, 1989). When the full text of the abstract is used as a document surrogate in search engines, another concern is the searchability of the abstract. For instance, it is advised that it contain many unambiguous content terms and their synonyms (Rowley, 1988, p. 31; Lancaster & Warner, 1993, p. 88).
There are guidelines for the length of an abstract. When the abstract is a coherent text, its length is defined by different factors. The most important one is the amount of informational detail of the content of the source that will be provided by the abstract. A second factor is the length of the original text. When the abstract is a balanced picture of the most important content of the text, an ideal length is 10% to 15% of the original (Edmundson, 1964; Borko & Bernier, 1975, p. 69; Tombros & Sanderson, 1998), or 20% to 30% of the original when more informational details are needed (Brandow, Mitze, & Rau, 1995). On the other hand, when the abstract only highlights specific information, the abstract may be very brief. Sometimes, a more or less fixed length is imposed, such as a minimum and maximum upon the number of sentences (Edmundson, 1969; Paice, 1981; Brandow et al., 1995; Tombros & Sanderson, 1998), of words (Lancaster, 1991, p. 101), or of paragraphs contained in the summary (Lancaster, 1991, p. 101). Finally, the length of the abstract is determined by its intellectual accessibility. Some texts might be more compactly condensed than others while leaving the comprehensibility of the abstract undisturbed.
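These length guidelines can be summarized in a small helper; taking the midpoint of each percentage range, and the parameter names, are our choices for illustration, not rules from the cited sources:

```python
# A sketch of the abstract-length guidelines: 10-15% of the original for a
# balanced abstract, 20-30% when more informational detail is needed,
# clipped to any imposed minimum or maximum word count.

def target_length(original_words, detailed=False, min_words=None, max_words=None):
    low, high = (0.20, 0.30) if detailed else (0.10, 0.15)
    target = round(original_words * (low + high) / 2)
    if min_words is not None:
        target = max(target, min_words)
    if max_words is not None:
        target = min(target, max_words)
    return target
```

For a 1000-word source, this suggests roughly 125 words for a balanced abstract and 250 for a more detailed one, before any imposed limits apply.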