consistency and navigational effectiveness in hypertext systems (Ellis et al., 1996). The problem might be alleviated when the text writer acts as a document engineer and is responsible for the assignment of content attributes and links. In this way, the writer of the text defines possible text uses and navigation between texts (cf. Barrett, 1989; Frants, Shapiro, & Voiskunskii, 1997, p. 137). However, document engineering is not always cost effective, especially when dealing with heterogeneous material such as text content. Time is gained when searching for information, because document mark-ups make the information more accessible, but extra time is needed to assign the mark-ups accurately.
Hence, the document engineer could use some extra automatic support for assigning content attributes to texts at the time of document creation (Alschuler, 1989; Wright & Lickorish, 1989; Brown, Foote, Jones, Sparck Jones, & Young, 1995). Especially for large active document collections, such as news texts, intended for a heterogeneous audience, this might be beneficial (Allen, 1990).
8. THE NEED FOR BETTER AUTOMATIC INDEXING AND ABSTRACTING TECHNIQUES
Figure 3. The importance of the text representations (r1 ... rn) in information retrieval and selection.
It is important for the user of a document collection to find documents or information relevant to his or her need. Even if a user has no well-defined information need and wants to browse the document collection, he or she wants to be guided in his or her selection of documents. Information retrieval and filtering systems, question-answering systems, and browsing systems that operate upon textual documents all rely upon characterizations of their content (Figure 3). These text representations are the result of indexing and abstracting the texts. The text representations are matched with representations of the information need, or they guide the user in selecting relevant documents or information. The quality of the retrieved and selected information is of increasing importance (Convey, 1992, p. 105).
The users of the still-expanding electronic databases and libraries want to retrieve all relevant documents or information, but do not want to be overwhelmed with documents that are irrelevant or only marginally relevant to their needs. The users of browsing systems want to be effectively directed towards interesting documents, without being submerged in possible choices. Currently, this is far from realized for textual databases. There is a real information (retrieval) problem. The problem is caused by incorrect and incomplete representations of an information need and of the content of document texts, and by a probabilistic matching between both.
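The two wishes of the user described above, retrieving all relevant documents while avoiding irrelevant ones, are commonly quantified as recall and precision. These terms are not used in the passage itself, so the following Python sketch is only an illustration under that assumption; the function name `precision_recall` and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision: share of the retrieved documents that are relevant
    recall:    share of the relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]  # what the system returned
relevant = ["d2", "d4", "d7"]         # what the user actually needed

p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

In this toy case, half of the retrieved documents are irrelevant (the user is "overwhelmed"), and one relevant document is missed (the retrieval is incomplete).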
Indexing commonly extracts from or assigns to the text a set of single words or phrases that function as key terms. Words or phrases taken from the text itself are commonly called natural language index terms. When the assigned words or phrases come from a fixed vocabulary, they are called controlled language index terms. The index terms, besides reflecting content, can be used as access points or identifiers of the text in the document collection. This form of text representation is used in information retrieval and filtering systems (Figure 3). Abstracting results in a reduced representation of the content of the text. The abstract usually has the form of a continuous, coherent text or of a profile that structures certain information of the text. Abstracts are mainly used in question-answering systems and browsing systems (Figure 3). Indexing and abstracting of the content of texts are traditionally manual tasks. In the growing document collections, human indexing and abstracting is not feasible in terms of efficiency and cost. Moreover, the manual process is not always carried out consistently. However, current text representations that are automatically generated do not accurately and completely represent the content of texts. Better automatic indexing and abstracting techniques would certainly contribute to resolving the information retrieval problem.
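The distinction between natural language and controlled language index terms can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the stopword list, the tiny vocabulary `CONTROLLED_VOCABULARY`, the function names, and the use of raw term frequency as the selection criterion are not taken from this book.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are",
             "for", "on", "by", "after"}

# Hypothetical controlled vocabulary: maps text words to fixed descriptors.
CONTROLLED_VOCABULARY = {
    "flood": "NATURAL DISASTERS",
    "floods": "NATURAL DISASTERS",
    "earthquake": "NATURAL DISASTERS",
    "election": "POLITICS",
    "elections": "POLITICS",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def natural_language_index_terms(text, max_terms=5):
    """Extract the most frequent non-stopword words of the text itself."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(max_terms)]


def controlled_language_index_terms(text):
    """Assign descriptors from the fixed vocabulary triggered by text words."""
    tokens = tokenize(text)
    return sorted({CONTROLLED_VOCABULARY[t] for t in tokens
                   if t in CONTROLLED_VOCABULARY})


text = "Floods hit the region after the election. The floods damaged roads."
print(natural_language_index_terms(text))
print(controlled_language_index_terms(text))
```

The first function only selects words that occur in the text, whereas the second may assign descriptors ("NATURAL DISASTERS") that never appear in the text. This is the essential difference between the two kinds of index terms.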
Other solutions to the information retrieval problem have been proposed with some success. We saw that full-text search, relevance feedback, information agents, and document engineering all contribute to more effective information retrieval and selection systems. We also demonstrated that each of these answers benefits from a more refined characterization of the content of texts.
Full-text search is the simplest form of automatic indexing. It is generally assumed that inferior results of a full-text search are due to poor automatic identification of good content terms in the texts. Relevance feedback will improve when the content that is used in reformulating the query is identified more selectively in the documents. Such a selection is especially necessary when long documents are employed in the feedback process. The development of information agents goes hand in hand with the need for a more refined automatic characterization of the content of text. When learning a user’s profile, content features need to be identified in the document texts that are salient for the learning of the profile and that permit comparisons with a detailed profile. Multimedia information systems are being developed worldwide. The content of each object in a multimedia system (including textual objects) needs to be represented. Without such a representation, the system would not be able to integrate information from different media. At present, the representation of textual objects is done by intellectual attribution of key terms that should reflect the content, by intellectually linking text items that treat similar or related contents, or by intellectually creating abstracts that help in document selection. Here again, there is a need for an effective automatic characterization of the texts’ contents.
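The role of content selection in relevance feedback can be made concrete with a small Python sketch. It does not reproduce a method from this book: it uses a deliberately simple form of query expansion, in which the query is extended with the terms that occur in the largest number of documents judged relevant. The stopword list, the function names `content_terms` and `expand_query`, and the example documents are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list used to filter out non-content words.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are",
             "for", "from", "on", "by"}


def content_terms(text):
    """Return the lowercase non-stopword tokens of a text."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]


def expand_query(query_terms, relevant_docs, max_new_terms=3):
    """Extend the query with terms shared by many relevant documents.

    Only the `max_new_terms` best candidates are added, so that long
    documents do not flood the reformulated query with marginal terms.
    """
    doc_freq = Counter()
    for doc in relevant_docs:
        doc_freq.update(set(content_terms(doc)))  # count each doc once
    candidates = [(t, n) for t, n in doc_freq.items()
                  if t not in query_terms]
    # Highest document frequency first; alphabetical order breaks ties.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [t for t, _ in candidates[:max_new_terms]]


# Documents the user marked as relevant (invented for this example).
relevant_docs = [
    "Automatic indexing extracts key terms from texts.",
    "Key terms support automatic indexing and abstracting.",
]
new_query = expand_query(["indexing"], relevant_docs)
print(new_query)
```

The cut-off `max_new_terms` is the crude stand-in here for the selective identification of content: the better the candidate terms are chosen, the less noise the feedback step adds to the query.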
The above considerations all stress the need for more refined procedures for automatically indexing and abstracting texts. This brings us back to the point where we started the reasoning in this chapter. Natural language understanding of text is a difficult task. However, we feel that progress in content understanding is possible without relying upon a complete and complex processing of the texts aiming at their full understanding.
1. Progress can be made in defining the aboutness or the topics of a text.
Despite considerable improvements, the automatic identification of the aboutness of a text is still far from perfect. Ideally, a text should be represented by different levels of aboutness, allowing for a motivated zooming in on its topics and subtopics (Lewis & Sparck Jones, 1996).
Aboutness is a permanent quality of the text and has proven its usefulness in information selection in the past. As a cognitive model of text comprehension, the Kintsch and van Dijk model (1978) has potential for the automatic recognition of the aboutness of a text (Endres-Niggemeyer, 1989; Pinto Molina, 1995).
2. If indexing and abstracting techniques can correctly characterize the detailed topics, including specific information in texts, the detailed topics might correspond to a certain need of a user at a specific moment. Presently, the words of the full text are insufficiently powerful to capture such detailed content.
3. We need better techniques for extracting content from text that relates to the meaning that users may attach to the text (Fidel & Efthimiadis, 1994).
This seems a challenging task, but at least we can concentrate on those cases where texts are used with clear goals that are shared among a class of users (cf. Kintsch & van Dijk, 1978). This refers to what Maron (1977) calls the retrieval aboutness, which is the meaning of a text to a class of users.
Of course, the challenge is to identify a text’s content without having to process it based upon complete linguistic, domain-world, and contextual knowledge of the communication. We think that improvements are possible with a limited amount of added knowledge or with knowledge that is automatically acquired. Using a minimum of knowledge sources in text understanding fits in with traditional research in automatic indexing and abstracting in the field of information retrieval, where document collections are often very heterogeneous and are composed of texts of different types and origins. We especially focus upon techniques for better identification and extraction of content terms, indexing of sections or passages, automated methods for assignment of subject codes, information extraction, and text summarization techniques (cf. Carbonell, 1996).
We conclude that there is an absolute need for refined techniques for automatically indexing and abstracting document texts. These techniques form the subject of this book.
1 In this book we make the following distinction between the terms "data", "information" and "knowledge" (cf. Pao, 1987, pp. 10-11). Data are sets of symbols representing captured evidence of transactions and events. We use the term information for selected data. When we use the term knowledge, it refers to knowledge acquired by humans when executing a task or to knowledge as implemented in and employed by knowledge-based systems. The term "information retrieval" sometimes refers to information management in general; more often it refers to the retrieval of documents that satisfy a certain information need. The term is used in both senses in this book.