6. TEXT ABSTRACTING: ACCOMPLISHMENTS AND PROBLEMS
Often, an unordered set of index terms cannot accurately represent the content of texts. The richer semantic representation of an abstract has definite advantages over the “bag of words” representation used in indexing, despite its more complex computation (see chapter 3). The value of automatic text abstracting is not questioned, but it is important to create abstracts that are true reflections of the content of texts and that are useful for the task they are intended for. This is still problematic (cf. Edmundson, 1964), but there are promising directions to pursue.
1. The field of information extraction offers valuable solutions for identifying information in texts. However, the proposed solutions depend heavily on external knowledge, especially domain knowledge. Because of the knowledge acquisition bottleneck, successful applications operate in restricted subject domains. When knowledge of the domain model is cheap to acquire manually, this approach can be encouraged (cf. Cowie & Lehnert, 1996). There is, however, an emerging interest in concentrating on generic knowledge in text abstracting, such as linguistic knowledge and especially the discourse patterns of whole texts. Classical natural language parsing of the text might not yield a complete understanding of the text, but it may yield enough predictions for text abstracting. More importantly, discourse analysis has explored many discourse phenomena for spoken dialogue as well as written text, and has in particular proposed models of text structure and structural relations that appear especially relevant to summarizing. Research on text typology is clearly significant, either because different genres may require different abstracting strategies, or because general, genre-independent abstracting strategies may yet be discovered. Discourse structure is important for locating salient pieces of information in texts. It is generally agreed that we need more linguistic and socio-cultural studies on the nature of discourse and text (cf. Endres-Niggemeyer, 1989; Sparck Jones & Endres-Niggemeyer, 1995; Moens et al., 1999b). Besides integrating more generic knowledge, it is also important to develop tools that facilitate the implementation of knowledge across subject domains and text typologies. Finally, there is a growing interest in automatically acquiring the linguistic knowledge, especially the discourse patterns, as well as the domain-dependent knowledge. Such learning techniques form a bridge with the more general statistical techniques of the next point.
2. The discipline of information retrieval traditionally exploits statistical techniques for content identification in texts of broad, unrestricted domains. Additionally, some of the techniques recently developed for recognizing thematic structures in texts have potential for automatic text abstracting. The supervised techniques for learning classification patterns are also promising. They are useful for acquiring the typical discourse patterns of document collections, or for learning domain concepts (a minimal sketch of a statistical sentence-scoring technique of this kind is given after this list).
3. The transformation step, in which a source text representation is reduced to form the summary representation, is too often restricted to an information selection process. Replacing the concepts of a source text with more general concepts in the summary text is rather neglected in automatic text abstracting (a small illustration of such a generalization step follows this list).
4. Human summarizing as a professional activity has practices and guidelines that are useful as a source of summarizing models. In addition, psychological studies of discourse reading and its retention in memory, as evidenced by summarizing, can throw further light on the text features that are remembered and on the properties of a text that serve to identify what is important in it.
5. There is a growing interest in abstracts of multiple texts. Comparative abstracts of multiple texts are especially beneficial for accessing large document collections. Here, it is very important to start from good source representations of the original texts. Additionally, developing statistical techniques that recognize similarities and differences between the source representations is a promising research area (a sketch of a simple similarity measure between such representations follows this list).
6. In order to tailor abstracts to specific needs, we need more studies of how abstracts can be used in text retrieval and other related text-based tasks, and of how this use determines the form and content of the summary (cf. Sparck Jones & Endres-Niggemeyer, 1995).
7. A final problem concerns evaluation of the generated abstracts. More research is needed to develop suitable effectiveness measures.
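The sketch below illustrates the kind of statistical technique referred to in point 2: sentences are scored by the distribution of their content words and the top-ranked ones are extracted. It is only a minimal illustration under assumed tokenization and stop word choices, not one of the specific systems discussed in this chapter.

    # Minimal sketch of frequency-based sentence extraction (illustrative only;
    # the stop word list, tokenization and scoring function are assumptions).
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "that", "for"}

    def extract_summary(text, num_sentences=2):
        """Score each sentence by the average frequency of its content words
        and return the top-ranked sentences in their original order."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(w for w in re.findall(r"[a-z']+", text.lower())
                       if w not in STOP_WORDS)
        scored = []
        for i, sentence in enumerate(sentences):
            content = [w for w in re.findall(r"[a-z']+", sentence.lower())
                       if w not in STOP_WORDS]
            score = sum(freq[w] for w in content) / (len(content) or 1)
            scored.append((score, i, sentence))
        top = sorted(scored, reverse=True)[:num_sentences]
        return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))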
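Point 3 argues that generalizing concepts is a neglected part of the transformation step. The following hypothetical fragment shows what such a generalization could look like once the required hypernym knowledge is available; the concept table is invented for the illustration, and a thesaurus or ontology would supply it in practice.

    # Hypothetical generalization step: replace specific concepts with broader
    # ones via a hand-coded hypernym table (the entries are invented examples).
    HYPERNYMS = {
        "influenza": "infectious disease",
        "measles": "infectious disease",
        "aspirin": "medication",
        "ibuprofen": "medication",
    }

    def generalize(concepts):
        """Map each extracted concept to its broader concept, if one is known,
        and collapse duplicates."""
        return sorted({HYPERNYMS.get(c, c) for c in concepts})

    # generalize(["influenza", "measles", "aspirin"])
    #     -> ["infectious disease", "medication"]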
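Finally, a basic building block for the comparative, multi-text abstracts of point 5 is a measure of similarity between source representations. The sketch below compares two bag-of-words representations with the cosine measure; it merely illustrates the kind of statistical comparison meant, with the representational choices assumed.

    # Minimal sketch: cosine similarity between two bag-of-words source
    # representations (tokenization and weighting are deliberately crude).
    import math
    import re
    from collections import Counter

    def bag_of_words(text):
        """A crude term-frequency vector over lowercased word tokens."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    def cosine(a, b):
        """Cosine similarity between two sparse term-frequency vectors."""
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    # Passages with high cosine similarity cover shared content; low-similarity
    # material signals the differences a comparative abstract should report.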
7. CONCLUSIONS
Summarization is crucial to information and knowledge organization.
Automatic abstracting is a good solution for managing a textual information overload. The automatically generated abstract, even though it only approximates the ideal one, is very valuable in selecting documents and information from large collections.
In this chapter we emphasized text analysis and selection of salient information from the original text. The techniques fall into two classes.
Some rely heavily on symbolic knowledge and produce good-quality abstracts, but they are often tied to a specific application. Others are more general and statistically process the distribution patterns of words, but they produce less accurate abstracts. Learning techniques bridge the gap between the stronger, domain-specific methods and the more general, but weaker, methods. Several research strategies have been proposed, some of which will be explored in the following chapters. On the one hand, we need more studies of discourse and text in order to generate cohesive abstracts with proper coverage and balance at different levels of informational detail. On the other hand, the development of statistical programs for pattern recognition is important for acquiring the discourse patterns, especially the domain- and/or collection-dependent text patterns. These techniques might include supervised as well as unsupervised learning algorithms.