

In document AND ABSTRACTING (Pages 161-165)

THE CREATION OF TEXT SUMMARIES

3. THE TEXT ANALYSIS STEP

3.2 Statistical Processing

of individual sentences can be exploited to pinpoint the topics that are most in focus, which can be used for identifying key text topics and subtopics (Sidner, 1983; Kieras, 1985). Here also, cue words hint at significant concepts (Paice & Jones, 1993) or provide the context for the thematic roles of certain phrases and clauses (Wendlandt & Driscoll, 1991). According to Kieras (1985), topic processing can proceed largely on the basis of limited knowledge of the semantics of the subject matter, without an understanding of the passage content. This is a hypothesis that should be tested in practical systems. The thematic structure of texts is especially useful for generating abstracts that reflect the main topics and subtopics of texts. Knowledge of this structure is also important for producing abstracts at different levels of topic granularity.

The usefulness of discourse patterns in text summarization suggests the idea of representing texts by means of a text grammar (Paice, 1981, 1991; Paice & Jones, 1993; Rama & Srinivasan, 1993). Texts, like sentences, have a kind of grammar, i.e., a set of implicit rules that writers and readers assume, that help govern the selection and ordering of elements in a discourse, and that make texts understandable (cf. Reichman, 1985). The importance of discourse structure in text summarization was also stressed at the Spring Symposium on Intelligent Text Summarization (1998) organized by the American Association for Artificial Intelligence.

3.2.1 Identification of the topics of a text

In information retrieval research there is a long tradition of identifying words and phrases in a text that reflect its topics based on their distribution characteristics in the text and/or in a reference corpus (see chapter 4). The topical terms can form the basis of the text’s summary.

Significant words and phrases reflect a text's content and may serve well as crude abstracts (keyword abstracts) (Cohen, 1995). Phrases, especially noun phrases, are considered important semantic carriers of the information content (Maeda et al., 1980; Kupiec et al., 1995). There are many techniques for word and phrase weighting in texts. Moreover, significant words and phrases help in identifying the relevant sentences that are retained for summary purposes. A simple, but still attractive, approach extracts sentences that contain highly weighted terms, possibly in close proximity (Luhn, 1958; Edmundson, 1969; Earl, 1970; Salton, 1989, p. 439 ff.). So, clusters of significant words within sentences are located and sentences are scored accordingly. A variant of this approach treats query terms as the content terms around which the summary is built, i.e., highly weighted query terms in close proximity determine the sentences to be extracted (Tombros & Sanderson, 1998). This variant allows the summary to be tailored to the needs of a user.
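A minimal sketch of such cluster-based sentence scoring, after Luhn (1958), is given below. The function name, the tiny stopword list, and the particular cluster score (significant words squared over cluster span) are illustrative choices, not details prescribed by the original systems.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "for", "on", "with"}

def luhn_scores(sentences, top_n=10, max_gap=4):
    """Score sentences by clusters of significant words (after Luhn, 1958).

    A word counts as 'significant' if it is among the top_n most frequent
    non-stopwords of the whole text. Within a sentence, a cluster is a run
    of words in which consecutive significant words are at most max_gap
    positions apart; the cluster score is (significant words)^2 / span."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words if w not in STOPWORDS)
    significant = {w for w, _ in freq.most_common(top_n)}

    scores = []
    for words in tokenized:
        positions = [i for i, w in enumerate(words) if w in significant]
        best, start = 0.0, 0
        # scan maximal clusters of nearby significant words
        for i in range(1, len(positions) + 1):
            if i == len(positions) or positions[i] - positions[i - 1] > max_gap:
                cluster = positions[start:i]
                if cluster:
                    span = cluster[-1] - cluster[0] + 1
                    best = max(best, len(cluster) ** 2 / span)
                start = i
        scores.append(best)
    return scores
```

Sentences can then be ranked by these scores and the highest-scoring ones extracted.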

Figure 2. Paragraph grouping for theme recognition: a lower similarity threshold connects more paragraphs into a broader theme group. (Two panels contrast paragraph connections with minimum similarity α versus minimum similarity β between a pair of paragraphs, where α > β.)

Significant words and phrases also help in determining the thematic structure of a text and in extracting representative sentences or paragraphs of important text topics to form a summary.

There is a growing interest in identifying the thematic structure of a text based on its term distributions (Figure 2). Techniques concern the grouping of textual units (fixed number of words or units marked orthographically such as sentences or paragraphs) that have similar patterns of content terms.

This approach has been elaborated by Salton and his co-researchers (Salton, Allan, Buckley, & Singhal, 1994; Salton et al., 1997). In their approach, paragraphs are grouped if there is sufficient overlap of their content terms.
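A minimal sketch of such paragraph grouping is shown below, assuming cosine similarity over raw term-frequency vectors and a fixed threshold; the union-find transitive closure (A groups with C whenever A~B and B~C) is an illustrative choice, not a detail taken from Salton et al.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def group_paragraphs(paragraphs, threshold):
    """Group paragraphs whose pairwise cosine similarity exceeds threshold."""
    vecs = [Counter(re.findall(r"[a-z']+", p.lower())) for p in paragraphs]
    parent = list(range(len(paragraphs)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) > threshold:
                parent[find(i)] = find(j)  # merge the two theme groups

    groups = {}
    for i in range(len(paragraphs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Raising the threshold splits the text into more, narrower theme groups; lowering it merges them, mirroring the zooming behavior described below.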

Such a grouping may reveal the main topics of the text. The similarity between a pair of paragraphs is computed by applying the cosine function to the vector representations of their content terms (cf. Jones & Furnas, 1987). Paragraphs are grouped if their mutual similarity exceeds a predefined threshold value. Varying the threshold value broadens or narrows the grouping, ideally allowing for hierarchically arranged contexts wherein users can zoom from one context to another (Salton et al., 1997). The research of Hearst and Plaunt (1993) (Hearst, 1997), which aims at detecting the subtopics of a text, follows a similar course. Based upon the assumption that the main topics of an expository text occur throughout the text, while subtopics have only a limited extent in the text, their system TextTiling automatically reveals the structure of subtopics. TextTiling computes the similarity of each pair of adjacent text units, where the units compared consist of about 3-5 sentences. The resulting sequence of similarity values is plotted in a graph, which is smoothed and examined for peaks and valleys. Valleys in the graph identify ruptures in the topic structure. TextTiling has been applied to structure the articles of a scientific journal according to subtopics.
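The smoothing and valley detection over the similarity sequence could be sketched as follows. The depth measure (how far a gap's similarity dips below the nearest peaks on either side) is a common TextTiling-style choice; the function names and the moving-average window are illustrative assumptions.

```python
def smooth(values, window=3):
    """Moving-average smoothing of a sequence of similarity values."""
    half = window // 2
    return [
        sum(values[max(0, i - half): i + half + 1]) /
        len(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]

def boundary_depths(sims):
    """For each gap between adjacent text units, measure how deep the
    similarity valley is relative to the peaks on either side; the
    deepest valleys mark likely subtopic boundaries."""
    depths = []
    for i, s in enumerate(sims):
        left, j = s, i
        while j > 0 and sims[j - 1] >= left:   # climb to the peak on the left
            left = sims[j - 1]
            j -= 1
        right, j = s, i
        while j < len(sims) - 1 and sims[j + 1] >= right:  # peak on the right
            right = sims[j + 1]
            j += 1
        depths.append((left - s) + (right - s))
    return depths
```

Gaps whose depth exceeds some cutoff (e.g., a function of the mean and standard deviation of all depths) are then reported as subtopic boundaries.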

Once the thematic structure is determined, it can be used to selectively extract important sentences or paragraphs from the text and to traverse the extracted units in reading order, constructing a text extract that serves as a summary. The idea goes back to Prikhod’ko and Skorokhod’ko (1982), who studied the importance of links between sentences in text summarization.

Each sentence is scored by the number of links (common content terms or concepts) with the other sentences of the text. Sentences whose score surpasses a threshold are included in the abstract. This approach is based on the assumption that sentences related to a large number of other sentences are highly informative and are prime candidates for extraction. Recently, a few algorithms have been proposed to extract representative text paragraphs in order to form a readable and topically balanced abstract (Salton et al., 1997). The algorithms suggested have relatively poor results: when compared with manual abstracts, an overlap of at most 46% is obtained. The best score is achieved with an algorithm that extracts paragraphs that are highly linked with other paragraphs or have a large overlap of content terms with other paragraphs. In chapter 8 we discuss the shortcomings of these algorithms and propose alternative ones.
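Link-based sentence scoring of this kind can be sketched as follows; the stopword list, the one-shared-term link criterion, and the function names are illustrative assumptions rather than the exact formulation of Prikhod’ko and Skorokhod’ko.

```python
import re

STOP = {"the", "a", "an", "of", "in", "to", "and", "is", "are"}

def link_scores(sentences, min_shared=1):
    """Score each sentence by its number of links to other sentences,
    where a link exists when two sentences share at least min_shared
    content terms."""
    term_sets = [set(re.findall(r"[a-z']+", s.lower())) - STOP
                 for s in sentences]
    scores = []
    for i, terms in enumerate(term_sets):
        links = sum(
            1 for j, other in enumerate(term_sets)
            if j != i and len(terms & other) >= min_shared
        )
        scores.append(links)
    return scores

def extract(sentences, threshold):
    """Keep sentences whose link score surpasses the threshold,
    preserving reading order."""
    scores = link_scores(sentences)
    return [s for s, sc in zip(sentences, scores) if sc > threshold]
```

Because heavily linked sentences are assumed to be the most informative, the threshold directly controls the length of the resulting extract.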

3.2.2 Learning the importance of summarization parameters

Discourse patterns, including the distribution and linguistic signaling of highly topical sentences, may vary according to the document corpus or the text type. Also, when information is selected from the source to be included in a specific task-oriented summary, discourse patterns can have different weights. There are experiments in learning the values of discourse parameters. Kupiec et al. (1995) compute the weight of certain discourse patterns based upon an example base of texts and their abstracts. On the basis of a corpus of technical papers with abstracts written by professional abstractors, the system identifies those sentences in the text which also occur in the summary. It then acquires a model of the “abstract-worthiness” of a sentence as a combination of a limited number of properties or parameters of that sentence. The properties accounted for are: the length of sentences; sentences containing indicator phrases, or sentences following section headings that contain indicator phrases; sentences in the first ten and the last five paragraphs; the first, final, or medial sentences; sentences with frequent content words; and sentences with proper names that occur more than once.

A classification function (Bayesian independence classifier) (cf. chapter 5 (12-15)) is developed that estimates the probability that a given sentence is included in the abstract, given the probabilities of its properties in the texts of the training base. Each sentence is described by a number of discourse patterns, and the probability of inclusion in the summary is computed from estimates of the probabilities of the patterns in example abstracts and example source texts. When abstracting a new text, its sentences are ranked according to this probability and a specified number of top-scoring sentences is selected. This approach offers a direct method for finding an optimal combination of selection heuristics based on discourse patterns. The summarizer has been tested on publications in the scientific-technical domain. The best results (43% correctness in correspondence with sentences manually extracted by professional abstractors) are obtained by a combination of location, cue phrase, and sentence length heuristics. The experiment is replicated by Teufel and Moens (1997), who demonstrate the usefulness of the approach for text analysis and selection in a summarization task. It might be noted that the statistical independence of discourse patterns employed in a Bayesian classifier is sometimes a false assumption. Recent attempts to use discriminant functions and techniques for inducing logical rules (e.g., the C4.5 algorithm of Quinlan, 1993) (see chapter 5) in acquiring discourse patterns show encouraging learning performance (Mani & Bloedorn, 1998).
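A minimal sketch of such a Bayesian independence classifier over binary discourse features is given below. The Laplace smoothing and the log-odds form of the score are standard textbook choices, not details taken from Kupiec et al. (1995); the independence assumption criticized above is visible in the per-feature sum.

```python
import math

def train(vectors, labels, alpha=1.0):
    """Estimate, with Laplace smoothing, P(F_j = 1 | in summary),
    P(F_j = 1 | not in summary), and the prior P(in summary) from
    binary feature vectors; labels say whether each training sentence
    also occurred in the professional abstract."""
    k = len(vectors[0])
    pos = [v for v, y in zip(vectors, labels) if y]
    neg = [v for v, y in zip(vectors, labels) if not y]
    p_f_pos = [(sum(v[j] for v in pos) + alpha) / (len(pos) + 2 * alpha)
               for j in range(k)]
    p_f_neg = [(sum(v[j] for v in neg) + alpha) / (len(neg) + 2 * alpha)
               for j in range(k)]
    prior = (len(pos) + alpha) / (len(vectors) + 2 * alpha)
    return p_f_pos, p_f_neg, prior

def log_posterior(model, v):
    """Log-odds that a sentence with feature vector v belongs in the
    summary, assuming the features are statistically independent."""
    p_pos, p_neg, prior = model
    score = math.log(prior / (1 - prior))
    for j, f in enumerate(v):
        pp = p_pos[j] if f else 1 - p_pos[j]
        pn = p_neg[j] if f else 1 - p_neg[j]
        score += math.log(pp / pn)  # independence assumption
    return score
```

To abstract a new text, every sentence's feature vector is scored with `log_posterior` and the top-ranked sentences are extracted, mirroring the ranking step described above.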

From the above, it is clear that the statistical techniques offer opportunities to develop unsupervised as well as supervised techniques to learn discourse patterns and to avoid, at least partially, the knowledge acquisition step in text analysis. This is a promising research area. Parallel to this research, more must become known about the discourse patterns of source texts and about significant discourse parameters for text summarization.
