The first part, "The Indexing and Abstracting Environment", places the problem in a broad context and defines the important concepts used in the book. The third chapter, entitled "The creation of highlights in magazine articles", concerns the portability of the text grammar to this type of text.
THE INDEXING AND ABSTRACTING ENVIRONMENT
THE NEED FOR INDEXING AND ABSTRACTING TEXTS
- INTRODUCTION
- ELECTRONIC DOCUMENTS
- COMMUNICATION THROUGH NATURAL LANGUAGE TEXT
- UNDERSTANDING OF NATURAL LANGUAGE TEXT: THE COGNITIVE PROCESS
- UNDERSTANDING OF NATURAL LANGUAGE TEXT: THE AUTOMATED PROCESS
- IMPORTANT CONCEPTS IN INFORMATION RETRIEVAL AND SELECTION
- Aboutness and Meaning
- Relevance
- The Information Need
- The Information (Retrieval) Problem
Motivational relevance includes the purpose of the search and the intended use of the information. At the heart of information retrieval is the problem of assessing the value of the content of a given document in relation to a given information need.
GENERAL SOLUTIONS TO THE INFORMATION RETRIEVAL PROBLEM
- Full-Text Search and Retrieval
- Relevance Feedback
- Information Agents
- Document Engineering
Such a profile is called a user's model. The agent knows the user's interests, goals, habits, preferences and/or background, or gradually becomes more effective as it learns this profile (Maes, 1994; Koller & Shoham, 1996). The use of such comments greatly benefits the accessibility of the information contained in and attached to the documents.
- THE NEED FOR BETTER AUTOMATIC INDEXING AND ABSTRACTING TECHNIQUES
We also demonstrated that each of these responses benefits from a more refined characterization of the texts' content. The development of information agents goes hand in hand with the need for a more refined automatic characterization of text content.
THE ATTRIBUTES OF TEXT
THE STUDY OF TEXT
At the micro level of description, discourse analysis refers to the vocabulary, syntax and semantics of individual sentences, clauses and phrases (van Dijk, 1997). In addition to the properties of the text itself, discourse analysis examines the characteristics of the social situation of the communication event.
AN OVERVIEW OF SOME COMMON TEXT TYPES
In addition to expository and narrative text, there are text types that are specific to particular disciplines. Some other text types with an entertaining function (e.g., magazine articles) are also interesting candidates for automatic indexing, which facilitates subsequent automatic selection.
TEXT DESCRIBED AT A MICRO LEVEL
- Phonemes and Letters
- Morphemes
- Words
- Phrases
- Sentences
- Clauses
- Marks
Derivational morphemes change the category of the base word (for example, "friendly": the derivation of an adjective from a noun). Also, one word can generalize or specify the meaning of another word (for example, the word "apple" specifies the word "fruit").
TEXT DESCRIBED AT A MACRO LEVEL
- The Schematic Structure or Superstructure
- The Rhetorical Structure
- The Thematic Structure Definition
- The Communicative Goal
- Text Length
The topics of a text are closely related to linguistic phenomena on the surface of the text: the topics are, after all, expressed by the words in the sentences of the text.
CONCLUSIONS
These structures, which order the content of the text, contribute to the successful realization of the communicative purpose. Communicative purpose is a property of the text from the point of view of the creator of the text.
TEXT REPRESENTATIONS AND THEIR USE
DEFINITIONS
Text indexing and abstracting is both a human intellectual process and an automated process. The term document representative can also refer to the product of text indexing and abstracting (van Rijsbergen, 1979, p. 14; Lewis, Croft, & Bhandaru, 1989).
REPRESENTATIONS THAT CHARACTERIZE THE CONTENT OF TEXT
- Set of Natural Language Index Terms
- Set of Controlled Language Index Terms
- Abstract
Because of the lack of fixed index terms, natural language index terms make a textual database portable and compatible across different document collections. A controlled vocabulary is essentially a predefined list of index terms constructed by some authority in relation to document collection management.
INTELLECTUAL INDEXING AND ABSTRACTING
- General
- Intellectual Indexing
- Intellectual Abstracting
Once the topics of the text have been identified, specific topics or information can be selected. In a next step, the identified content of the text is translated into a set of index terms.
USE OF THE TEXT REPRESENTATIONS
- Indicative and Informative Text Representations
- Information Retrieval Systems
- Question-Answering Systems
- Browsing Systems
In a pure Boolean model, a ranking of the retrieved documents by relevance is not available. In the probabilistic model, term independence is assumed when estimating the probability of a document's relevance to a query.
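The absence of ranking in a pure Boolean model can be sketched as follows; the document term sets are invented for illustration:

```python
# A minimal sketch of pure Boolean retrieval: a document either satisfies
# the query or it does not, so all matching documents are returned unranked.
# The toy documents below are invented for illustration.

docs = {
    1: {"court", "crime", "evidence"},
    2: {"court", "appeal"},
    3: {"crime", "evidence", "witness"},
}

def boolean_and(query_terms, docs):
    """Return the ids of documents containing ALL query terms (no ranking)."""
    return sorted(d for d, terms in docs.items() if query_terms <= terms)

result = boolean_and({"crime", "evidence"}, docs)
```

Documents 1 and 3 both satisfy the conjunctive query, but nothing in the model says which of the two is more relevant.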
A NOTE ABOUT THE STORAGE OF TEXT REPRESENTATIONS
HTML defines a markup language for laying out and displaying hypertext and for defining links between text objects. A hypertext reference can be an anchor that indicates the position of a text when the original referenced text is stored in the same file as its representation.
CHARACTERISTICS OF GOOD TEXT REPRESENTATIONS
In contrast to the above, a text representation is often a reduction of the original text content. It is not enough for a text representation to be a good description of the content of the source text.
CONCLUSIONS
AUTOMATIC INDEXING
THE SELECTION OF NATURAL LANGUAGE INDEX TERMS
A NOTE ABOUT EVALUATION
The selection of natural language index terms is usually evaluated extrinsically: extrinsic evaluation assesses the quality of the index terms by how well they perform in another task.
LEXICAL ANALYSIS
Numbers in texts usually do not make good index terms and are often ignored. The case of letters is usually not significant in index terms, so all characters can be converted to either lowercase or uppercase.
USE OF A STOPLIST
A specialized text database may also contain words that are useless as index terms even though they are not frequent in the standard language. A more aggressive method for removing such domain-specific stop words uses a collection of training texts and information about the terms' behavior in the training set (Wilbur & Sirotkin, 1992; Yang & Wilbur, 1996).
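The basic lexical-analysis and stoplist steps described above can be sketched as follows; the stoplist is a tiny illustrative sample, not an actual system stoplist:

```python
# Sketch of lexical analysis followed by stoplist filtering.
# The stoplist here is a small invented sample for illustration.

import re

STOPLIST = {"the", "of", "a", "in", "is", "and"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop numbers."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Discard tokens that appear in the stoplist."""
    return [t for t in tokens if t not in stoplist]

tokens = remove_stopwords(tokenize("The court heard 3 witnesses in the case"))
```

The number "3" disappears in tokenization, and the high-frequency function words are removed by the stoplist, leaving only candidate index terms.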
STEMMING
However, such a table becomes large when it must cover the terms of the standard language and possibly the terms of the specialized subject domain of the text corpus. In the successor variety approach, for each possible initial letter sequence of a word, the number of distinct letters that follow it in the corpus is computed.
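The successor-letter counting just described can be sketched as follows; the corpus is invented, and a sharp peak in the counts suggests a morpheme boundary (a candidate stemming point):

```python
# Successor-variety counts for the letter prefixes of a word, computed
# against a small invented corpus.

def successor_variety(word, corpus):
    """For each prefix of `word`, count the distinct letters that follow
    that prefix across all words in the corpus."""
    counts = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        counts.append((prefix, len(successors)))
    return counts

corpus = ["reader", "reads", "reading", "real", "red"]
sv = successor_variety("reader", corpus)
```

For this toy corpus the count peaks at the prefix "read" (followed by "e", "s" and "i"), correctly suggesting a stem boundary after "read".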
THE SELECTION OF PHRASES
- Statistical Phrases
- Syntactic Phrases
- Normalization of Phrases
- Recognition of Proper Names
The method has two steps: identification of the classes (parts of speech) of the words of the text, and recognition of combinations of word classes in the text. The parse tree indicates dependencies between the phrase components of the sentence (e.g., the head and modifier of a phrase).
INDEX TERM WEIGHTING
- The General Process
- Classical Weighting Functions
The term frequency (tf) measures how often an index term occurs in the document text (Salton & Buckley, 1988): tfi = the frequency of occurrence of index term i in the text. The weight of a phrase component is usually calculated as the product of the term frequency (tf, equation (2)) and the inverse document frequency (idf, equation (5)) of the individual word.
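The classical tf x idf product can be sketched on a toy collection; here idf is taken as log(N / n_i), one common variant, where N is the number of documents and n_i the number of documents containing term i, and the documents are invented:

```python
# Hedged sketch of tf x idf weighting on an invented toy collection.

import math

docs = [
    ["court", "crime", "court"],
    ["crime", "witness"],
    ["appeal", "court"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                # term frequency in this text
    n_i = sum(term in d for d in docs)  # number of documents with the term
    idf = math.log(len(docs) / n_i)     # inverse document frequency
    return tf * idf

w = tf_idf("court", docs[0], docs)
```

A term that occurs often in one text but in few texts of the collection receives a high weight, which is exactly the discriminating behavior the weighting aims for.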
ALTERNATIVE PROCEDURES FOR SELECTING INDEX TERMS
- The Multiple Poisson (nP) Model of Word Distribution
- The Role of Discourse Structure
But the mean of this Poisson process depends on the degree of topic coverage associated with the topic term. Thus, the distribution of a certain text term i in texts within the entire collection is governed by the sum of Poisson distributions, one for each class of topic coverage.
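The mixture of Poisson distributions described above can be sketched numerically; the mixing weights and class means below are invented for illustration, not estimated values:

```python
# Sketch of the multiple (nP) Poisson model: the frequency of a term
# across texts follows a sum of Poisson distributions, one per class
# of topic coverage, weighted by the class probabilities.

import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def mixture_pmf(k, weights, means):
    """P(tf = k) under a weighted sum of Poisson distributions."""
    return sum(w * poisson_pmf(k, lam) for w, lam in zip(weights, means))

# Two invented classes of topic coverage: non-topical texts (low mean)
# and topical texts (high mean).
p0 = mixture_pmf(0, [0.7, 0.3], [0.1, 3.0])
```

The low-mean component models texts where the term appears only incidentally; the high-mean component models texts that actually treat the topic, which is what makes the term's distribution informative for indexing.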
SELECTION OF NATURAL LANGUAGE INDEX TERMS: ACCOMPLISHMENTS AND PROBLEMS
There is also a lot of research on the structural decomposition of texts according to different topics (Salton & Buckley, 1991; Hearst & Plaunt, 1993; Salton, Allan, Buckley, & Singhal, 1994; Salton, Singhal, Mitra, & Buckley, 1997), which can be useful for identifying important thematic terms in texts.
CONCLUSIONS
The following chapters of this section describe alternative indexing and abstracting techniques that alleviate these problems. Length normalization can be part of query and document matching, when similarity functions include a length normalization factor (e.g., dividing by the product of the Euclidean lengths of the vectors being compared in the cosine function) (Jones & Furnas, 1987).
THE ASSIGNMENT OF CONTROLLED LANGUAGE INDEX TERMS
THESAURUS TERMS
- The Function of Thesaurus Terms
- Thesaurus Construction and Maintenance
- Statistical methods
- Syntactic methods
The context of a word ranges from the local context (e.g., words in the same sentence or in surrounding sentences) and the whole text in which the word appears, to the whole corpus (e.g., for distinguishing the meanings of words in short texts). This description can be used for disambiguation, for example by finding occurrences of words from the description in a document (Lesk, 1986, cited in Krovetz & Croft, 1992).
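The Lesk-style overlap idea, choosing the sense whose description shares the most words with the word's context, can be sketched as follows; the senses and their descriptions are invented for illustration:

```python
# Minimal sketch of Lesk-style disambiguation: pick the sense whose
# dictionary description overlaps most with the context words.
# The sense inventory below is invented for illustration.

def lesk(context_words, senses):
    """Return the sense whose description shares the most words with the context."""
    context = set(context_words)
    return max(senses, key=lambda s: len(context & senses[s]))

senses = {
    "bank/finance": {"money", "deposit", "loan", "institution"},
    "bank/river": {"river", "slope", "water", "edge"},
}
best = lesk(["she", "made", "a", "deposit", "of", "money"], senses)
```

The financial sense wins because two of its description words ("deposit", "money") occur in the context, while the river sense shares none.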
SUBJECT AND CLASSIFICATION CODES
- Text Categorization
- Text Classifiers with Manually Implemented Classification Patterns
- Text Classifiers that Learn Classification Patterns
A text classifier learns from a set of positive examples of the text class (texts that are relevant to the class) and possibly from a set of negative examples of the class (texts that are not relevant to the class). In the parametric training methods, the parameters are estimated from the training set by making an assumption about the mathematical functional form of the underlying population density distribution, such as a normal distribution.
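The parametric idea, assuming a functional form such as a normal distribution and estimating its parameters from the training examples, can be sketched as follows; the feature values are invented for illustration:

```python
# Sketch of parametric training: assume a normal form for a feature
# within a class and estimate its mean and variance from the positive
# training examples (invented values below).

import math

def fit_normal(values):
    """Estimate the mean and variance of a class-conditional normal density."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def normal_density(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

mean, var = fit_normal([0.9, 1.1, 1.0, 1.2, 0.8])
```

Once the parameters are estimated, the fitted density can score how typical a new feature value is for the class.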
LEARNING APPROACHES TO TEXT CATEGORIZATION
- Feature Selection and Extraction
- Feature selection
- Feature extraction
- Feature selection in text categorization
- Feature extraction in text categorization
- A note about cross validation
- Training with Statistical Methods
- Discrimination techniques
- An illustration: The Rocchio algorithm
- An illustration: The Widrow-Hoff algorithm
- Bayesian independence classifiers
- Learning of Rules and Trees
- Training with Neural Networks
The weight update rule minimizes the error by following its gradient. The term 2(w.x - y)x is the gradient (with respect to w) of the squared error (w.x - y)^2. The high connectivity of the network (i.e., the fact that there are many terms in the sum) means that errors in individual terms will have little overall effect.
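The Widrow-Hoff (LMS) update just described can be sketched on a single invented training pair; the learning rate and data are illustrative:

```python
# Sketch of the Widrow-Hoff (LMS) rule: for each training pair (x, y)
# the weight vector moves against the gradient 2(w.x - y)x of the
# squared error. Data and learning rate are invented for illustration.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def widrow_hoff_step(w, x, y, eta=0.1):
    """One LMS update: w <- w - eta * 2 * (w.x - y) * x."""
    err = dot(w, x) - y
    return [wi - eta * 2 * err * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(100):
    w = widrow_hoff_step(w, [1.0, 2.0], 1.0)
```

After training, w.x is driven toward the target y, since each step shrinks the residual error (w.x - y).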
ASSIGNMENT OF CONTROLLED LANGUAGE INDEX TERMS: ACCOMPLISHMENTS AND PROBLEMS
More specifically, the deviation of each unit's output from its correct value for the case is propagated back through the network; all relevant connection weights and unit biases are adjusted to bring the actual output closer to the target. Feature selection and extraction that rely on prior knowledge about the texts are recognized to be important in text classification.
CONCLUSIONS
In particular, it is difficult to automatically define the kind of relationship that holds between terms. Second, trainable text classifiers are often confronted with few positive examples of patterns to be learned.
AUTOMATIC ABSTRACTING
THE CREATION OF TEXT SUMMARIES
THE TEXT ANALYSIS STEP
- Deeper Processing
- The knowledge
- The parsing techniques
- The original models
- Other applications
- The significance of discourse structures
- Statistical Processing
- Identification of the topics of a text
- Learning the importance of summarization parameters
Discourse patterns, including the distribution and linguistic signaling of
A complete analysis processes every word in the text and allows the word to contribute to the representation of the meaning of the text. Each sentence is scored by the number of links (common content terms or concepts) with the other sentences of the text.
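The sentence-linking heuristic described above, scoring each sentence by the content terms it shares with the other sentences, can be sketched as follows; the sentences and the tiny stoplist are invented, and the tokenization is deliberately naive:

```python
# Sketch of sentence scoring by links: each sentence is scored by the
# number of content terms it shares with the other sentences of the text.
# Sentences and stoplist are invented for illustration.

STOP = {"the", "a", "of", "in", "was", "is"}

def content_terms(sentence):
    return {w for w in sentence.lower().split() if w not in STOP}

def score_sentences(sentences):
    """Count, for each sentence, its shared-term links with all other sentences."""
    term_sets = [content_terms(s) for s in sentences]
    scores = []
    for i, terms in enumerate(term_sets):
        links = sum(len(terms & other)
                    for j, other in enumerate(term_sets) if j != i)
        scores.append(links)
    return scores

sents = [
    "The court examined the evidence",
    "The evidence was presented in court",
    "The weather was mild",
]
scores = score_sentences(sents)
```

The off-topic third sentence receives the lowest score, so a summarizer selecting the highest-linked sentences would omit it.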
THE TRANSFORMATION STEP
- Selection and Generalization of the Content
- Selection and Generalization of the Content of Multiple Texts
Selection and generalization of information also controls the length of the summary, i.e. the degree of compression of the original text. The length of a summary in relation to the length of the original text can vary (see Chapter 3, p. 60).
GENERATION OF THE ABSTRACT
It is important that the summary be a concise description of the content of the original document without losing clarity. Second, it is sometimes important that the summary follow the wording of the original text as closely as possible (Endres-Niggemeyer, 1989).
TEXT ABSTRACTING: ACCOMPLISHMENTS AND PROBLEMS
On the other hand, very short sentences often need to be completed with an adjacent sentence to increase the clarity of the summary. In doing so, it is very important to start from good representations of the original texts.
APPLICATIONS
TEXT STRUCTURING AND CATEGORIZATION WHEN SUMMARIZING LEGAL CASES
- TEXT CORPUS AND OUTPUT OF THE SYSTEM
- METHODS: THE USE OF A TEXT GRAMMAR
- Knowledge Representation
- Parsing and Tagging of the Text
- RESULTS AND DISCUSSION
- CONTRIBUTIONS OF THE RESEARCH
Segments of the same hierarchical level may, but not necessarily, follow each other in the text. Complex patterns (combinations in propositional logic of simple patterns) classify the texts of the criminal cases.
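The combination of simple patterns with propositional logic can be sketched as follows; the patterns and category below are invented for illustration and are not the rules of the actual system:

```python
# Sketch of classification with complex patterns: simple word patterns
# are combined with propositional connectives (AND, OR) to assign a
# category. The patterns below are invented for illustration.

def has(word):
    """Simple pattern: the text contains the given word."""
    return lambda text: word in text.lower().split()

def AND(*ps):
    return lambda text: all(p(text) for p in ps)

def OR(*ps):
    return lambda text: any(p(text) for p in ps)

# Hypothetical "theft" category: mentions taking AND (goods OR money).
theft_pattern = AND(has("took"), OR(has("goods"), has("money")))

is_theft = theft_pattern("The accused took the money from the register")
```

The complex pattern fires only when its propositional combination of simple patterns is satisfied, which is how the case texts are assigned to categories.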
CLUSTERING OF PARAGRAPHS WHEN SUMMARIZING LEGAL CASES
METHODS: THE CLUSTERING TECHNIQUES
We implemented the k-medoid method to group the paragraphs of the court's opinion according to topic (Figure 2). A brief example of eliminating redundant paragraphs in the alleged crimes is given (translated from Dutch).
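The k-medoid idea can be sketched compactly on one-dimensional toy data; medoids are actual data points, each point is assigned to its nearest medoid, and a medoid is replaced by the cluster member that minimizes the total within-cluster distance. The data and distance function are invented for illustration:

```python
# Compact sketch of k-medoid clustering on invented 1-D data.

def assign(points, medoids, dist):
    """Assign each point to its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters

def k_medoid(points, medoids, dist, iters=10):
    for _ in range(iters):
        clusters = assign(points, medoids, dist)
        # Replace each medoid by the member minimizing total distance.
        medoids = [min(members, key=lambda c: sum(dist(c, p) for p in members))
                   for members in clusters.values() if members]
    return assign(points, medoids, dist)

dist = lambda a, b: abs(a - b)
clusters = k_medoid([1, 2, 3, 10, 11, 12], [1, 10], dist)
```

Because medoids are real data points rather than averaged centroids, the method also works when only pairwise (dis)similarities between paragraphs are available, which suits text clustering.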
RESULTS AND DISCUSSION
The expert then linked each paragraph of the alleged offenses to the exact topic of the crime. This procedure was repeated for the opinion of the court part of the case text.