The first part, "The Indexing and Abstracting Environment", places the problem in a broad context and defines the important concepts used in the book. The third chapter, entitled "The creation of highlights in magazine articles", concerns the portability of the text grammar to this type of text.
THE INDEXING AND ABSTRACTING ENVIRONMENT
THE NEED FOR INDEXING AND ABSTRACTING TEXTS
- INTRODUCTION
- ELECTRONIC DOCUMENTS
- COMMUNICATION THROUGH NATURAL LANGUAGE TEXT
- UNDERSTANDING OF NATURAL LANGUAGE TEXT: THE COGNITIVE PROCESS
- UNDERSTANDING OF NATURAL LANGUAGE TEXT: THE AUTOMATED PROCESS
- IMPORTANT CONCEPTS IN INFORMATION RETRIEVAL AND SELECTION
- Aboutness and Meaning
- Relevance
- The Information Need
- The Information (Retrieval) Problem
Motivational relevance includes the purpose of the search and the intended use of the information. At the heart of information retrieval is the problem of assessing the value of the content of a given document in relation to a given information need.
GENERAL SOLUTIONS TO THE INFORMATION RETRIEVAL PROBLEM
- Full-Text Search and Retrieval
- Relevance Feedback
- Information Agents
- Document Engineering
Such a profile is called a user's model. The agent knows the user's interests, goals, habits, preferences and/or background, or gradually becomes more effective as it learns this profile (Maes, 1994; Koller & Shoham, 1996). The use of such comments greatly benefits the accessibility of the information contained in and attached to the documents.
- THE NEED FOR BETTER AUTOMATIC INDEXING AND ABSTRACTING TECHNIQUES
We also demonstrated that each of these responses benefits from a more refined characterization of the texts' content. The development of information agents goes hand in hand with the need for a more refined automatic characterization of text content.
THE ATTRIBUTES OF TEXT
THE STUDY OF TEXT
At the micro level of description, discourse analysis refers to the vocabulary, syntax and semantics of individual sentences, clauses and phrases (van Dijk, 1997). In addition to the properties of the text itself, discourse analysis examines the characteristics of the social situation of the communication event.
AN OVERVIEW OF SOME COMMON TEXT TYPES
In addition to expository and narrative text, there are text types that are specific to particular disciplines. Some other text types with an entertaining function (e.g., magazine articles) are also interesting candidates for automatic indexing, which facilitates subsequent automatic selection.
TEXT DESCRIBED AT A MICRO LEVEL
- Phonemes and Letters
- Morphemes
- Words
- Phrases
- Sentences
- Clauses
- Marks
Derivational morphemes change the category of the base word (for example, "friendly": the derivation of an adjective from a noun). Also, one word can generalize or specify the meaning of another word (for example, the word "apple" specifies the word "fruit").
TEXT DESCRIBED AT A MACRO LEVEL
- The Schematic Structure or Superstructure
- The Rhetorical Structure
- The Thematic Structure Definition
- The Communicative Goal
- Text Length
The topics of a text are closely related to linguistic phenomena on the surface of the text: the topics are, after all, expressed by the words in the sentences of the text.
CONCLUSIONS
These structures, which order the content of the text, contribute to the successful realization of the communicative purpose. Communicative purpose is a property of the text from the point of view of the creator of the text.
TEXT REPRESENTATIONS AND THEIR USE
DEFINITIONS
Text indexing and abstracting is both a human intellectual process and an automated process. The term document representative can also refer to the product of text indexing and abstracting (van Rijsbergen, 1979, p. 14; Lewis, Croft, & Bhandaru, 1989).
REPRESENTATIONS THAT CHARACTERIZE THE CONTENT OF TEXT
- Set of Natural Language Index Terms
- Set of Controlled Language Index Terms
- Abstract
Because of the lack of fixed index terms, natural language index terms make a textual database portable and compatible across different document collections. A controlled vocabulary is essentially a predefined list of index terms constructed by some authority in relation to document collection management.
INTELLECTUAL INDEXING AND ABSTRACTING
- General
- Intellectual Indexing
- Intellectual Abstracting
Once the topics of the text have been identified, specific topics or information can be selected. In a next step, the identified content of the text is translated into a set of index terms.
USE OF THE TEXT REPRESENTATIONS
- Indicative and Informative Text Representations
- Information Retrieval Systems
- Question-Answering Systems
- Browsing Systems
In a pure Boolean model, a ranking of the retrieved documents by relevance is not available. In the probabilistic model, term independence is assumed when estimating the probability of a document's relevance to a query.
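The absence of ranking in a pure Boolean model can be sketched as follows; the document term sets are invented for illustration:

```python
# A minimal sketch of pure Boolean retrieval: a document either satisfies
# the query or it does not, so all matching documents are returned unranked.
# The toy documents below are invented for illustration.

docs = {
    1: {"court", "crime", "evidence"},
    2: {"court", "appeal"},
    3: {"crime", "evidence", "witness"},
}

def boolean_and(query_terms, docs):
    """Return the ids of documents containing ALL query terms (no ranking)."""
    return sorted(d for d, terms in docs.items() if query_terms <= terms)

result = boolean_and({"crime", "evidence"}, docs)
```

Documents 1 and 3 both satisfy the conjunctive query, but nothing in the model says which of the two is more relevant.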
A NOTE ABOUT THE STORAGE OF TEXT REPRESENTATIONS
HTML defines a markup language for laying out and displaying hypertext and for defining links between text objects. A hypertext reference can be an anchor that indicates the position of a text when the original referenced text is stored in the same file as its representation.
CHARACTERISTICS OF GOOD TEXT REPRESENTATIONS
In contrast to the above, a text representation is often a reduction of the original text content. It is not enough for a text representation to be a good description of the content of the source text.
CONCLUSIONS
AUTOMATIC INDEXING
THE SELECTION OF NATURAL LANGUAGE INDEX TERMS
A NOTE ABOUT EVALUATION
The selection of natural language index terms is usually evaluated extrinsically: extrinsic evaluation assesses the quality of the index terms by how well they perform in another task.
LEXICAL ANALYSIS
Numbers in texts usually do not make good index terms and are often ignored. The case of letters is usually not significant in index terms, so all characters can be converted to either lowercase or uppercase.
USE OF A STOPLIST
A specialized text database may also contain words that are useless as index terms even though they are not frequent in the standard language. A more aggressive method for removing such domain-specific stop words uses a collection of training texts and information about the terms' behavior in the training set (Wilbur & Sirotkin, 1992; Yang & Wilbur, 1996).
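The basic lexical-analysis and stoplist steps described above can be sketched as follows; the stoplist is a tiny illustrative sample, not an actual system stoplist:

```python
# Sketch of lexical analysis followed by stoplist filtering.
# The stoplist here is a small invented sample for illustration.

import re

STOPLIST = {"the", "of", "a", "in", "is", "and"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop numbers."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Discard tokens that appear in the stoplist."""
    return [t for t in tokens if t not in stoplist]

tokens = remove_stopwords(tokenize("The court heard 3 witnesses in the case"))
```

The number "3" disappears in tokenization, and the high-frequency function words are removed by the stoplist, leaving only candidate index terms.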
STEMMING
However, such a table becomes large when it must cover the terms of the standard language and possibly the terms of the specialized subject domain of the text corpus. In the successor variety approach, for each possible initial letter sequence of a word, the number of distinct letters that follow it in the corpus is computed.
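The successor-letter counting just described can be sketched as follows; the corpus is invented, and a sharp peak in the counts suggests a morpheme boundary (a candidate stemming point):

```python
# Successor-variety counts for the letter prefixes of a word, computed
# against a small invented corpus.

def successor_variety(word, corpus):
    """For each prefix of `word`, count the distinct letters that follow
    that prefix across all words in the corpus."""
    counts = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        counts.append((prefix, len(successors)))
    return counts

corpus = ["reader", "reads", "reading", "real", "red"]
sv = successor_variety("reader", corpus)
```

For this toy corpus the count peaks at the prefix "read" (followed by "e", "s" and "i"), correctly suggesting a stem boundary after "read".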
THE SELECTION OF PHRASES
- Statistical Phrases
- Syntactic Phrases
- Normalization of Phrases
- Recognition of Proper Names
The method has two steps: identification of the classes (parts of speech) of the words of the text, and recognition of combinations of word classes in the text. The parse tree indicates dependencies between the phrase components of the sentence (e.g., the head and modifier of a phrase).
INDEX TERM WEIGHTING
- The General Process
- Classical Weighting Functions
The term frequency (tf) measures how often an index term occurs in the document text (Salton & Buckley, 1988): tfi = the frequency of occurrence of index term i in the text. The weight of a phrase component is usually calculated as the product of the term frequency (tf, equation (2)) and the inverse document frequency (idf, equation (5)) of the individual word.
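The classical tf x idf product can be sketched on a toy collection; here idf is taken as log(N / n_i), one common variant, where N is the number of documents and n_i the number of documents containing term i, and the documents are invented:

```python
# Hedged sketch of tf x idf weighting on an invented toy collection.

import math

docs = [
    ["court", "crime", "court"],
    ["crime", "witness"],
    ["appeal", "court"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                # term frequency in this text
    n_i = sum(term in d for d in docs)  # number of documents with the term
    idf = math.log(len(docs) / n_i)     # inverse document frequency
    return tf * idf

w = tf_idf("court", docs[0], docs)
```

A term that occurs often in one text but in few texts of the collection receives a high weight, which is exactly the discriminating behavior the weighting aims for.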
ALTERNATIVE PROCEDURES FOR SELECTING INDEX TERMS
- The Multiple Poisson (nP) Model of Word Distribution
- The Role of Discourse Structure
But the mean of this Poisson process depends on the degree of topic coverage associated with the topic term. Thus, the distribution of a certain text term i in texts within the entire collection is governed by the sum of Poisson distributions, one for each class of topic coverage.
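The mixture of Poisson distributions described above can be sketched numerically; the mixing weights and class means below are invented for illustration, not estimated values:

```python
# Sketch of the multiple (nP) Poisson model: the frequency of a term
# across texts follows a sum of Poisson distributions, one per class
# of topic coverage, weighted by the class probabilities.

import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def mixture_pmf(k, weights, means):
    """P(tf = k) under a weighted sum of Poisson distributions."""
    return sum(w * poisson_pmf(k, lam) for w, lam in zip(weights, means))

# Two invented classes of topic coverage: non-topical texts (low mean)
# and topical texts (high mean).
p0 = mixture_pmf(0, [0.7, 0.3], [0.1, 3.0])
```

The low-mean component models texts where the term appears only incidentally; the high-mean component models texts that actually treat the topic, which is what makes the term's distribution informative for indexing.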
SELECTION OF NATURAL LANGUAGE INDEX TERMS: ACCOMPLISHMENTS AND PROBLEMS
There is also a lot of research on the structural decomposition of texts according to different topics (Salton & Buckley, 1991; Hearst & Plaunt, 1993; Salton, Allan, Buckley, & Singhal, 1994; Salton, Singhal, Mitra, & Buckley, 1997), which can be useful for identifying important thematic terms in texts.
CONCLUSIONS
The following chapters of this section describe alternative indexing and abstracting techniques that alleviate these problems. Length normalization can be part of query and document matching, when similarity functions include a length normalization factor (e.g., dividing by the product of the Euclidean lengths of the vectors being compared in the cosine function) (Jones & Furnas, 1987).
THE ASSIGNMENT OF CONTROLLED LANGUAGE INDEX TERMS
THESAURUS TERMS
- The Function of Thesaurus Terms
- Thesaurus Construction and Maintenance
- Statistical methods
- Syntactic methods
The context of a word ranges from the local context (e.g., words in the same sentence or in surrounding sentences) and the whole text in which the word appears, to the whole corpus (e.g., for distinguishing the meanings of words in short texts). This description can be used for disambiguation, for example by finding occurrences of words from the description in a document (Lesk, 1986, cited in Krovetz & Croft, 1992).
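The Lesk-style overlap idea, choosing the sense whose description shares the most words with the word's context, can be sketched as follows; the senses and their descriptions are invented for illustration:

```python
# Minimal sketch of Lesk-style disambiguation: pick the sense whose
# dictionary description overlaps most with the context words.
# The sense inventory below is invented for illustration.

def lesk(context_words, senses):
    """Return the sense whose description shares the most words with the context."""
    context = set(context_words)
    return max(senses, key=lambda s: len(context & senses[s]))

senses = {
    "bank/finance": {"money", "deposit", "loan", "institution"},
    "bank/river": {"river", "slope", "water", "edge"},
}
best = lesk(["she", "made", "a", "deposit", "of", "money"], senses)
```

The financial sense wins because two of its description words ("deposit", "money") occur in the context, while the river sense shares none.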
SUBJECT AND CLASSIFICATION CODES
- Text Categorization
- Text Classifiers with Manually Implemented Classification Patterns
- Text Classifiers that Learn Classification Patterns
A text classifier learns from a set of positive examples of the text class (texts that are relevant to the class) and possibly from a set of negative examples of the class (texts that are not relevant to the class). In the parametric training methods, the parameters are estimated from the training set by making an assumption about the mathematical functional form of the underlying population density distribution, such as a normal distribution.
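The parametric idea, assuming a functional form such as a normal distribution and estimating its parameters from the training examples, can be sketched as follows; the feature values are invented for illustration:

```python
# Sketch of parametric training: assume a normal form for a feature
# within a class and estimate its mean and variance from the positive
# training examples (invented values below).

import math

def fit_normal(values):
    """Estimate the mean and variance of a class-conditional normal density."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def normal_density(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

mean, var = fit_normal([0.9, 1.1, 1.0, 1.2, 0.8])
```

Once the parameters are estimated, the fitted density can score how typical a new feature value is for the class.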
LEARNING APPROACHES TO TEXT CATEGORIZATION
- Feature Selection and Extraction
- Feature selection
- Feature extraction
- Feature selection in text categorization
- Feature extraction in text categorization
- A note about cross validation
- Training with Statistical Methods
- Discrimination techniques
- An illustration: The Rocchio algorithm
- An illustration: The Widrow-Hoff algorithm
- Bayesian independence classifiers
- Learning of Rules and Trees
- Training with Neural Networks
The weight update rule minimizes the error by following its gradient. The term 2(w.x - y)x is the gradient (with respect to w) of the squared error (w.x - y)^2. The high connectivity of the network (i.e., the fact that there are many terms in the sum) means that errors in individual terms will have little overall effect.
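The Widrow-Hoff (LMS) update just described can be sketched on a single invented training pair; the learning rate and data are illustrative:

```python
# Sketch of the Widrow-Hoff (LMS) rule: for each training pair (x, y)
# the weight vector moves against the gradient 2(w.x - y)x of the
# squared error. Data and learning rate are invented for illustration.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def widrow_hoff_step(w, x, y, eta=0.1):
    """One LMS update: w <- w - eta * 2 * (w.x - y) * x."""
    err = dot(w, x) - y
    return [wi - eta * 2 * err * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(100):
    w = widrow_hoff_step(w, [1.0, 2.0], 1.0)
```

After training, w.x is driven toward the target y, since each step shrinks the residual error (w.x - y).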
ASSIGNMENT OF CONTROLLED LANGUAGE INDEX TERMS: ACCOMPLISHMENTS AND PROBLEMS
More specifically, the deviation of each unit's output from its correct value for the case is propagated back through the network; all relevant connection weights and unit biases are adjusted to bring the actual output closer to the target. Feature selection and extraction that rely on prior knowledge about the texts are recognized to be important in text classification.
CONCLUSIONS
In particular, it is difficult to automatically define the kind of relationship that holds between terms. Second, trainable text classifiers are often confronted with few positive examples of patterns to be learned.
AUTOMATIC ABSTRACTING
THE CREATION OF TEXT SUMMARIES
THE TEXT ANALYSIS STEP
- Deeper Processing
- The knowledge
- The parsing techniques
- The original models
- Other applications
- The significance of discourse structures
- Statistical Processing
- Identification of the topics of a text
- Learning the importance of summarization parameters
Discourse patterns, including the distribution and linguistic signaling of
A complete analysis processes every word in the text and allows the word to contribute to the representation of the meaning of the text. Each sentence is scored by the number of links (common content terms or concepts) with the other sentences of the text.
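The sentence-linking heuristic described above, scoring each sentence by the content terms it shares with the other sentences, can be sketched as follows; the sentences and the tiny stoplist are invented, and the tokenization is deliberately naive:

```python
# Sketch of sentence scoring by links: each sentence is scored by the
# number of content terms it shares with the other sentences of the text.
# Sentences and stoplist are invented for illustration.

STOP = {"the", "a", "of", "in", "was", "is"}

def content_terms(sentence):
    return {w for w in sentence.lower().split() if w not in STOP}

def score_sentences(sentences):
    """Count, for each sentence, its shared-term links with all other sentences."""
    term_sets = [content_terms(s) for s in sentences]
    scores = []
    for i, terms in enumerate(term_sets):
        links = sum(len(terms & other)
                    for j, other in enumerate(term_sets) if j != i)
        scores.append(links)
    return scores

sents = [
    "The court examined the evidence",
    "The evidence was presented in court",
    "The weather was mild",
]
scores = score_sentences(sents)
```

The off-topic third sentence receives the lowest score, so a summarizer selecting the highest-linked sentences would omit it.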
THE TRANSFORMATION STEP
- Selection and Generalization of the Content
- Selection and Generalization of the Content of Multiple Texts
Selection and generalization of information also controls the length of the summary, i.e. the degree of compression of the original text. The length of a summary in relation to the length of the original text can vary (see Chapter 3, p. 60).
GENERATION OF THE ABSTRACT
It is important that the summary be a concise description of the content of the original document without losing clarity. Second, it is sometimes important that the summary follow the wording of the original text as closely as possible (Endres-Niggemeyer, 1989).
TEXT ABSTRACTING: ACCOMPLISHMENTS AND PROBLEMS
On the other hand, very short sentences often need to be completed with an adjacent sentence to increase the clarity of the summary. In doing so, it is very important to start from good representations of the original texts.
APPLICATIONS
TEXT STRUCTURING AND CATEGORIZATION WHEN SUMMARIZING LEGAL CASES
- TEXT CORPUS AND OUTPUT OF THE SYSTEM
- METHODS: THE USE OF A TEXT GRAMMAR
- Knowledge Representation
- Parsing and Tagging of the Text
- RESULTS AND DISCUSSION
- CONTRIBUTIONS OF THE RESEARCH
Segments of the same hierarchical level may, but not necessarily, follow each other in the text. Complex patterns (combinations in propositional logic of simple patterns) classify the texts of the criminal cases.
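The combination of simple patterns with propositional logic can be sketched as follows; the patterns and category below are invented for illustration and are not the rules of the actual system:

```python
# Sketch of classification with complex patterns: simple word patterns
# are combined with propositional connectives (AND, OR) to assign a
# category. The patterns below are invented for illustration.

def has(word):
    """Simple pattern: the text contains the given word."""
    return lambda text: word in text.lower().split()

def AND(*ps):
    return lambda text: all(p(text) for p in ps)

def OR(*ps):
    return lambda text: any(p(text) for p in ps)

# Hypothetical "theft" category: mentions taking AND (goods OR money).
theft_pattern = AND(has("took"), OR(has("goods"), has("money")))

is_theft = theft_pattern("The accused took the money from the register")
```

The complex pattern fires only when its propositional combination of simple patterns is satisfied, which is how the case texts are assigned to categories.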
CLUSTERING OF PARAGRAPHS WHEN SUMMARIZING LEGAL CASES
METHODS: THE CLUSTERING TECHNIQUES
We implemented the k-medoid method to group the paragraphs of the court's opinion according to topic (Figure 2). A brief example of eliminating redundant paragraphs in the alleged crimes is given (translated from Dutch).
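The k-medoid idea can be sketched compactly on one-dimensional toy data; medoids are actual data points, each point is assigned to its nearest medoid, and a medoid is replaced by the cluster member that minimizes the total within-cluster distance. The data and distance function are invented for illustration:

```python
# Compact sketch of k-medoid clustering on invented 1-D data.

def assign(points, medoids, dist):
    """Assign each point to its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters

def k_medoid(points, medoids, dist, iters=10):
    for _ in range(iters):
        clusters = assign(points, medoids, dist)
        # Replace each medoid by the member minimizing total distance.
        medoids = [min(members, key=lambda c: sum(dist(c, p) for p in members))
                   for members in clusters.values() if members]
    return assign(points, medoids, dist)

dist = lambda a, b: abs(a - b)
clusters = k_medoid([1, 2, 3, 10, 11, 12], [1, 10], dist)
```

Because medoids are real data points rather than averaged centroids, the method also works when only pairwise (dis)similarities between paragraphs are available, which suits text clustering.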
RESULTS AND DISCUSSION
The expert then linked each paragraph of the alleged offenses to the exact topic of the crime. This procedure was repeated for the opinion of the court part of the case text.