7.
In this section we explain a number of major strategies that have been implemented, and are still being developed, to relieve the information retrieval problem outlined above.
7.1 Full-Text Search and Retrieval
The basic concept of full-text search and retrieval is storing the full text of all documents in the collection so that every word in the text is searchable and can function as a key for retrieval. Then, when a person wants information from the stored collection, the computer is instructed to search for all documents containing certain user-specified words or word combinations. This approach contrasts with searching collections that have fixed descriptors attached to the document texts.
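The inverted-index technique that underlies full-text search can be sketched as follows. This is a minimal illustration, not a production system: every word of every stored document becomes a retrieval key, and a query retrieves the documents containing all the user-specified words. All document texts and function names here are invented for the example.

```python
from collections import defaultdict

# Build an inverted index: every word in every document becomes a
# retrieval key, with no fixed descriptors assigned in advance.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Retrieve documents containing ALL user-specified words (Boolean AND).
def search(index, query):
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {
    1: "full text retrieval stores every word of the document",
    2: "fixed descriptors are assigned by human indexers",
    3: "every word of the full text is a searchable key",
}
index = build_index(docs)
print(search(index, "full text"))   # documents 1 and 3
```

Note how document 2 is not retrieved for the query "full text" even though it is about indexing: only the literal occurrence of the query words matters, which anticipates both the strengths and the shortcomings discussed below.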
The original idea (Swanson, 1960) was positively tested by Salton (1970), and since then implementations of full-text retrieval have gained increasing success. Today, the full-text segment is still a growing section of the commercial computerized database market (Sievert, 1996).
Full-text search is attractive for many reasons and has some definite advantages.
1. Full-text search is attractive from the commercial point of view (Blair &
Maron, 1985). Digital technology provides cheap storage for full text and supplies fast computational technology that makes searching of full text efficient. It is also very convenient to search different text types in large document collections just by searching individual words. Additionally, as it employs a simple form of automatic indexing, it avoids the need for human indexers, whose employment is increasingly costly and whose work often appears inconsistent and less than fully effective.
2. Full-text search is a first attempt to transfer indexing from a primarily a priori process to a process determined by specific information needs and other situational factors (Tenopir, 1985; Salton, 1986). Fixed text descriptors severely hamper the accessibility of the texts. Sometimes documents are not retrievable through their assigned descriptors, because their information value to the users is peripheral to their main focus.
Indexing of concepts and terms in a full-text search is situation dependent and would be performed according to the requirements of each incoming request.
3. Inexperienced users found that searching with natural language terms in the full-text was easier than searching with fixed text descriptors (Tenopir, 1985).
Still, full-text search is not a magical formula and it suffers from shortcomings.
1. While recall is generally enhanced compared to the use of fixed text descriptors (Tenopir, 1985; McKinin, Sievert, Johnson, & Mitchell, 1991), when searching large document collections, precision may suffer intolerably and users might be swamped with irrelevant material (Blair &
Maron, 1985; Blair & Maron, 1990). The occurrence of a word or word combination is no guarantee for relevance. As databases grow, this “too many hits” problem will only be exacerbated. This is currently the case with full-text searches on the Internet.
2. Also recall may suffer. A survey by Croft, Krovetz, and Turtle (1990) indicates that users often query documents in terms that they are familiar with, and these terms are frequently not the terms used in the document itself. This shortcoming is still more prominent, when combinations of search terms are used that need to occur together in documents (Blair &
Maron, 1985). If the occurrences of these terms in a relevant document are independent events, the probability of finding documents that contain the exact term combination decreases as the number of search terms in the combination increases.
In the past years, research on full-text retrieval has increased dramatically because of the yearly TREC (Text REtrieval Conference) conferences sponsored by the NIST (National Institute of Standards and Technology, USA). The TREC conferences reflect the need for a more refined automatic indexing of the content of texts as an answer to the shortcomings of current full-text search (see Harman, 1993, 1994, 1995, 1996; Voorhees & Harman,
1997, 1998, 1999).
7.2 Relevance Feedback
An important and difficult operation in information retrieval is generating useful query statements that can extract all the relevant documents wanted by the users and reject the remainder. Because an ideal query representation cannot be generated without knowing a great deal about the composition of the document collection, it is customary to conduct searches iteratively, first operating with a tentative query formulation, and then improving
formulations for subsequent searches based on the evaluations of previously retrieved materials. One method for automatically generating improved query formulations is the well-known relevance feedback process.
Methods using relevance information have been studied for decades and are still being investigated. Rocchio (1971) was the first to experiment with query modification, with positive results. Ide (1971) extended Rocchio's work.
Salton and Buckley (1990) compared this work across different test collections. Relevance feedback is extensively studied in the Text REtrieval Conferences (TREC).
The main assumption behind relevance feedback is that documents relevant to a particular query resemble each other. This implies that, when a retrieved document has been identified as relevant to a given query, the query formulation can be improved by increasing its similarity to such a previously retrieved relevant item. The reformulated query is expected to retrieve additional relevant items that are similar to the originally identified relevant item. Analogously, by reformulating the query, its similarity with retrieved non-relevant documents can be decreased.
So, a better query is learned by judging retrieved documents as relevant or non-relevant. The original query can be altered in two substantial ways (Salton, 1989, p. 307). First, index terms present in previously retrieved documents that have been identified as relevant to the user's query are added to the original query formulation. Second, the occurrence characteristics of the terms in the previously retrieved relevant and non-relevant documents of the collection allow altering the weights of the original query terms. The weight or importance of query terms occurring in relevant documents is increased. Analogously, terms included in previously retrieved non-relevant documents can be de-emphasized. Both approaches have yielded improved retrieval results (Salton & Buckley, 1990; Harman, 1992b). Experiments indicate that performing multiple iterations of feedback until the user is completely satisfied with the results is highly desirable.
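The two query-alteration steps described above can be sketched in a Rocchio-style form: terms from judged-relevant documents are added with increased weight, and terms from judged non-relevant documents are de-emphasized. The weight parameters and toy vectors below are illustrative choices, not the values used in the original experiments.

```python
def feedback(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Produce a modified query vector (term -> weight): boost terms
    occurring in relevant documents, de-emphasize terms occurring in
    non-relevant ones (a Rocchio-style reformulation)."""
    new_query = {t: alpha * w for t, w in query.items()}
    for doc in relevant:
        for t, w in doc.items():
            new_query[t] = new_query.get(t, 0.0) + beta * w / len(relevant)
    for doc in non_relevant:
        for t, w in doc.items():
            new_query[t] = new_query.get(t, 0.0) - gamma * w / len(non_relevant)
    # Terms whose weight drops to zero or below are removed from the query.
    return {t: w for t, w in new_query.items() if w > 0}

query = {"retrieval": 1.0}
relevant = [{"retrieval": 0.5, "indexing": 0.8}]
non_relevant = [{"retrieval": 0.2, "hardware": 0.9}]
# "indexing" enters the query from the relevant document;
# "hardware" is suppressed because it only occurs in the non-relevant one.
print(feedback(query, relevant, non_relevant))
```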
Relevance feedback is used both in ad-hoc interactive information retrieval and document filtering based on long-term information needs.
Although relevance feedback is considered effective in improving retrieval performance, there are still some obstacles. One should be selective about which terms to add to the query formulation (Harman, 1992b) and which query-term weights to alter (Buckley &
Salton, 1995). Moreover, current text collections often contain large documents that span several subject areas. It has been shown that trimming large documents by selecting a good passage when selecting index terms, has a positive impact on feedback effectiveness (Allan, 1995).
7.3 Information Agents
There are many definitions of the concept "agent" (we refer here to Bradshaw, 1997, p. 3 ff.). A crude definition is that an agent is software that through its embedded knowledge and/or learned experience can perform a task continuously and with a high degree of autonomy in a particular environment, often inhabited by other agents and processes (cf. Shoham, 1997). There is an emerging interest in the engagement of information agents (Croft, 1987; Standera, 1987, p. 217 ff.; Maes, 1994; Koller &
Shoham, 1996). An information agent supplies a user with relevant information that is for instance drawn from a collection of documents.
The main goal of employing an information agent in information selection and retrieval is to determine the user’s real need and to assist in satisfying this need. However, there is a growing interest in agents that identify or learn appropriate content attributes of texts.
1. A typical task in an information retrieval environment is filtering of information according to a profile of a user or a class of users (Allen, 1990). Such a profile is called a user's model. The agent knows the user's interests, goals, habits, preferences and/or background, or gradually becomes more effective as it learns this profile (Maes, 1994; Koller &
Shoham, 1996). The knowledge in the profile is intellectually acquired (from the user and experts), implemented, and maintained by knowledge engineers. Or, the knowledge is learned by the agent itself based on good positive (and negative) training examples. Learning a user's profile has multiple advantages, including the avoidance of costly implementation and maintenance, and easy adaptation to changing preferences. Learning of users' preferences is closely related to the technique of relevance feedback. Again, such an approach assumes the relevancy of documents that are similar to previously retrieved documents found relevant.
2. Information agents also perform other functions, which support the retrieval operation. They can provide the services of a thesaurus, such as providing synonyms to query terms or supplying broader or narrower terms for the query terms (Wellman, Durfee, & Birmingham, 1996; see chapter 5). An agent can also select the best search engine based upon knowledge of search techniques.
3. Research on information agents especially focuses upon the characterization and refinement of the information need. It is equally important to automatically identify or learn appropriate content attributes of texts (Maes, 1994). If we obtain a fine-grained and clear user request, a similarly fine-grained characterization of the content of a document is needed for an accurate comparison of information need and document.
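The filtering task of point 1 can be sketched as an agent whose user's model is a simple term-weight profile learned incrementally from documents the user has judged relevant, an approach closely related to relevance feedback as noted above. The class name, the scoring scheme, and the threshold are all illustrative assumptions, not a description of any cited system.

```python
from collections import Counter

class FilteringAgent:
    """Toy filtering agent: the user's model is a bag-of-words profile
    learned incrementally from documents the user judged relevant."""

    def __init__(self):
        self.profile = Counter()

    def learn(self, text):
        # Positive training example: fold its terms into the profile.
        self.profile.update(text.lower().split())

    def score(self, text):
        # Average profile weight of the document's terms: a crude
        # measure of similarity to previously judged relevant texts.
        words = text.lower().split()
        return sum(self.profile[w] for w in words) / max(len(words), 1)

    def accept(self, text, threshold=0.5):
        return self.score(text) >= threshold

agent = FilteringAgent()
agent.learn("information retrieval and automatic indexing")
agent.learn("automatic indexing of text documents")
print(agent.accept("automatic indexing of news text"))  # passes the filter
print(agent.accept("football match results"))           # rejected
```

As the user judges more documents, the profile adapts, which is exactly the advantage claimed for learned user models over intellectually engineered ones.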
7.4 Document Engineering
The technological shift to multimedia environments affects the coding and structure of electronic documents. Electronic documents are becoming more complex. They are endowed with attributes, which form a document description. Also, the linguistic text message in an electronic medium is structured and delivered distinctly from the print and paper medium (McArthur, 1987). Texts have stylistic attributes (e.g., the style and fonts used), extensional attributes (e.g., name of the author, date of creation), which are also called objective identifiers, and content attributes (e.g., key terms, links), which are called non-objective identifiers (cf. Salton, 1989, p. 276).
These attributes are recognizable by their mark-ups in the document.
Different standards for document description allow using the documents and their attributes independently of the hardware and the application software.
Examples of such standards are SGML (Standard Generalized Mark-up Language) and HTML (HyperText Markup Language). The use of such mark-ups greatly benefits the accessibility of the information contained in and attached to the documents.
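The distinction between objective identifiers and content attributes, and the claim that attributes are recognizable by their mark-ups, can be illustrated with Python's standard `html.parser` module. The particular meta-tag names and the sample document below are assumptions made for the example, not prescribed by the HTML standard.

```python
from html.parser import HTMLParser

class AttributeExtractor(HTMLParser):
    """Collect objective identifiers (author, date) and content
    attributes (key terms, links) from their mark-ups in an HTML text."""

    def __init__(self):
        super().__init__()
        self.objective = {}
        self.content = {"keywords": [], "links": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name, value = attrs.get("name"), attrs.get("content", "")
            if name in ("author", "date"):
                self.objective[name] = value          # objective identifier
            elif name == "keywords":
                self.content["keywords"] = [k.strip() for k in value.split(",")]
        elif tag == "a" and "href" in attrs:
            self.content["links"].append(attrs["href"])  # content attribute

doc = """<html><head>
<meta name="author" content="J. Doe">
<meta name="date" content="1999-01-01">
<meta name="keywords" content="indexing, abstracting">
</head><body><a href="related.html">related text</a></body></html>"""

extractor = AttributeExtractor()
extractor.feed(doc)
print(extractor.objective)  # {'author': 'J. Doe', 'date': '1999-01-01'}
print(extractor.content)
```

The objective identifiers are trivially machine-readable once marked up; the keywords and links, by contrast, must first be chosen by someone, which is precisely the costly and potentially inconsistent step discussed next.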
Despite the appeal and promise of such an approach, one must be aware of its limits, among which are the complexity and cost of assigning the mark-ups.
The creation of current and future electronic documents is sometimes compared with the creation of software (Walker, 1989). Hence, the term document engineering is in use. Creation of electronic documents is a complex task. As in the field of software engineering, there is a clear need for modularity, abstraction, and consistency. Objective identifiers, such as authors' names, publishers' names, and publication dates, in general pose no dispute about how to assign them. When mark-ups concern content attributes (e.g., key terms and hypertext links), one must be aware of the costly and sometimes subjective and inconsistent attribution of these attributes. The intellectual assignment of content mark-up is considered a form of manual indexing (Croft et al., 1990). Multiple studies indicate that manual indexing is inconsistent and subjective (Beghtol, 1986; Collantes, 1995).
“Interindexer consistency” exhibits a direct positive influence upon retrieval effectiveness (paper of Leonard cited in Ellis, Furner, & Willett, 1996). Yet, we do not have many studies about “interlinker consistency”. A study by Ellis, Furner-Hines, and Willett (1994) shows little similarity between the link-sets inserted by different persons in a set of full-text documents. These authors were not able to prove a positive relationship between inter-linker consistency and navigational effectiveness in hypertext systems (Ellis et al., 1996). The problem might be alleviated when the text writer acts as a document engineer and is responsible for the assignment of content attributes and links. In this way, the writer of the text defines possible text uses and navigation between texts (cf. Barrett, 1989; Frants, Shapiro, & Voiskunskii, 1997, p. 137). Moreover, document engineering is not always cost-effective, especially when dealing with heterogeneous material such as text content. Because document mark-ups make the information more accessible, time is gained when searching for information. However, extra time is needed to accurately assign the mark-ups.
Hence, the document engineer could use some extra automatic support for assigning content attributes to texts at the time of document creation (Alschuler, 1989; Wright & Lickorish, 1989; Brown, Foote, Jones, Sparck Jones, & Young, 1995). Especially for large active document collections, such as news texts, intended for a heterogeneous audience, this might be beneficial (Allen, 1990).