TEXT REPRESENTATIONS AND THEIR USE
5. USE OF THE TEXT REPRESENTATIONS
5.2 Information Retrieval Systems
A typical information retrieval (IR) system selects documents from a collection in response to a user’s query, and ranks these documents according to their relevance to the query (Salton, 1989, p. 229 ff.). This is usually accomplished by matching a text representation with a representation of the query. It was Luhn (1957) who suggested this procedure.
A search request or query, which is a formal representation of a user's information need as submitted to a retrieval system, usually consists either of a single term from the indexing vocabulary or of some logically or numerically weighted combination of such terms. When the search request is originally formulated in natural language, a formal representation can be derived by applying simple indexing techniques or by analyzing the request with natural language processing techniques.
The abstract representations of both document text and query make an effective comparison possible. The texts whose representations best match the request representation are retrieved. Commonly, a list of possibly relevant texts is returned. In information retrieval, a rather static document collection is queried by a large variety of volatile queries. A variant form of information retrieval is routing or filtering (Belkin & Croft, 1992). Here, the information needs are long-lived, with queries applied to a collection that rapidly changes over time. Filtering is usually based on descriptions of the information preferences of an individual or a group of users, which are called “users’ profiles”.
Retrieval is based on representations of textual content and information needs, and their matching. There are a number of retrieval models that are defined by the form used in representing document text and request and by the matching procedure. Both the text and information-need representations are uncertain, and additionally they do not always match exactly. Querying an information retrieval system is not like querying a classical database: the matching is not deterministic. Retrieval models often incorporate this element of uncertainty. Moreover, retrieval models generally rank the retrieved documents according to their potential relevance to the query. This is why they are sometimes called ranking models (Harman, 1992a). Because retrieval models were developed in an environment where documents were manually indexed with a set of terms, many models rely upon this form of text descriptors. In the following, we give an overview of the most common retrieval models.
The Boolean model
In the oldest model, the Boolean retrieval model (Salton, 1989, p. 235 ff.;
Smeaton, 1986), a query has the form of an expression containing index terms and Boolean operators (e.g., “and”, “or”, “not”) defined upon the terms. The retrieval model compares the Boolean query statement with the term sets used to identify document content. A document whose index terms satisfy the query is returned as relevant. This retrieval model is still employed in many commercial systems. It is a powerful retrieval model when the users of the retrieval system are trained in designing Boolean queries. In the pure Boolean model, no ranking of the documents according to relevance is provided. Variants of the model provide ranking based upon partial fulfillment of the query expression.
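As an illustration, Boolean matching against index-term sets can be sketched as follows; the documents, terms, and query are hypothetical:

```python
# A minimal sketch of Boolean retrieval: documents are pre-indexed as
# sets of terms, and a query is an expression of Boolean operators
# defined upon terms (the collection here is illustrative).

documents = {
    "d1": {"retrieval", "boolean", "query"},
    "d2": {"vector", "space", "retrieval"},
    "d3": {"boolean", "logic"},
}

# Each operator builds a predicate over a document's index-term set.
def term(t):
    return lambda terms: t in terms

def AND(p, q):
    return lambda terms: p(terms) and q(terms)

def OR(p, q):
    return lambda terms: p(terms) or q(terms)

def NOT(p):
    return lambda terms: not p(terms)

def search(query, docs):
    """Return identifiers of documents whose index-term sets satisfy the query."""
    return sorted(d for d, terms in docs.items() if query(terms))

# Query: ("retrieval" AND "boolean") OR "logic"
query = OR(AND(term("retrieval"), term("boolean")), term("logic"))
print(search(query, documents))  # ['d1', 'd3']
```

Note that the result is an unranked set: all documents satisfying the expression are returned with equal status, which is exactly the limitation the ranking variants address.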
The vector space model
In the vector space retrieval model (Salton, 1989, p. 313 ff.; Wang, Wong, & Yao, 1992), documents and queries are represented as vectors in a vector space with the relevance of a document to a query computed as a distance measure. Both query and documents are represented as term vectors of the form:
D_m = (a_m1, a_m2, . . ., a_mn)
Q_k = (q_k1, q_k2, . . ., q_kn)
where the coefficients a_mi and q_ki represent the values of index term i in document D_m and query Q_k respectively. Typically, a_mi (or q_ki) is set equal to 1 when term i appears in document D_m or query Q_k respectively, and to 0 when the term is absent (vectors with binary terms). Alternatively, the vector coefficients could take on numeric values indicating the weight or importance of the index terms (vectors with weighted terms). As a result, a document text and a query are represented in an n-dimensional vector space (with n = the number of distinct terms in the index term set of the collection).
Comparing document and query vector is done by computing the similarity between them (Jones & Furnas, 1987). The most common similarity functions are the cosine function, which computes the cosine of the angle between two term vectors, and the inner product, which computes the scalar product between the term vectors. The result of the comparison is a ranking of the documents according to their similarity with the query.
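As an illustration, ranking with the cosine function over binary term vectors can be sketched as follows; the vocabulary, documents, and query are hypothetical:

```python
# A minimal sketch of vector-space matching, assuming binary term
# vectors over a fixed four-term vocabulary (data is illustrative).
import math

def cosine(d, q):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def inner_product(d, q):
    """Scalar product between two term vectors."""
    return sum(a * b for a, b in zip(d, q))

docs = {"d1": [1, 1, 0, 0], "d2": [0, 1, 1, 1], "d3": [1, 0, 0, 1]}
query = [1, 1, 1, 0]

# Rank documents by decreasing cosine similarity with the query.
ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Unlike the pure Boolean model, the result is a graded ranking: d1 shares two of its two terms with the query, whereas d3 shares only one of its two, so d1 is ranked first.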
The vector model is very popular and successful in research settings and commercial systems because of the simplicity of the representation, its applicability to unrestricted subject domains and different text types, and its simple comparison operations. It has been criticized because it does not accurately represent queries and documents (Raghavan & Wong, 1986). It adopts the simplifying assumption that terms are uncorrelated and that term vectors are pairwise orthogonal. However, many useful and interesting retrieval results have been obtained despite these simplifying assumptions.
The probabilistic model
The probabilistic retrieval model (Fuhr, 1992) views retrieval as a problem of estimating the probability that a document representation matches or satisfies a query. The term “probabilistic retrieval model” is generally used to refer to retrieval models that produce the probability that a document is relevant for the query and rank documents according to these probabilities (“Probability Ranking Principle”) (Robertson, 1977; Croft &
Turtle, 1992). In this view, many retrieval models can be seen as probabilistic. Often, the term specifically refers to retrieval models that learn the weight of query terms from the documents that are judged relevant or non-relevant for the query and that contain or do not contain the terms. The earliest probabilistic models that learn the weight or probability of a query term from a training corpus are described by Maron and Kuhns (1960) and Robertson and Sparck Jones (1976). The current models use more refined statistical techniques, such as 2-Poisson distributions (Robertson & Walker, 1994) and logistic regression (Gey, 1994) for estimating this probability.
When estimating the probability of the relevance of a document to a query, term independence is assumed.
Probabilistic models are in use in some commercial systems and are being actively researched.
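The learning of a query-term weight from relevance judgements can be sketched as follows, in the spirit of Robertson and Sparck Jones (1976); the 0.5 smoothing corrections and the counts are illustrative assumptions:

```python
# A minimal sketch of a relevance weight for a query term, learned
# from documents judged relevant or non-relevant (the collection
# statistics below are hypothetical).
import math

def rsj_weight(N, R, n, r):
    """Relevance weight of a term.
    N: documents in the collection, R: documents judged relevant,
    n: documents containing the term, r: relevant documents containing it.
    The 0.5 terms smooth the estimate for small counts."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term occurring mostly in the relevant documents receives a high
# positive weight; a term occurring mostly elsewhere, a negative one.
print(rsj_weight(N=100, R=10, n=12, r=8))
print(rsj_weight(N=100, R=10, n=50, r=1))
```

Note the term-independence assumption mentioned above: each query term is weighted on its own, and document scores are then typically formed by summing the weights of the matching terms.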
The next two models infer the relevance of a document from the query.
The inference relies upon knowledge that reflects the properties of the subject domain, upon linguistic knowledge, and/or upon knowledge of the supposed retrieval strategies of a user. This knowledge contributes to building semantically rich representations of the content of document and query. It is assumed that these semantic representations help in identifying meaningful documents for the user. The inference strategy in the two models is different: in the network model, inference is based upon the combination of evidence as it is propagated in a network; in the logic-based model, logical rules are used to deduce the relevance of a document for a query. Both models provide the possibility of reasoning with uncertainty. Their major bottleneck is acquiring and implementing the knowledge bases.
The network model
In the network retrieval model (Croft & Turtle, 1992; Turtle & Croft, 1992) document and query content are represented as networks. Estimating the relevance of a document is accomplished by linking the query and document networks, and by inferring the relevancy of the document for the query. The model is also well suited to reason with uncertain information:
Bayesian networks are used for probabilistic representation of the content of documents and query and for probabilistic inference (Del Favero & Fung,
1994; Fung & Del Favero, 1995).
Networks are very well suited to represent the structure and content of documents and queries. The networks have the form of directed acyclic graphs (DAGs). The inference network model is popular in information retrieval. In typical cases, the nodes of the document network represent identifiers, concepts, or index terms. Each document typically has a text node, which corresponds to a specific text representation and which is composed of the components that make up the representation. A document can have multiple text nodes that are generated with different indexing techniques. Intermediate levels in the representation are possible (e.g., concepts and their referring index terms in the texts). The relationships between nodes in a network may be probabilistic or weighted. Each set of arcs into a node represents a (probabilistic) dependence between the node and its parents (the nodes at the other ends of the incoming arcs). Often, a document network is built once for the complete document collection. A
similar representation is generated for the query. The two networks are connected by their common concepts and form the inference or causal network.
Retrieval is a process of inference on the network. Bayesian inference applied to multiple sources of uncertain evidence is especially attractive in an information retrieval context. Retrieval is then a process of combining uncertain evidence from the network and inferring a belief that a document is relevant. This belief is computed by propagating the probabilities from a document node to the query node. Documents are ranked according to this belief of relevance.
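The propagation of belief from a document's term nodes to a query node can be sketched as follows; a weighted-sum combination is one of the simpler link-matrix forms, and the nodes, weights, and probabilities here are illustrative assumptions rather than the full Bayesian-network machinery:

```python
# A minimal sketch of belief propagation in an inference network with
# a single document node, term nodes, and a query node (all values
# are hypothetical).

# Belief that each term correctly represents the observed document,
# i.e., the evidence propagated from the document node to term nodes.
term_beliefs = {"retrieval": 0.8, "model": 0.6, "network": 0.9}

# The query node combines the evidence of its parent term nodes;
# here, with a normalized weighted sum as the link matrix.
query_weights = {"retrieval": 0.5, "network": 0.5}

def query_belief(term_beliefs, query_weights):
    """Propagate term beliefs to the query node as a weighted sum."""
    total = sum(query_weights.values())
    return sum(query_weights[t] * term_beliefs.get(t, 0.0)
               for t in query_weights) / total

# The resulting belief would be computed per document, and the
# documents ranked by it.
print(query_belief(term_beliefs, query_weights))
```

In a full system this computation is repeated with each document node instantiated in turn, and the documents are ranked by the resulting belief at the query node.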
The logic-based model
The logic-based retrieval model (van Rijsbergen, 1986; Chiaramella &
Chevallet, 1992; Lalmas, 1998) assumes that queries and documents can be represented by logical formulas. Retrieval then consists of inferring the relevance of a document for a query. The Boolean model described above is logic-based, but a typical logic-based model will use the information in query and document in combination with domain knowledge, linguistic knowledge, and knowledge of users’ interests and strategies from a coded knowledge base. This knowledge is used by the matching function as part of proving that the document implies the query.
The relevance of a document to a query is deduced by applying inference rules. In the logical model, the relevance of a document for a query is defined as follows:
Given a query Q and a document D, D is relevant to Q if D logically implies Q (D -> Q). Boolean logic is too restricted for this task: it cannot deal with temporal and spatial relationships, and especially not with contradictory or uncertain information. To cope with uncertainty, a logic for probabilistic inference is introduced with the notion of uncertain implication: D logically implies Q with certainty P (P(D -> Q)). The evaluation of the uncertainty function P is related to the amount of semantic information needed to prove that D -> Q. Ranking according to relevance then depends upon the number of transformations necessary to obtain the matching and the credibility of those transformations. To represent uncertain implications and reason with them, modal logic is sometimes used (Nie, 1989; van Rijsbergen, 1989; Chiaramella & Nie, 1990; Nie, 1992). For instance, when a matching between query and text representation is not successful, the text representation is transformed so as to satisfy other possible interpretations (cf. the possible worlds of modal logic) that might match the query.
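The idea of an uncertain implication whose certainty decreases with the transformations needed to complete the proof can be sketched as follows; the rewrite rules and their credibility factors are hypothetical:

```python
# A minimal sketch of P(D -> Q): a query term not directly covered by
# the document may be proved via a thesaurus-like transformation with
# a credibility factor, and the certainty of the implication shrinks
# with each transformation applied (rules and data are illustrative).

# Each rule rewrites a query term into a document term, with a credibility.
rules = {("automobile", "car"): 0.9, ("vessel", "ship"): 0.8}

def implication_certainty(doc_terms, query_terms, rules):
    """Certainty that the document logically implies the query."""
    certainty = 1.0
    for q in query_terms:
        if q in doc_terms:
            continue  # directly provable; no transformation needed
        for (src, dst), credibility in rules.items():
            if q == src and dst in doc_terms:
                certainty *= credibility  # provable via an uncertain step
                break
        else:
            return 0.0  # no proof of D -> Q exists
    return certainty

doc = {"car", "engine"}
print(implication_certainty(doc, {"automobile", "engine"}, rules))  # 0.9
```

Documents would then be ranked by this certainty, mirroring the principle that relevance depends on the number and credibility of the transformations required.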
In a multi-media environment, logic-based retrieval has the advantage to easily integrate text representations with other forms of document representations (e.g., logical structure, content of images) (cf. Bruza & van der Weide, 1992; Chiaramella & Kheirbek, 1996; Fuhr, Gövert, & Rolleke,
1998).
The cluster model
In the cluster retrieval model a query is ranked against groups of documents (van Rijsbergen, 1979, p. 45 ff.; Griffiths, Luckhurst, & Willett, 1986; Salton, 1989, p. 341 ff.; Hearst & Pedersen, 1996). The general assumption is that mutually similar documents will tend to be relevant to the same queries; hence, automatically determining groups of such documents increases the efficiency of the search for relevant documents and can improve the recall of the retrieval. Similar documents are grouped in a cluster. For each cluster, a representation is made (e.g., the average vector (centroid) of the cluster) against which a query is matched. Upon matching, the query retrieves all the documents of the cluster. Typically a fixed text corpus is clustered either into an exhaustive partition, disjoint or otherwise, or into a hierarchical tree structure. In the case of a partition, queries are matched against clusters and the contents of the best-scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest-scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result. Hybrid strategies are also available. Documents and query are commonly represented as term vectors. The similarity between pairs of vectors is computed with similarity functions (see above). Different algorithms for clustering the term vectors of documents are available (for an overview, see Willett, 1988).
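Retrieval over a partition of clusters can be sketched as follows; the clusters and term vectors are illustrative (a real system would derive the partition automatically with a clustering algorithm):

```python
# A minimal sketch of cluster-based retrieval over a partition: each
# cluster is represented by its centroid vector, the query is matched
# against the centroids, and the documents of the best-matching
# cluster are returned (the partition here is hand-made).
import math

clusters = {
    "c1": {"d1": [1, 1, 0], "d2": [1, 0, 0]},
    "c2": {"d3": [0, 1, 1], "d4": [0, 0, 1]},
}

def centroid(vectors):
    """Average vector of a cluster's documents."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_search(query, clusters):
    """Match the query against centroids; return the best cluster's documents."""
    best = max(clusters,
               key=lambda c: cosine(centroid(list(clusters[c].values())), query))
    return sorted(clusters[best])

print(cluster_search([0, 1, 1], clusters))  # ['d3', 'd4']
```

The efficiency gain comes from comparing the query against a few centroids instead of every document; the retrieved documents are those of the winning cluster, even ones that the query would not have matched directly.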