However, over the past decade, relentless optimization of information search performance has propelled web search engines to new levels of quality, where most people are satisfied most of the time, and web search has become the standard and often preferred source of information. Much of the scholarly research on information retrieval has occurred in these contexts, and much of subsequent information retrieval practice is concerned with providing access to unstructured information in various business and government domains. This work forms much of the foundation of our book.
Book organization and course development
Chapters 13-17 provide a treatment of different types of machine learning and numerical methods in information retrieval. In Chapter 16, we first give an overview of a number of important applications of clustering in information retrieval.
Prerequisites
Book layout
Acknowledgments
We thank them for their significant influence on the content and structure of the book. Parts of the initial drafts of Chapters 13–15 were based on slides that were generously provided by Ray Mooney.
Web and contact information
An example information retrieval problem
Let's stick with Shakespeare's Collected Works and use it to introduce the basics of the Boolean retrieval model. Each item in the list – which records the occurrence of a term in a document (and, later, often also the positions in the document) – is conventionally called a posting. The list is then called a postings list.
A first take at building an inverted index
This idea is central to the first key concept in information retrieval, the inverted index. The dictionary stores the terms and has a pointer to the postings list for each term.
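As an illustrative sketch (the function name and the toy documents are ours, not the book's), the dictionary-to-postings structure can be built in a few lines of Python, using naive whitespace tokenization:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of docIDs.

    `docs` maps docID -> text; lowercasing and whitespace splitting
    stand in for the tokenization and normalization steps of Chapter 2.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings lists are kept sorted by docID, as the intersection
    # algorithms later in the chapter require.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Brutus killed Caesar", 2: "Caesar ruled Rome", 3: "Brutus ruled"}
index = build_inverted_index(docs)
# index["brutus"] → [1, 3]; index["caesar"] → [1, 2]
```

The dictionary here is a Python dict; in a real system it would be a dedicated search structure (Chapter 3) with the postings lists stored separately.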
Processing Boolean queries
We can then process the query in increasing order of the size of each disjunctive term. For the following queries, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar?
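The O(x + y) bound comes from the standard merge-style intersection of two sorted postings lists, which can be sketched as follows (toy lists are ours):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(x + y) time by walking
    both lists in parallel, advancing the pointer with the smaller docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# intersect([1, 2, 4, 11, 31], [2, 31, 54]) → [2, 31]
```

Each step advances at least one pointer, so the total work is linear in the combined list lengths.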
The extended Boolean model versus ranked retrieval
Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw's own reference librarians for most of the information needs in his experiments. Although the major web search engines differ in their emphasis on free text querying, most of the basic issues and technologies of indexing and searching remain the same, as we will see in later chapters.
The book (Witten et al. 1999) is the standard reference for an in-depth comparison of the space and time efficiency of the inverted index versus other possible data structures; a more concise and up-to-date presentation appears in Zobel and Moffat (2006). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms a system uses (Section 2.2).
Document delineation and character sequence decoding
- Obtaining the character sequence in a document
- Choosing a document unit
Again, we must determine the document format, and then an appropriate decoder must be used. Finally, the text portion of the document may need to be extracted from other material that will not be processed.
The representation of short vowels (here /i/ and /u/) and the final /n/ (nunation) deviates from strict linearity by being represented as diacritics above and below letters. Day-to-day text is unvocalized (short vowels are not represented, but the letter for ¯a will still occur) or partially vocalized, with short vowels inserted in places where the author perceives ambiguities.
Determining the vocabulary of terms
- Tokenization
- Dropping common terms: stop words
- Normalization (equivalence classing of terms)
- Stemming and lemmatization
For example, French has a variant use of the apostrophe for the reduced definite article 'the' before a word beginning with a vowel (e.g., l'ensemble), and some uses of the hyphen with postposed clitic pronouns in imperatives and questions (e.g., donne-moi 'give me'). His name is written in syllabic katakana in the middle of the first line.
Faster postings list intersection via skip pointers
Building effective skip pointers is easy if the index is relatively static; it is harder if the postings lists keep changing because of updates. How many postings comparisons would be made if the postings lists were intersected without using skip pointers?
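A common heuristic is to place √P evenly spaced skip pointers on a postings list of length P. As a minimal sketch (function name and data ours), skips can be simulated by jumping √P positions ahead whenever the skipped-to entry does not overshoot:

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect sorted postings lists, jumping ahead by about sqrt(len)
    positions when the skip target is still <= the other list's docID."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow skips while the skip target does not overshoot p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                while i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                    i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                while j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                    j += skip2
            else:
                j += 1
    return answer
```

In a real index the skip pointers are stored in the postings list itself rather than recomputed; this sketch only shows the control flow of the algorithm.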
Positional postings and phrase queries
- Biword indexes
- Positional indexes
- Combination schemes
To process a query using such an extended biword index, we must also parse it into N's and X's, and then segment the query into extended biwords, which can be looked up in the index. Good queries to include in the phrase index are those known to be common based on recent querying behavior.
For further discussion of Chinese word segmentation, see Sproat et al. (1996), Sproat and Emerson (2003), Tseng et al. Silverstein et al. (1999) note that many queries without explicit phrase operators are actually implicit phrase searches.
Search structures for dictionaries
Here we develop techniques that are robust against typographical errors in the search query, as well as against alternative spellings. In section 3.1 we develop data structures that help search for vocabulary terms in an inverted index.
Wildcard queries
- General wildcard queries
We refer to the set of rotated terms in the permuterm index as the permuterm vocabulary. Write down the entries in the permuterm index dictionary that are generated by the term mama.
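The rotations can be generated mechanically: append the end-of-term symbol $ and rotate. As a sketch (function name ours):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; in a permuterm index each rotation
    is a dictionary entry pointing back to the original term."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

# permuterm_rotations("mama")
# → ['mama$', 'ama$m', 'ma$ma', 'a$mam', '$mama']
```

To answer a wildcard query such as m*a, one rotates the query so the * lands at the end (a$m*) and then does a prefix lookup among the rotations.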
Spelling correction
- Implementing spelling correction
- Forms of spelling correction
- Edit distance
- Context sensitive spelling correction
How many original vocabulary terms can there be in the postings list of a permuterm vocabulary term? A single scan of the postings (much as in Chapter 1) would let us enumerate all such terms; in the example of Figure 3.7, we would enumerate table, boardroom, and borders.
Phonetic correction
Pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits. Given a query (say herman), we compute its soundex code and then retrieve all vocabulary terms matching that soundex code from the soundex index, before running the resulting query on the standard inverted index.
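One common variant of the soundex algorithm just described can be sketched as follows (the letter-to-digit table is the standard one; the function name is ours):

```python
def soundex(term):
    """Soundex code: the original first letter followed by three digits.
    Vowels and h, w, y map to 0; consonants map to digit classes;
    runs of identical digits collapse; zeros are dropped; pad with 0s."""
    codes = {**dict.fromkeys("aeiouhwy", "0"),
             **dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    term = term.lower()
    digits = [codes.get(ch, "0") for ch in term]
    # collapse runs of identical consecutive digits
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]  # drop zeros after the first letter
    return (term[0].upper() + "".join(tail) + "000")[:4]

# soundex("herman") → "H655"
```

Several soundex variants exist (they differ in how h and w interact with the collapsing step); any of them works as the coarse equivalence-classing described above.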
We then introduce blocked sort-based indexing (Section 4.2), an efficient single-machine algorithm designed for static collections that can be seen as a more scalable version of the basic sort-based indexing algorithm we introduced in Chapter 1. Collections with frequent changes require the dynamic indexing introduced in Section 4.5, so that changes to the collection are immediately reflected in the index.
Hardware basics
The seek time is the time required to place the disk head in a new position. The transfer time per byte is the rate of transfer from disk to memory when the head is in the correct position.
Blocked sort-based indexing
Therefore, collecting all termID-docID pairs of the collection using 4 bytes each for termID and docID requires 0.8 GB of storage space. BSBI (i) segments the collection into parts of equal size, (ii) sorts each part's termID-docID pairs in memory, (iii) stores the sorted intermediate results on disk, and (iv) merges all intermediate results into the final index.
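The space estimate can be checked with a one-line calculation (our arithmetic, consistent with the figures quoted above: 0.8 GB at 8 bytes per pair implies roughly 100 million termID-docID pairs):

```python
# Worked space estimate for the sort-based index construction.
pairs = 100_000_000          # termID-docID pairs in the collection
bytes_per_pair = 4 + 4       # 4 bytes for termID, 4 bytes for docID
total_gb = pairs * bytes_per_pair / 1e9
# total_gb == 0.8
```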
Single-pass in-memory indexing
When memory runs out, the block's index (which consists of the dictionary and the postings lists) is written to disk (line 12). Compression further increases the efficiency of the algorithm because larger blocks can be processed and because the individual blocks require less disk space.
Distributed indexing
For example, Figure 4.5 shows three a–f segment files of the a–f partition, corresponding to the three parsers shown in the figure. Finally, the list of values for each key is sorted and written to the final sorted postings list ("postings" in the figure). (Note that postings in Figure 4.6 include term frequencies, whereas each posting in the other sections of this chapter is just a docID without term frequency information.) The data flow for a–f is shown in Figure 4.5.
Dynamic indexing
In this scheme, we process each posting ⌊T/n⌋ times because we touch it during each of ⌊T/n⌋ merges, where n is the size of the auxiliary index and T is the total number of postings. Create a table that shows, for each point in time at which T = 2^k tokens have been processed (1 ≤ k ≤ 15), which of the three indices I0, … are in use.
Other types of indexes
During query processing, a user's access postings list is intersected with the results list returned by the text part of the index. For this collection, compare the memory, disk, and time requirements of the simple algorithm in Figure 1.4 and of blocked sort-based indexing.
Search engines use some parts of the dictionary and the index much more than others. As a result, we are able to significantly reduce the response time of the IR system.
Statistical properties of terms in information retrieval
- Heaps’ law: Estimating the number of terms
- Zipf’s law: Modeling the distribution of terms
This chapter first gives a statistical characterization of the distribution of the entities we wish to compress – terms and postings in large collections (Section 5.1). This helps us characterize the properties of the algorithms for compressing postings lists in Section 5.3.
Dictionary compression
- Dictionary as a string
- Blocked storage
The pointer to the next term is also used to demarcate the end of the current term. We store the length of the term in the string as an extra byte at the beginning of the term.
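As a minimal sketch of blocked storage (function names, the block size k = 4, and the toy terms are ours), the dictionary string carries a one-byte length in front of each term, and term pointers are kept only for every k-th term:

```python
def pack_blocked(terms, k=4):
    """Pack sorted terms into one string, each preceded by a one-'byte'
    length; keep a string offset only for every k-th term (block start)."""
    pieces, pointers, pos = [], [], 0
    for i, t in enumerate(terms):
        if i % k == 0:
            pointers.append(pos)
        pieces.append(chr(len(t)) + t)
        pos += 1 + len(t)
    return "".join(pieces), pointers

def read_block(s, start, k):
    """Decode up to k terms of one block starting at offset `start`."""
    terms, pos = [], start
    for _ in range(k):
        if pos >= len(s):
            break
        n = ord(s[pos])                 # length byte
        terms.append(s[pos + 1: pos + 1 + n])
        pos += 1 + n
    return terms

s, ptrs = pack_blocked(["aid", "box", "den", "ex", "job"], k=4)
# ptrs → [0, 15]; read_block(s, 0, 4) → ['aid', 'box', 'den', 'ex']
```

Lookup proceeds by binary search over the block pointers, then a linear scan of at most k terms within the block.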
Postings file compression
- Variable byte codes
With VB compression, the size of the compressed index for Reuters-RCV1 is 116 MB, as we verified in an experiment. Suppose the length of the postings list is stored separately, so that the system knows when a postings list is complete.
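Variable byte encoding stores 7 bits of a gap per byte and uses the eighth bit as a continuation flag on the final byte. A compact sketch (function names ours):

```python
def vb_encode_number(n):
    """VB-encode one gap: 7 payload bits per byte; the continuation bit
    (value 128) is set on the last byte of the number."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128           # mark the terminating byte
    return bytes(out)

def vb_encode(gaps):
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data):
    numbers, n = [], 0
    for b in data:
        if b < 128:
            n = 128 * n + b              # continuation byte
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

# vb_encode_number(824) → the two bytes 6, 184
```

Because the terminating byte is self-marking, no separate length field is needed to delimit individual numbers within a postings list.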
In recent work, Anh and Moffat (2005; 2006a) and Zukowski et al. (2006) have constructed word-aligned binary codes that are both faster in decompression and at least as space-efficient as VB codes. Zhang et al. (2007) investigate the increased effectiveness of caching when a number of different postings list compression techniques are used on modern hardware. Although Elias codes are only asymptotically optimal, arithmetic codes (Witten et al. 1999, Section 2.4) can be constructed to be arbitrarily close to the optimum H(P) for any P.
Parametric and zone indexes
- Weighted zone scoring
- Learning weights
We now consider a simple case of weighted zone scoring, where each document has a title zone and a body zone. For the value of g estimated in Exercise 6.5, compute the weighted zone score for each (query, document) example.
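With a Boolean match function per zone, weighted zone scoring reduces to summing the weights of the zones that match. A minimal sketch (function name, toy document, and weights are ours; the weights are assumed to sum to 1):

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Sum the weight g of every zone whose text contains all query terms
    (a Boolean AND match per zone)."""
    score = 0.0
    for zone, text in doc_zones.items():
        tokens = set(text.lower().split())
        if all(t.lower() in tokens for t in query_terms):
            score += weights[zone]
    return score

doc = {"title": "shakespeare collected works",
       "body": "the works of shakespeare"}
weighted_zone_score(["shakespeare"], doc, {"title": 0.7, "body": 0.3})  # → 1.0
```

If the query matched only the body, the score would be 0.3; the learning-to-weight material referenced above is about estimating these g values from relevance judgments.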
Term frequency and weighting
- Inverse document frequency
- Tf-idf weighting
An immediate idea is to scale down the weights of terms with high collection frequency, defined as the total number of occurrences of a term in the collection. If the logarithm in (6.7) is computed base 2, suggest a simple approximation to the idf of a term.
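The standard form of this scaling uses document frequency rather than collection frequency: idf_t = log(N / df_t), and the tf-idf weight is the product tf × idf. A minimal sketch (function names ours; base-10 logarithm assumed, as is conventional):

```python
import math

def idf(df_t, N):
    """Inverse document frequency: idf_t = log10(N / df_t), where N is
    the number of documents and df_t the document frequency of term t."""
    return math.log10(N / df_t)

def tf_idf(tf_t_d, df_t, N):
    """tf-idf weight of term t in document d."""
    return tf_t_d * idf(df_t, N)

# With N = 1000 documents and df_t = 10: idf = log10(100) = 2.0
```

A rare term (small df_t) thus gets a high weight, while a term occurring in every document gets idf = 0 and contributes nothing to the score.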
The vector space model for scoring
- Dot products
- Queries as vectors
- Computing vector scores
We now apply Euclidean normalization to the tf values for each of the three documents in the table. This measure is the cosine of the angle θ between the two vectors, shown in Figure 6.10.
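Euclidean normalization and the cosine measure can be sketched with sparse term-weight dictionaries (function names and toy vectors are ours):

```python
import math

def normalize(vec):
    """Euclidean (L2) normalization: divide every component by the
    vector's length, yielding a unit vector."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(v1, v2):
    """Cosine similarity: the dot product of the two unit vectors,
    i.e. cos(theta) for the angle theta between v1 and v2."""
    u1, u2 = normalize(v1), normalize(v2)
    return sum(u1[t] * u2.get(t, 0.0) for t in u1)

# cosine({"car": 1.0, "auto": 1.0}, {"car": 1.0}) ≈ 0.707
```

After normalization the score depends only on the direction of the vectors, so two documents with the same word proportions but different lengths receive the same similarity.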
Variant tf-idf functions
- Sublinear tf scaling
- Maximum tf normalization
- Document and query weighting schemes
Verify that the sum of the squares of the components of each of the document vectors in Exercise 6.15 is 1 (to within rounding error). By turning a query into a unit vector in Figure 6.13, we assigned equal weights to each of the query terms.
With Section 7.1 in place, we essentially have all the components needed for a complete search engine. In Section 7.2, we outline a complete search engine, including indexes and structures to support not only cosine scoring but also more general ranking factors such as query term proximity.
Efficient scoring and ranking
- Inexact top K document retrieval
- Index elimination
- Champion lists
- Static quality scores and ordering
- Impact ordering
- Cluster pruning
We only consider documents that contain many (and as a special case, all) of the query terms. First, consider ordering the documents in the postings list for each term by decreasing value of g(d).
Components of an information retrieval system
- Tiered indexes
- Query-term proximity
- Designing parsing and scoring functions
- Putting it all together
Especially for free-text queries on the web (Chapter 19), users prefer a document in which most or all query terms appear close to each other. Let ω be the width of the smallest window in a document that contains all search terms, measured in terms of the number of words in the window.
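The width ω can be computed with a standard sliding-window scan over the tokenized document (function name and toy input are ours; this is one reasonable way to implement the definition, not the book's code):

```python
from collections import Counter

def smallest_window(doc_tokens, query_terms):
    """Width, in tokens, of the smallest window of doc_tokens containing
    every query term; None if some query term never occurs."""
    required = set(query_terms)
    if not required.issubset(set(doc_tokens)):
        return None
    counts = Counter()
    covered = 0
    best = len(doc_tokens)
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in required:
            counts[tok] += 1
            if counts[tok] == 1:
                covered += 1
        # shrink from the left while the window still covers all terms
        while covered == len(required):
            best = min(best, right - left + 1)
            if doc_tokens[left] in required:
                counts[doc_tokens[left]] -= 1
                if counts[doc_tokens[left]] == 0:
                    covered -= 1
            left += 1
    return best

# smallest_window("a b c b a".split(), ["a", "c"]) → 3
```

A scoring function can then reward small ω, since a tight window suggests the query terms are used together.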
Vector space scoring and query operator interaction
Vector space scoring supports so-called free text retrieval, where a query is specified as a set of words without any query operators connecting them. If a search engine allows a user to specify a wildcard operator as part of a free text query (e.g., the query rom* restaurant), we can interpret the wildcard component of the query as forming multiple terms in the vector space (in this example, rom…).
In this chapter, we begin by discussing the performance measurement of IR systems (Section 8.1) and the test suites most commonly used for this purpose (Section 8.2). We then extend these notions and develop further measures for evaluating ranked search results (Section 8.4) and discuss the development of reliable and informative test suites (Section 8.5).
Information retrieval system evaluation
Standard test collections
Rather, NIST assessors' relevance judgments are available only for those documents that were among the top k returned by some system entered into the TREC evaluation for which the information need was developed. Nevertheless, the size of GOV2 is still more than two orders of magnitude smaller than the current size of the document collections indexed by the major web search companies.
Evaluation of unranked retrieval sets
The harmonic mean is always less than or equal to the arithmetic mean and the geometric mean, and when the two numbers differ greatly it is very close to the minimum of the two.
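This is the motivation for the F measure, whose balanced form F1 is the harmonic mean of precision and recall. A minimal sketch (function name ours):

```python
def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R);
    beta = 1 gives F1, the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# f_measure(1.0, 0.1) ≈ 0.18, while the arithmetic mean would be 0.55:
# the harmonic mean stays close to the weaker of the two numbers.
```

This behavior is exactly why F1 punishes a system that trivially maximizes one measure (e.g., recall 1.0 by returning everything) at the expense of the other.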
Evaluation of ranked retrieval results
For a single information need, average precision is approximately the area under the uninterpolated recall precision curve, so MAP is approximately the average area under the recall precision curve for a set of queries. Like precision at k, R-precision describes only one point on the precision-recall curve, rather than trying to summarize performance across the curve, and it is somewhat unclear why you should be interested in the breakpoint rather than the best point on the curve (the point with maximum F-measure) or search level of interest for a particular application (accuracy atk).