
Information Retrieval: Implementing and Evaluating Search Engines

Prof. Swati Joshi

Academic year: 2023


Full text

9 Language Modeling and Related Methods

  • 9.1 Generating Queries from Documents
  • 9.2 Language Models and Smoothing
  • 9.3 Ranking with Language Models
  • 9.4 Kullback-Leibler Divergence
  • 9.5 Divergence from Randomness

  • NEXI
  • XQuery

Stefan Büttcher, Charles Clarke, and Gordon Cormack represent three generations of stellar information retrieval researchers with more than fifty years of combined experience. The authors provide a tutorial overview of current information retrieval research, with hundreds of references to the literature, but they go well beyond the usual survey.

Preface

These references and exercises are also an opportunity to mention important concepts and topics that the main body of the chapter could not cover. The book's organization allows readers to focus on different aspects of the topic.

Notation

I Foundations

1 Introduction

What Is Information Retrieval?

  • Web Search
  • Other Search Applications
  • Other IR Applications

If you have a computer connected to the Internet, pause for a minute, open a browser, and try the query "information retrieval" on one of the major commercial Web search engines. Look through the next ten results and decide whether any of them could better replace one of the top ten results.

Information Retrieval Systems

  • Basic IR System Architecture
  • Documents and Update
  • Performance Evaluation

Depending on the information need, a search term can be a date, a number, a musical note, or a sentence. In particular, the basic notion of relevance can be extended to take the size and scope of the returned documents into account.

Working with Electronic Text

  • Text Formats
  • A Simple Tokenization of English Text
  • Term Distributions
  • Language Modeling

For example, the content of a PostScript document is encoded in a version of the programming language Forth. Here we have one parameter per term: the probability that the term appears next in the (as yet unseen) text.
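
The "one parameter per term" idea can be sketched as a maximum-likelihood unigram language model. This is a minimal illustration, not the book's exact formulation, and the token list is hypothetical:

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram language model: one parameter per
    term, the probability that the term appears next in the text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

# Hypothetical tokenized text.
model = unigram_model(["to", "be", "or", "not", "to", "be"])
```

Under this model the probabilities sum to one, and a term's probability is simply its relative frequency in the observed text.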

Test Collections

  • TREC Tasks

The larger of the two document sets used in our experiments is the previously mentioned GOV2 corpus. The TREC45 collection can be obtained from the NIST Standard Reference Data Products Web site as Special Databases 22 and 23. The GOV2 collection is distributed by the University of Glasgow. The topics and qrels for these collections can be obtained from the TREC data archive.

Open-Source IR Systems

  • Lucene
  • Indri
  • Wumpus

Instead, each part of the text collection may represent a potential unit of retrieval, depending on the structural search constraints specified in the query. In addition, it can perform real-time index updates (i.e., add files to or remove files from the index) and supports multi-user security restrictions, which are useful if the system has more than one user and each user is allowed to search only part of the index.

Further Reading

Review articles on specific topics appear regularly as part of the Foundations and Trends in Information Retrieval journal series. The Encyclopedia of Database Systems (Özsu and Liu, 2009) contains many introductory articles on topics related to information retrieval.

Exercises

Following the style of Figure 1.8, create three to four topics suitable for testing retrieval performance on the English-language Wikipedia. Submit the titles of the topics you created in Exercise 1.10 as queries to the system.

Bibliography

2 Basic Techniques

Inverted Indices

  • Extended Example: Phrase Search
  • Implementing Inverted Indices
  • Documents and Other Elements

Suppose we want to find all occurrences of the phrase "first witch" in our collection of Shakespeare's plays. If, at the end of the loop, the phrase occurs in the interval [position, v], the algorithm reports the occurrence ending at v.
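
Phrase search over per-term position lists can be sketched with a next() primitive, in the spirit of the chapter's extended example. The simplified forward-only retry strategy and the helper names here are my own, not the book's exact algorithm:

```python
import bisect

def next_occurrence(postings, position):
    """Smallest position in the sorted postings list strictly after `position`."""
    i = bisect.bisect_right(postings, position)
    return postings[i] if i < len(postings) else None

def phrase_occurrences(index, terms):
    """All intervals [u, v] where `terms` occur as a contiguous phrase.
    `index` maps each term to its sorted list of positions."""
    results, position = [], -1
    while True:
        u = next_occurrence(index[terms[0]], position)
        if u is None:
            return results
        v = u
        for t in terms[1:]:                 # chain next() through the phrase
            v = next_occurrence(index[t], v)
            if v is None:
                return results
        if v - u == len(terms) - 1:         # terms are adjacent: a match
            results.append((u, v))
            position = u
        else:
            position = v - len(terms)       # retry just before the tail

occ = phrase_occurrences({"first": [1, 9], "witch": [2, 5, 10]},
                         ["first", "witch"])
```

Because each retry strictly advances the start position, the loop terminates after at most one iteration per occurrence of the first term.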

Retrieval and Ranking

  • The Vector Space Model
  • Proximity Ranking
  • Boolean Retrieval

These features include the length of the document (ld) relative to the average document length (lavg), as well as the number of documents in which a term appears (Nt) relative to the total number of documents in the collection (N). Query processing for the vector space model is straightforward (Figure 2.9), essentially performing a merge of the postings lists for the query terms. However, there are at most n·l covers, where n is the number of query terms and l is the length of the shortest of the query terms' postings lists.
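
The term statistics mentioned above (tf, Nt, N, ld) can be combined in a minimal TF-IDF ranking sketch. The exact vector-space formula of Figure 2.9 differs, so treat this as an illustration only, with hypothetical document IDs:

```python
import math
from collections import defaultdict

def rank_vector_space(postings, doc_len, N, query):
    """Score documents by summing TF-IDF contributions from each query
    term's postings list, then normalize by document length ld.
    `postings` maps term -> {doc: term frequency}; N is collection size."""
    scores = defaultdict(float)
    for t in query:
        if t not in postings:
            continue
        Nt = len(postings[t])               # documents containing t
        idf = math.log(N / Nt)
        for doc, tf in postings[t].items():
            scores[doc] += tf * idf
    return sorted(((s / doc_len[d], d) for d, s in scores.items()),
                  reverse=True)
```

Processing one term at a time like this is closer to term-at-a-time evaluation; a merge of the postings lists would visit documents in order instead.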

Evaluation

  • Recall and Precision
  • Effectiveness Measures for Ranked Retrieval
  • Building a Test Collection
  • Efficiency Measures

Running this query over the TREC45 collection produces a set of 881 documents, representing 0.17% of the half-million documents in the collection. If the user starts reading from the top of the list, she will find four relevant documents among the top ten. In the case of the TREC45 collection, this change hurts performance, but it significantly improves performance on the GOV2 collection.
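
Four relevant documents among the top ten corresponds to a precision of 0.4 at depth 10. A minimal sketch of the two measures, using hypothetical document IDs and judgments:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(relevant)

# Hypothetical top-ten ranking with four relevant documents,
# mirroring the passage: precision at depth 10 is 4/10 = 0.4.
top10 = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
judged_relevant = {"d1", "d3", "d6", "d9", "d42"}
p_at_10 = precision(top10, judged_relevant)
```

Truncating the ranked list at different depths and recomputing both measures yields the recall-precision trade-off discussed in this section.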

Summary

To increase the accuracy of the measurements, the set of queries can be executed multiple times, with the system reset each time and the average of the measured execution times used to compute the mean response time. As an example, Table 2.7 compares the average response time of a schema-independent index with that of a frequency index, using the Wumpus implementation of the Okapi BM25 ranking function (shown in Table 2.6). To a user, a response time of 202 ms would seem instantaneous, while a response time of 4.7 seconds would be a noticeable delay.

Further Reading

The proximity ranking algorithm presented in this chapter is a simplified version of the algorithm presented in that paper. By the time of the first TREC experiments in the early 1990s, the vector space model had evolved into a form very close to that presented in this chapter (Buckley et al., 1994). Latent Semantic Analysis (LSA) is an important and well-known extension of the vector space model (Deerwester et al., 1990) that we do not cover in this book.

Exercises

On the other hand, the postings list for "the" need not be read into memory in its entirety, since only a very small number of its postings can be part of the target phrase. Test your implementation using the test collection developed in Exercise 2.13 or any other available collection, such as one of the TREC collections. Each student should then run the topic titles as queries against his or her system.

Bibliography

3 Tokens and Terms

  • English
    • Punctuation and Capitalization
    • Stemming
    • Stopping
  • Characters
  • Character N-Grams
  • European Languages
  • CJK Languages
  • Further Reading
  • Exercises
  • Bibliography

The name of the band "The The" is another example, one that further demonstrates the importance of capitalization. A code point is written in the form U+nnnn, where nnnn gives the value of the code point in hexadecimal. If the most significant bit is 0, so that the byte has the form 0xxxxxxx, the encoding is one byte long.
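
The leading-byte rule extends to UTF-8's multi-byte sequences: 110xxxxx introduces a two-byte sequence, 1110xxxx three bytes, and 11110xxx four. A small sketch of decoding the sequence length from the leading byte:

```python
def utf8_sequence_length(first_byte):
    """Number of bytes in a UTF-8 sequence, read off its leading byte:
    0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4."""
    if first_byte >> 7 == 0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 leading byte")
```

For example, "a" (U+0061) encodes in one byte, "é" (U+00E9) in two, and "€" (U+20AC) in three.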

II Indexing

4 Static Inverted Indices

Index Components and Index Life Cycle

It is the job of the dictionary to provide this mapping from terms to the locations of their postings lists in the index. Query processing: the information stored in the index built in Phase 1 is used to process search queries. We also discuss how the organization of the dictionary and the postings lists should differ from that suggested in the first part of the chapter if we want to maximize indexing performance.

The Dictionary

For a typical collection of natural language text, the dictionary is relatively small compared with the total size of the index. Obviously, it is not possible to allocate 74 KB of memory for each term in the dictionary. The dictionary can thus represent a major bottleneck in the indexing process, so lookups should be as fast as possible.

Postings Lists

It contains a copy of a subset of the postings in the list, for example a copy of every 5,000th posting. Taken further, this leads to a multi-level static B-tree that provides efficient random access into the postings list. However, by compressing postings in small chunks, where the start of each chunk corresponds to a synchronization point in the term's postings list, the search engine can provide efficient random access even for compressed postings lists.
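
Per-term synchronization points can be sketched as a one-level skip structure: binary-search the copied postings to find the right chunk, then search within it. This is a simplified, uncompressed illustration with a tiny chunk size:

```python
import bisect

class PostingsList:
    """Postings list with per-term synchronization points: a copy of
    every k-th posting, used to jump close to a target before scanning."""
    def __init__(self, postings, k=5000):
        self.postings = postings
        self.k = k
        self.sync = postings[::k]           # every k-th posting

    def next(self, position):
        """First posting strictly greater than `position`,
        located via the sync points."""
        block = bisect.bisect_right(self.sync, position)
        start = max(0, (block - 1) * self.k)            # chunk boundary
        i = bisect.bisect_right(self.postings, position, lo=start)
        return self.postings[i] if i < len(self.postings) else None
```

With compressed chunks, the inner search would first decompress the single chunk that starts at the chosen synchronization point.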

Interleaving Dictionary and Postings Lists

Choosing a block size of B = 1,024 bytes reduces the number of in-memory dictionary entries to about 3 million. As the block size increases, the number of in-memory dictionary entries decreases while index access latency increases. Each in-memory dictionary entry is of the form (term, posting), indicating the first term and the first posting in a given index block.
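
Locating the on-disk block that may contain a term then amounts to a binary search over the in-memory entries. A sketch with hypothetical terms and byte offsets (the per-call list rebuild is for brevity only):

```python
import bisect

def find_block(dictionary, term):
    """`dictionary` is a sorted list of (first_term, block_offset) pairs,
    one per on-disk index block. Return the offset of the only block
    that can contain `term`, or None if `term` sorts before all blocks."""
    first_terms = [t for t, _ in dictionary]     # rebuilt here for brevity
    i = bisect.bisect_right(first_terms, term) - 1
    return dictionary[i][1] if i >= 0 else None

# Hypothetical interleaved index: three blocks of B bytes each.
blocks = [("aardvark", 0), ("kangaroo", 1024), ("search", 2048)]
```

The block found this way must then be read from disk and scanned for the term itself, which is the latency cost of larger blocks.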

Index Construction

  • In-Memory Index Construction
  • Sort-Based Index Construction
  • Merge-Based Index Construction

The other interesting aspect of the simple in-memory index construction method, besides the dictionary implementation, is the implementation of the extensible in-memory postings lists. The figure shows that the performance of the final merge operation depends heavily on the amount of main memory available to the indexing process, whereas the performance of phase 1 (building the index partitions) is largely independent of the amount of available main memory.
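
The final merge of merge-based index construction can be sketched with sorted runs of (term, docid) postings. In a real system the partitions stream from disk rather than living in memory; this simplified version uses in-memory lists:

```python
import heapq

def merge_partitions(partitions):
    """Final merge of merge-based index construction: each partition is
    a sorted run of (term, doc_id) postings. heapq.merge combines them
    into one sorted stream, which is grouped into per-term lists."""
    index = {}
    for term, doc in heapq.merge(*partitions):
        index.setdefault(term, []).append(doc)
    return index

merged = merge_partitions([
    [("a", 1), ("b", 1)],        # partition built from documents 1..n
    [("a", 2), ("c", 2)],        # partition built from the rest
])
```

Because heapq.merge only ever holds one entry per partition, the merge needs memory proportional to the number of partitions, not the collection size.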

Other Types of Indices

Summary

Further Reading

Exercises

The number of postings per synchronization point is called the granularity of the index. For the access pattern above, what is the optimal granularity (i.e., the one that minimizes disk I/O)? However, the merge-based index construction method of Section 4.5.3 has a running time that is linear in the collection size (see Table 4.7, page 130).

Bibliography

5 Query Processing

Query Processing for Ranked Retrieval

  • Document-at-a-Time Query Processing
  • Term-at-a-Time Query Processing
  • Precomputing Score Contributions
  • Impact Ordering
  • Static Index Pruning

The worst-case time complexity of the revised version of the document-at-a-time algorithm is […]. Instead of combining the query terms' postings lists using a heap, the search engine examines all (or some) of the postings for each query term in turn. For the term-at-a-time algorithm with accumulator pruning (Figure 5.5), for example, we had to resort to some tricks to efficiently obtain an approximation of the highest-scoring postings in ti's postings list.
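
A document-at-a-time traversal can be sketched with a heap keyed on document ID. The data layout here is hypothetical (per-posting score contributions are assumed precomputed, whereas the chapter's algorithms work from raw postings and a ranking function):

```python
import heapq

def document_at_a_time(postings, k):
    """Walk all query-term postings lists in parallel via a heap keyed
    on doc id, score each document exactly once, and keep the top k.
    `postings` maps term -> sorted list of (doc, score contribution)."""
    iters = {t: iter(p) for t, p in postings.items()}
    heap = []
    for t, it in iters.items():
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], first[1], t))
    top = []                                   # min-heap of (score, doc)
    while heap:
        doc = heap[0][0]
        total = 0.0
        while heap and heap[0][0] == doc:      # gather all terms for doc
            _, s, t = heapq.heappop(heap)
            total += s
            nxt = next(iters[t], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], nxt[1], t))
        heapq.heappush(top, (total, doc))
        if len(top) > k:
            heapq.heappop(top)                 # drop current lowest score
    return sorted(top, reverse=True)
```

Each posting is pushed and popped once, so the traversal costs O(P log n) for P postings and n query terms, plus O(log k) per scored document for the top-k heap.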

Lightweight Structure

  • Generalized Concordance Lists
  • Operators
  • Implementation

The G() function is applied to ensure that the result is a GC list. The combination operator joins two GC lists: each interval in the result is an interval of one of the operands. Any larger interval that satisfies the Boolean expression will contain an interval from the resulting GC list. An interval in the resulting GC list starts with an interval from A and ends with an interval from B.
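
The "starts with an interval from A and ends with an interval from B" operator can be sketched on sorted, non-nested interval lists, with a final pass that keeps only minimal intervals, playing the role of the G() normalization. The function name and two-pass structure are my own simplification:

```python
def followed_by(A, B):
    """Sketch of a GC-list operator: each result interval starts with an
    interval from A and ends with a strictly later interval from B.
    A and B are GC lists: sorted intervals with no nesting."""
    candidates, j = [], 0
    for a_start, a_end in A:
        while j < len(B) and B[j][0] <= a_end:   # skip B intervals that
            j += 1                               # start inside or before a
        if j == len(B):
            break
        candidates.append((a_start, B[j][1]))
    result = []                                  # G(): drop non-minimal
    for c in candidates:
        while result and result[-1][0] <= c[0] and result[-1][1] >= c[1]:
            result.pop()                         # previous contains c
        result.append(c)
    return result
```

Because both inputs are GC lists, the index j never moves backwards, so the candidate pass is linear in the combined input size.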

Further Reading

Additional information on the efficient implementation of such combinational operators can be found in Clarke and Cormack's discussion of algorithms for supporting these queries efficiently in the context of a relational database system.

Exercises

Bibliography

In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–50. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–198. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 219–225.

6 Index Compression

General-Purpose Data Compression

The final section (Section 6.4) covers compression algorithms for dictionary data structures and shows how the memory requirements of the search engine can be substantially reduced by storing the dictionary data in memory in compressed form. In a lossy method, the decompressed message is not an exact copy of the original but an approximation of it. In this chapter we focus exclusively on lossless compression algorithms, in which the decoder produces an exact copy of the original data.

Symbolwise Data Compression

  • Modeling and Coding
  • Huffman Coding

For example, the character "u", which appears 114,592 times in the collection, lies somewhere in the middle of the overall frequency range. A model M in which the probability of a symbol is independent of the previously seen symbols is called a zero-order model. Based on the lengths of the code words, this code also appears to be optimal with respect to M0.
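
A Huffman code for a zero-order model can be sketched with a heap of subtrees: repeatedly merge the two least-frequent subtrees, prefixing "0" and "1" to the code words on either side, so code-word length grows with rarity. The frequencies below are hypothetical:

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code from symbol frequencies.  Heap entries are
    (frequency, tiebreak, symbols-in-subtree); merging two subtrees
    extends every code word in them by one bit."""
    heap = [(f, i, [s]) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    code = {s: "" for s in freqs}
    counter = len(heap)                     # fresh tiebreak for merges
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1:
            code[s] = "0" + code[s]
        for s in s2:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (f1 + f2, counter, s1 + s2))
        counter += 1
    return code

# Hypothetical zero-order frequencies.
code = huffman_code({"a": 5, "b": 2, "c": 1, "d": 1})
```

The most frequent symbol receives the shortest code word, and no code word is a prefix of another, so the encoded stream decodes unambiguously.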
