4 Static Inverted Indices

4.4 Interleaving Dictionary and Postings Lists

disk. This decreases the seek distance between the individual lists and leads to better query performance. If lists were stored on disk in some random order, then disk seeks and rotational latency alone would account for almost a minute (4,365×12 ms), not taking into account any of the other operations that need to be carried out when processing the query. By arranging the inverted lists in lexicographical order of their respective terms, a query asking for all documents matching “inform∗” can be processed in less than 2 seconds when using a frequency index; with a schema-independent index, the same query takes about 6 seconds. Storing the lists in the inverted file in some predefined order (e.g., lexicographical) is also important for efficient index updates, as discussed in Chapter 7.

A separate positional index

If the search engine is based on a document-centric positional index (containing a docid, a frequency value, and a list of within-document positions for each document that a given term appears in), it is not uncommon to divide the index data into two separate inverted files: one file containing the docid and frequency component of each posting, the other file containing the exact within-document positions. The rationale behind this division is that for many queries — and many scoring functions — access to the positional information is not necessary. By excluding it from the main index, query processing performance can be increased.
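The two-file division can be sketched as follows. This is a minimal illustration under invented assumptions, not any engine's actual on-disk format: the main file holds fixed-width (docid, frequency, positions-offset, positions-count) records, while the within-document positions live in a second file that is only read when a query actually needs them.

```python
# A minimal sketch of the two-file split, assuming a simplified
# fixed-width little-endian record layout (illustrative only).
import io
import struct

def write_split_index(postings, main_f, pos_f):
    """Write (docid, freq, positions) postings for one term.

    The main file holds (docid, freq, pos_offset, pos_count) records;
    the positions file holds the raw within-document positions."""
    for docid, freq, positions in postings:
        offset = pos_f.tell()
        pos_f.write(struct.pack(f"<{len(positions)}I", *positions))
        main_f.write(struct.pack("<4I", docid, freq, offset, len(positions)))

def read_docids_and_freqs(main_f):
    """Frequency-only access: the positions file is never touched."""
    main_f.seek(0)
    result = []
    while chunk := main_f.read(16):
        docid, freq, _, _ = struct.unpack("<4I", chunk)
        result.append((docid, freq))
    return result

def read_positions(main_f, pos_f, docid):
    """Positional access: one extra lookup into the positions file."""
    main_f.seek(0)
    while chunk := main_f.read(16):
        d, _, offset, count = struct.unpack("<4I", chunk)
        if d == docid:
            pos_f.seek(offset)
            return list(struct.unpack(f"<{count}I", pos_f.read(4 * count)))
    return None

# In-memory stand-ins for the two inverted files.
main_f, pos_f = io.BytesIO(), io.BytesIO()
write_split_index([(1, 2, [4, 17]), (5, 1, [3])], main_f, pos_f)
```

A scoring function that ignores positions calls read_docids_and_freqs and pays no I/O cost for positional data; a phrase query calls read_positions only for candidate documents.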

Table 4.3 Number of unique terms, term bigrams, and trigrams for our three text collections. The number of unique bigrams is much larger than the number of unique terms, by about one order of magnitude.

                Tokens      Unique Words   Unique Bigrams   Unique Trigrams
  Shakespeare   1.3×10⁶     2.3×10⁴        2.9×10⁵          6.5×10⁵
  TREC45        3.0×10⁸     1.2×10⁶        2.5×10⁷          9.4×10⁷
  GOV2          4.4×10¹⁰    4.9×10⁷        5.2×10⁸          2.3×10⁹

phrase queries. Unfortunately, the number of unique bigrams in a text collection is substantially larger than the number of unique terms. Table 4.3 shows that GOV2 contains only about 49 million distinct terms, but 520 million distinct term bigrams. Not surprisingly, if trigrams are to be indexed instead of bigrams, the situation becomes even worse — with 2.3 billion different trigrams in GOV2, it is certainly not feasible to keep the entire dictionary in main memory anymore.

Storing the entire dictionary on disk would satisfy the space requirements but would slow down query processing. Without any further modifications an on-disk dictionary would add at least one extra disk seek per query term, as the search engine would first need to fetch each term’s dictionary entry from disk before it could start processing the given query. Thus, a pure on-disk approach is not satisfactory, either.

Figure 4.5 Interleaving dictionary and postings lists: Each on-disk inverted list is immediately preceded by the dictionary entry for the respective term. The in-memory dictionary contains entries for only some of the terms. In order to find the postings list for “shakespeareanism”, a sequential scan of the on-disk data between “shakespeare” and “shaking” is necessary.

A possible solution to this problem is called dictionary interleaving, shown in Figure 4.5. In an interleaved dictionary all entries are stored on disk, each entry right before the respective postings list, to allow the search engine to fetch dictionary entry and postings list in one sequential read operation. In addition to the on-disk data, however, copies of some dictionary entries

Table 4.4 The impact of dictionary interleaving on a schema-independent index for GOV2 (49.5 million distinct terms). By choosing an index block size B = 16,384 bytes, the number of in-memory dictionary entries can be reduced by over 99%, at the cost of a minor query slowdown: 1 ms per query term.

  Index block size (in bytes)              1,024   4,096   16,384   65,536   262,144
  No. of in-memory dict. entries (×10⁶)    3.01    0.91    0.29     0.10     0.04
  Avg. index access latency (in ms)        11.4    11.6    12.3     13.6     14.9

(but not all of them) are kept in memory. When the search engine needs to determine the location of a term’s postings list, it first performs a binary search on the sorted list of in-memory dictionary entries, followed by a sequential scan of the data found between two such entries. For the example shown in the figure, a search for “shakespeareanism” would first determine that the term’s postings list (if it appears in the index) must be between the lists for “shakespeare” and “shaking”. It would then load this index range into memory and scan it in a linear fashion to find the dictionary entry (and thus the postings list) for the term “shakespeareanism”.
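A minimal sketch of this lookup, using toy data: the on-disk index is simulated by a Python list of interleaved (term, postings) entries, and the in-memory dictionary stores, for some terms only, the position of that term's interleaved entry. Names and data are illustrative, not from the book.

```python
# Sketch of the interleaved lookup: binary search over the sparse
# in-memory dictionary, then a sequential scan of the "on-disk" data.
from bisect import bisect_right

# Simulated on-disk index: dictionary entry + postings, interleaved,
# in lexicographical order of terms.
disk_index = [
    ("shakespeare", [7, 42]),
    ("shakespearean", [3]),
    ("shakespeareanism", [19]),
    ("shaking", [5, 8]),
]

# In-memory dictionary: entries for only SOME terms, each giving the
# position of that term's interleaved entry in the on-disk index.
in_memory = [("shakespeare", 0), ("shaking", 3)]

def find_postings(term):
    # Binary search for the closest preceding in-memory entry ...
    i = bisect_right(in_memory, (term, float("inf"))) - 1
    if i < 0:
        return None  # term precedes everything in the index
    start = in_memory[i][1]
    # ... then scan the on-disk data sequentially from there.
    for t, postings in disk_index[start:]:
        if t == term:
            return postings
        if t > term:
            break  # passed the position the term would occupy
    return None
```

For “shakespeareanism”, the binary search lands on the in-memory entry for “shakespeare”, and the sequential scan runs forward until it finds the term (or passes the place where it would have been).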

Dictionary interleaving is very similar to the self-indexing technique from Section 4.3, in the sense that random access disk operations are avoided by reading a little bit of extra data in a sequential manner. Because sequential disk operations are so much faster than random access, this trade-off is usually worthwhile, as long as the additional amount of data transferred from disk into main memory is small. In order to make sure that this is the case, we need to define an upper limit for the amount of data found between each on-disk dictionary entry and the closest preceding in-memory dictionary entry. We call this upper limit the index block size. For instance, if it is guaranteed for every term T in the index that the search engine does not need to read more than 1,024 bytes of on-disk data before it reaches T’s on-disk dictionary entry, then we say that the index has a block size of 1,024 bytes.
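Under this definition, the in-memory entries can be chosen with a single greedy pass over the sorted terms: keep an entry whenever more than B bytes have accumulated since the last kept entry. The function below is a hypothetical sketch of that pass; the interface and the term sizes in the test data are invented.

```python
# Greedy sketch: choose the in-memory dictionary entries so that no
# more than block_size bytes precede any term's on-disk entry when
# scanning from the closest preceding in-memory entry.
# Interface and sizes are hypothetical, for illustration only.

def select_in_memory_entries(terms_with_sizes, block_size):
    """terms_with_sizes: (term, bytes) pairs in lexicographical order,
    where bytes covers the interleaved dictionary entry plus postings.
    Returns the sampled (term, byte_offset) in-memory entries."""
    entries, offset, last_kept = [], 0, None
    for term, size in terms_with_sizes:
        # Keep this term if scanning from the last kept entry would
        # otherwise exceed the block size before reaching it.
        if last_kept is None or offset - last_kept > block_size:
            entries.append((term, offset))
            last_kept = offset
        offset += size
    return entries
```

Note that a frequent term whose postings list is far larger than block_size still contributes only a single in-memory entry under this scheme, which is why the entry counts in Table 4.4 are much smaller than the index size divided by the block size.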

Table 4.4 quantifies the impact that dictionary interleaving has on the memory consumption and the list access performance of the search engine (using GOV2 as a test collection). Without interleaving, the search engine needs to maintain about 49.5 million in-memory dictionary entries and can access the first posting in a random postings list in 11.3 ms on average (random disk seek + rotational latency). Choosing a block size of B = 1,024 bytes, the number of in-memory dictionary entries can be reduced to 3 million. At the same time, the search engine’s list access latency (accessing the first posting in a randomly chosen list) increases by only 0.1 ms — a negligible overhead. As we increase the block size, the number of in-memory dictionary entries goes down and the index access latency goes up. But even for a relatively large block size of B = 256 KB, the additional cost — compared with a complete in-memory dictionary — is only a few milliseconds per query term.

Note that the memory consumption of an interleaved dictionary with block size B is quite different from maintaining an in-memory dictionary entry for every B bytes of index data. For example, the total size of the (compressed) schema-independent index for GOV2 is 62 GB. Choosing an index block size of B = 64 KB, however, does not lead to 62 GB / 64 KB ≈ 1 million

Figure 4.6 Combining dictionary and postings lists. The index is split into blocks of 72 bytes. Each entry of the in-memory dictionary is of the form (term, posting), indicating the first term and first posting in a given index block. The “#” symbols in the index data are record delimiters that have been inserted for better readability.

dictionary entries, but about ten times fewer. The reason is that frequent terms, such as “the” and “of”, require only a single in-memory dictionary entry each, even though their postings lists each consume far more disk space than 64 KB (the compressed list for “the” consumes about 1 GB).

In practice, a block size between 4 KB and 16 KB is usually sufficient to shrink the in-memory dictionary to an acceptable size, especially if dictionary compression (Section 6.4) is used to decrease the space requirements of the few remaining in-memory dictionary entries. The disk transfer overhead for this range of block sizes is less than 1 ms per query term and is rather unlikely to cause any performance problems.

Dropping the distinction between terms and postings

We may take the dictionary interleaving approach one step further by dropping the distinction between terms and postings altogether, and by thinking of the index data as a sequence of pairs of the form (term, posting). The on-disk index is then divided into fixed-size index blocks, with each block perhaps containing 64 KB of data. All postings are stored on disk, in alphabetical order of their respective terms. Postings for the same term are stored in increasing order, as before. Each term’s dictionary entry is stored on disk, potentially multiple times, so that there is a dictionary entry in every index block that contains at least one posting for the term. The in-memory data structure used to access data in the on-disk index then is a simple array, containing for each index block a pair of the form (term, posting), where term is the first term in the given block, and posting is the first posting for term in that block.

An example of this new index layout is shown in Figure 4.6 (data taken from a schema-independent index for the Shakespeare collection). In the example, a call to

next(“hurried”, 1,000,000)

would load the second block shown (starting with “hurricano”) from disk, would search the block for a posting matching the query, and would return the first matching posting (1,085,752). A call to

next(“hurricano”, 1,000,000)

would load the first block shown (starting with “hurling”), would not find a matching posting in that block, and would then access the second block, returning the posting 1,203,814.
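The two calls can be reproduced with a small sketch. The block contents below are partly invented: the postings 1,085,752 and 1,203,814 and the block-leading terms come from the example above, while the remaining postings and the block size (three pairs per block) are made up so that the toy index behaves the same way.

```python
# Sketch of next(term, current) over the combined (term, posting)
# layout: blocks of pairs sorted by (term, posting), plus an in-memory
# array holding the first pair of each block. Data is partly invented.
from bisect import bisect_right

pairs = [
    ("hurling", 912_110), ("hurling", 997_654), ("hurricano", 281_080),
    ("hurricano", 1_203_814), ("hurried", 1_085_752), ("hurried", 2_073_976),
]
PAIRS_PER_BLOCK = 3
blocks = [pairs[i:i + PAIRS_PER_BLOCK]
          for i in range(0, len(pairs), PAIRS_PER_BLOCK)]
toc = [block[0] for block in blocks]  # in-memory (term, posting) array

def next_posting(term, current):
    """Smallest posting for `term` that is greater than `current`."""
    # Binary search the in-memory array for the candidate block ...
    b = max(bisect_right(toc, (term, current)) - 1, 0)
    # ... then scan forward, block by block, until a match or overshoot.
    for block in blocks[b:]:
        for t, posting in block:
            if t == term and posting > current:
                return posting
            if t > term:
                return None  # moved past all of term's postings
    return None
```

Here next_posting("hurried", 1_000_000) goes straight to the block starting with ("hurricano", 1,203,814) and returns 1,085,752, while next_posting("hurricano", 1_000_000) first scans the block starting with "hurling", finds no matching posting, and continues into the next block, returning 1,203,814, mirroring the two example calls above.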

The combined representation of dictionary and postings lists unifies the interleaving method explained above and the self-indexing technique described in Section 4.3 in an elegant way.

With this index layout, a random access into an arbitrary term’s postings list requires only a single disk seek (we have eliminated the initialization step in which the term’s per-term index is loaded into memory). On the downside, however, the total memory consumption of the index is higher than if we employ self-indexing and dictionary interleaving as two independent techniques. A 62-GB index with block size B = 64 KB now in fact requires approximately 1 million in-memory entries.