1.4 Test Collections
<DOC>
<DOCNO> LA051990-0141 </DOCNO>
<HEADLINE> COUNCIL VOTES TO EDUCATE DOG OWNERS </HEADLINE>
<P>
The City Council stepped carefully around enforcement of a dog-curbing ordinance this week, vetoing the use of police to enforce the law.
</P>
. . .
</DOC>
Figure 1.7 Example TREC document (LA051990-0141) from disk 5 of the TREC CDs.
1.4.1 TREC Tasks
Basic search tasks — in which systems return a ranked list from a static set of documents using previously unseen topics — are referred to as “adhoc” tasks in TREC jargon (often written as one word). Along with a set of documents, a test collection for an adhoc task includes sets of topics, from which queries may be created, and sets of relevance judgments (known as “qrel files” or just “qrels”), indicating documents that are relevant or not relevant to each topic. Over the history of TREC, adhoc tasks have been a part of tracks with a number of different research themes, such as Web retrieval or genomic IR. Despite differences in themes, the organization and operation of an adhoc task is basically the same across the tracks and is essentially unchanged since the earliest days of TREC.
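For concreteness, the following Python sketch reads a qrels file into a nested dictionary. It assumes the standard four-column format used by NIST's evaluation tools (topic number, iteration, document identifier, judgment, whitespace-separated); the filename in the usage line is purely illustrative.

from collections import defaultdict

def load_qrels(path):
    """Load a qrels file into {topic: {docno: judgment}}.

    Assumes the standard four-column TREC format:
    <topic> <iteration> <docno> <judgment>, whitespace-separated.
    """
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, judgment = line.split()
            qrels[topic][docno] = int(judgment)
    return qrels

# Illustrative usage: count the documents judged relevant for topic 426.
qrels = load_qrels("qrels.adhoc.txt")  # hypothetical filename
relevant = sum(1 for j in qrels["426"].values() if j > 0)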
Document sets for older TREC adhoc tasks (before 2000) were often taken from a set of 1.6 million documents distributed to TREC participants on five CDs. These disks contain selections of newspaper and newswire articles from publications such as the Wall Street Journal and the LA Times, and documents published by the U.S. federal government, such as the Federal Register and the Congressional Record. Most of these documents are written and edited by professionals reporting factual information or describing events.
Figure 1.7 shows a short excerpt from a document on disk 5 of the TREC CDs. The document appeared as a news article in the LA Times on May 19, 1990. For the purposes of TREC experiments, it is marked up in the style of XML. Although the details of the tagging schemes vary across the TREC collections, all TREC documents adhere to the same tagging convention for identifying document boundaries and document identifiers. Every TREC document is surrounded by <DOC>...</DOC> tags; <DOCNO>...</DOCNO> tags indicate its unique identifier. This identifier is used in qrels files when recording judgments for the document. This convention simplifies the indexing process and allows collections to be combined easily. Many research IR systems provide out-of-the-box facilities for working with documents that follow this convention.
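As an illustration of how simple this convention makes parsing, here is a minimal Python sketch that streams a TREC-formatted file and yields each document's identifier together with its raw text. The file path, the encoding, and the index() call are assumptions for the example; actual collections and indexing systems vary.

import re

def trec_documents(path):
    """Yield (docno, text) pairs from a TREC-formatted file."""
    buffer = []
    inside = False
    # Collections are large, so stream line by line rather than
    # reading the whole file; the encoding here is an assumption.
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith("<DOC>"):
                inside, buffer = True, []
            elif line.startswith("</DOC>"):
                inside = False
                text = "".join(buffer)
                match = re.search(r"<DOCNO>\s*(\S+)\s*</DOCNO>", text)
                if match:
                    yield match.group(1), text
            elif inside:
                buffer.append(line)

# for docno, text in trec_documents("latimes"):  # path is illustrative
#     index(docno, text)                          # index() is hypothetical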
<top>
<num> Number: 426
<title> law enforcement, dogs
<desc> Description:
Provide information on the use of dogs worldwide for law enforcement purposes.
<narr> Narrative:
Relevant items include specific information on the use of dogs during an operation. Training of dogs and their handlers are also relevant.
</top>
Figure 1.8 TREC topic 426.
Document sets for newer TREC adhoc tasks are often taken from the Web. Until 2009, the largest of these was the 426GB GOV2 collection, which contains 25 million Web pages crawled from sites in the U.S. government's gov domain in early 2004. This crawl attempted to reach as many pages as possible within the gov domain, and it may be viewed as a reasonable snapshot of that domain within that time period. GOV2 contains documents in a wide variety of formats and lengths, ranging from lengthy technical reports in PDF to pages of nothing but links in HTML. GOV2 formed the document set for the Terabyte Track from TREC 2004 until the track was discontinued at the end of 2006. It also formed the collection for the Million Query Track at TREC 2007 and 2008.
Although the GOV2 collection is substantially larger than any previous TREC collection, it is still orders of magnitude smaller than the collections managed by commercial Web search engines. TREC 2009 saw the introduction of a billion-page Web collection, known as the ClueWeb09 collection, providing an opportunity for IR researchers to work on a scale comparable to commercial Web search.3
3 boston.lti.cs.cmu.edu/Data/clueweb09

For each year that a track operates an adhoc task, NIST typically creates 50 new topics. Participants are required to freeze development of their systems before downloading these topics. After downloading the topics, participants create queries from them, run these queries against the document set, and return ranked lists to NIST for evaluation.
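The ranked lists returned to NIST follow a simple line-oriented format understood by NIST's evaluation tools: topic number, the literal string Q0, document identifier, rank, score, and a tag identifying the run. A minimal sketch of writing such a run file follows; the function name, run tag, and filenames are illustrative.

def write_run(results, run_tag, path):
    """Write ranked results as a TREC run file.

    results maps a topic number to a list of (docno, score) pairs,
    sorted by decreasing score.
    """
    with open(path, "w") as f:
        for topic, ranking in sorted(results.items()):
            for rank, (docno, score) in enumerate(ranking, start=1):
                f.write(f"{topic} Q0 {docno} {rank} {score:.4f} {run_tag}\n")

# write_run({"426": [("LA051990-0141", 12.3)]}, "myRun", "run.txt")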
A typical TREC adhoc topic, created for TREC 1999, is shown in Figure 1.8. Like most TREC topics, it is structured into three parts, describing the underlying information need in several forms. The title field is designed to be treated as a keyword query, similar to a query that might be entered into a search engine. The description field provides a longer statement of the topic requirements, in the form of a complete sentence or question. It, too, may be used as a query, particularly by research systems that apply natural language processing techniques as part of retrieval. The narrative, which may be a full paragraph in length, supplements the other two fields and provides additional information required to specify the nature of a relevant document. The narrative field is primarily used by human assessors, to help determine if a retrieved document is relevant or not.
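The three fields can be pulled out of a topic file with a few regular expressions. The sketch below assumes exactly the layout shown in Figure 1.8 (field tags are not closed, so each field runs until the next tag); older and newer TREC topics vary in their tagging details.

import re

def parse_topics(path):
    """Parse a TREC topic file into {number: {title, desc, narr}}."""
    with open(path) as f:
        text = f.read()
    topics = {}
    for top in re.findall(r"<top>(.*?)</top>", text, re.DOTALL):
        num = re.search(r"<num>\s*Number:\s*(\d+)", top).group(1)
        title = re.search(r"<title>\s*(.*?)\s*<desc>", top, re.DOTALL).group(1)
        desc = re.search(r"<desc>\s*Description:\s*(.*?)\s*<narr>",
                         top, re.DOTALL).group(1)
        narr = re.search(r"<narr>\s*Narrative:\s*(.*?)\s*$",
                         top, re.DOTALL).group(1)
        topics[num] = {"title": title, "desc": desc, "narr": narr}
    return topics

# For the file containing Figure 1.8, topics["426"]["title"] would yield
# the keyword query "law enforcement, dogs".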
Most retrieval experiments in this book report results over four TREC test collections based on two document sets, a small one and a larger one. The small collection consists of the documents from disks 4 and 5 of the TREC CDs described above, excluding the documents from the Congressional Record. It includes documents from the Financial Times, the U.S. Federal Register, the U.S. Foreign Broadcast Information Service, and the LA Times. This document set, which we refer to as TREC45, was used for the main adhoc task at TREC 1998 and 1999.
In both 1998 and 1999, NIST created 50 topics with associated relevance judgments over this document set. The 1998 topics are numbered 351–400; the 1999 topics are numbered 401–450. Thus, we have two test collections over the TREC45 document set, which we refer to as TREC45 1998 and TREC45 1999. Although there are minor differences between our experimental procedure and that used in the corresponding TREC experiments (which we will ignore), our experimental results reported over these collections may reasonably be compared with the published results at TREC 1998 and 1999.
The larger of the two document sets used in our experiments is the GOV2 corpus mentioned previously. We take this set together with topics and judgments from the TREC Terabyte Track in 2004 (topics 701–750) and 2005 (751–800) to form the GOV2 2004 and GOV2 2005 collections. Experimental results reported over these collections may reasonably be compared with the published results for the Terabyte Track of TREC 2004 and 2005.

Table 1.2 Summary of the test collections used for many of the experiments described in this book.

Document Set    Number of Docs    Size (GB)    Year    Topics
TREC45          0.5 million       2            1998    351–400
                                               1999    401–450
GOV2            25.2 million      426          2004    701–750
                                               2005    751–800
Table 1.2 summarizes our four test collections. The TREC45 collection may be obtained from the NIST Standard Reference Data Products Web page as Special Databases 22 and 23.4 The GOV2 collection is distributed by the University of Glasgow.5 Topics and qrels for these collections may be obtained from the TREC data archive.6
4 www.nist.gov/srd
5 ir.dcs.gla.ac.uk/test_collections
6 trec.nist.gov