
Overview of TREC

History and Background

The Text REtrieval Conference (TREC) is sponsored by three agencies to promote text retrieval research based on large test collections: the U.S. National Institute of Standards and Technology (NIST), the U.S. Department of Defense’s Advanced Research Projects Agency (DARPA), and the U.S. intelligence community’s Advanced Research and Development Activity (ARDA). Overviews of TREC (Harman & Voorhees, 2006; Voorhees & Harman, 2005) and the TREC Web site (trec.nist.gov) have provided a comprehensive review of TREC conferences, and this section is compiled based on these resources. TREC started in 1992 with 25 participating groups, including the leading text retrieval groups, to search two gigabytes of text.

For each TREC, NIST offers a test collection and questions. Participating teams follow the guidelines, run the data on their own IR systems, and return the results to NIST. NIST evaluates the submitted results and organizes workshops for participants to discuss their experience and present results. By the end of 2005, 14 TREC conferences had been held.

According to the TREC Web site, the objective of TREC is to achieve the following four main goals:

• To encourage research in text retrieval based on large test collections;

• To increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas;

• To speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems; and

• To increase the availability of appropriate evaluation techniques for use by industry and academia, including the development of new evaluation techniques more applicable to current systems (http://trec.nist.gov/overview.html).

Types of Tracks

Table 6.1 (as shown in Voorhees, 2006, p. 7) reviews the number of participants per track and the total number of distinct participants in each TREC. Adapting and expanding Voorhees and Harman’s classification of tracks (2005, pp. 8-13), as well as examining the TREC home page (http://trec.nist.gov), the author summarizes all the tracks of TREC through 2005 and the types of tasks performed in TREC.

The tasks performed in TREC consist mainly of the following:

Static text: The Ad Hoc Track is a typical document retrieval task on a static collection of text documents. The Robust Retrieval Track reintroduces the traditional ad hoc retrieval task, but its evaluation focuses on individual topic effectiveness instead of average effectiveness.

Streaming text: The Filtering Track and the Routing Track deal with retrieving documents from a stream of text. While the Routing Track formulates the basic task, the Filtering Track goes further by requiring a binary decision about whether each incoming document should be retrieved. The Spam Track is similar to the Filtering Track but focuses more on general e-mail filtering.

Human-oriented: The Interactive Track investigates users’ interaction with IR systems, focusing on the process as well as the results. The Interactive Track, which started in TREC 3, became the interactive part of the Web track in TREC 12. Some groups joined the High Accuracy Retrieval from Documents (HARD) Track, whose purpose is to support users by providing highly accurate results tailored to specific users.

Multi-language: The Spanish, Chinese, and Cross-language tracks focus on non-English retrieval. While the Spanish and Chinese Tracks concentrate on issues related to retrieving information in Spanish and Chinese, the Cross-language Track involves research on the retrieval of documents regardless of their languages.

Table 6.1. Number of participants per track and total number of distinct participants in each TREC. From “Overview of TREC 2005” by E. M. Voorhees, The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings (p. 7). NIST Special Publication 500-266. Gaithersburg, MD: U.S. Department of Commerce, NIST.

Track                 92  93  94  95  96  97  98  99  00  01  02  03  04  05
Ad hoc                18  24  26  23  28  31  42  41   ―   ―   ―   ―   ―   ―
Routing               16  25  25  15  16  21   ―   ―   ―   ―   ―   ―   ―   ―
Interactive            ―   ―   3  11   2   9   8   7   6   6   6   ―   ―   ―
Spanish                ―   ―   4  10   7   ―   ―   ―   ―   ―   ―   ―   ―   ―
Confusion              ―   ―   ―   4   5   ―   ―   ―   ―   ―   ―   ―   ―   ―
Database merging       ―   ―   ―   3   3   ―   ―   ―   ―   ―   ―   ―   ―   ―
Filtering              ―   ―   ―   4   7  10  12  14  15  19  21   ―   ―   ―
Chinese                ―   ―   ―   ―   9  12   ―   ―   ―   ―   ―   ―   ―   ―
NLP                    ―   ―   ―   ―   4   2   ―   ―   ―   ―   ―   ―   ―   ―
Speech                 ―   ―   ―   ―   ―  13  10  10   3   ―   ―   ―   ―   ―
Cross-language         ―   ―   ―   ―   ―  13   9  13  16  10   9   ―   ―   ―
High precision         ―   ―   ―   ―   ―   5   4   ―   ―   ―   ―   ―   ―   ―
Very large corpus      ―   ―   ―   ―   ―   7   6   ―   ―   ―   ―   ―   ―   ―
Query                  ―   ―   ―   ―   ―   ―   2   5   6   ―   ―   ―   ―   ―
Question answering     ―   ―   ―   ―   ―   ―   ―  20  28  36  34  33  28  33
Web                    ―   ―   ―   ―   ―   ―   ―  17  23  30  23  27  18   ―
Video                  ―   ―   ―   ―   ―   ―   ―   ―   ―  12  19   ―   ―   ―
Novelty                ―   ―   ―   ―   ―   ―   ―   ―   ―   ―  13  14  14   ―
Genome                 ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―  29  33  41
HARD                   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―  14  16  16
Robust                 ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―  16  14  17
Terabyte               ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―  17  19
Enterprise             ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―  23
Spam                   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―   ―  13
Total participants    25  31  33  36  38  51  56  66  69  87  93  93 103 117

Multimedia formats: In the digital age, users retrieve information not limited to text; they also try to find information in multimedia formats. The Optical Character Recognition Track and the Speech Recognition Track explore how to retrieve the original data without errors or with reduced error rates. The Video Track is devoted to research in the content-based retrieval of digital video independent of text.

Web and large collection searching: The Very Large Corpus (VLC) Track evaluates the speed with which retrieval results are returned when searching a very large collection. The Terabyte Track is another very large collection track; its objective is to study whether traditional IR test-collection-based evaluation can be applied to much larger collections. The Web Track specifically examines search tasks on a collection that represents a snapshot of the World Wide Web.

Answers, not documents: The Question Answering Track works on a higher level of information retrieval. Instead of providing a set of relevant documents, question-answering systems return answers to the questions.

Domain-oriented: The Genomics Track and the Legal Track study information retrieval in a specific domain to improve retrieval effectiveness.

Organization-oriented: The Enterprise Track investigates users’ search behaviors in organizational environments.

Overview of Interactive Track

The Interactive Track explores the complexity of interactive retrieval evaluation. Hersh and Over (2001) pointed out that these studies bridged the “user-oriented” and “system-oriented” IR approaches even though they were limited by small sample sizes, small numbers of queries, laboratory settings, and less-than-ideal document collections. Over (2001) described the focuses of the Interactive Track:

1. The searcher interacting with the IR system

2. The search behavior, search process, and interim results as well as final results

3. The effects of system, topic, and searcher, and their interactions

4. The assessment of the evaluation methodology


In the special issue of Information Processing and Management dedicated to the Interactive Track, Over (2001) provided an overview of the history and development of the Interactive Track, as well as an annotated bibliography of it, covering TRECs 3-8.

Dumais and Belkin (2005) highlighted the key developments in each Interactive Track in addition to presenting general information about participants, approaches, tasks, and methods. They further illustrated the challenges and new research directions in evaluating interactive information retrieval systems in the context of TREC. Each year’s Interactive Track report (part of the overview of TRECs 3-4 and TRECs 5-12) in the annual TREC proceedings outlined detailed information about each Interactive Track’s background, design, participants, results, and discussion.

Beginning with TREC3, the Interactive Track started to gain experience with the evaluation of interactive information retrieval systems. Four groups participated in the track to test either the tools needed for the IR systems for the Interactive Track or how users interact with new techniques based on TREC3 routing topics (Harman, 1995). There were no specific protocols or guidelines for participants to follow. The objective of TREC3 was to compare the performance of interactive IR systems to fully automatic routing systems (Over, 2001). In TREC4, 11 teams involved in the Interactive Track employed a subset of the ad hoc topics (Harman, 1996). The participants followed the same guidelines for search topics, tasks, and results recording. This interactive track tested new interfaces and compared the results of interactive ad hoc searches with automatic searching while focusing on the interactive search process, behavior, results, and methodologies (Dumais & Belkin, 2005). In TREC5 and TREC6, comparison of experimental systems to a common system was a theme (Over, 1997, 1998). Two teams conducted a pilot study in TREC5, and nine groups took part in TREC6. The Interactive Track from TREC6 through TREC8 used the aspectual/instance recall task as a common task. Users had to identify as many aspects (in TREC6) or instances (in TREC7 and 8) as possible for each topic.
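To make the aspectual/instance recall measure concrete, the short sketch below computes the fraction of a topic’s known instances that are covered by the documents a searcher saved. It is only an illustration, not the official TREC scoring code; the document identifiers and judgment data are hypothetical.

```python
# Minimal sketch of aspectual/instance recall (TRECs 6-8), assuming the
# per-topic judgments map each relevant document to the instances it covers.
# Not the official TREC scoring code; all identifiers here are hypothetical.

def instance_recall(saved_docs, doc_to_instances, total_instances):
    """Fraction of a topic's known instances covered by the saved documents."""
    covered = set()
    for doc_id in saved_docs:
        covered.update(doc_to_instances.get(doc_id, set()))
    return len(covered) / total_instances if total_instances else 0.0

# Hypothetical judgments for one topic with five known instances.
doc_to_instances = {
    "FT911-1234": {"inst_1", "inst_2"},
    "FT911-5678": {"inst_2", "inst_3"},
    "LA010189-0001": {"inst_4"},
}
saved = ["FT911-1234", "LA010189-0001"]  # documents the searcher saved
print(instance_recall(saved, doc_to_instances, total_instances=5))  # 0.6
```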

While TREC6 represents the first true cross-site comparison in the Interactive Track, in TREC7 cross-site comparison was dropped because it was difficult to have a direct cross-site comparison considering the requirements of the Interactive Track. In TREC7 and TREC8, the searchers needed to save documents containing as many instances as possible within a 15-20-minute timeframe. A small set of ad hoc topics was used for TREC7 and 8, and eight and seven groups, respectively, engaged in these two TRECs (Over, 1999; Hersh & Over, 2000). In order to reduce the overall length of a search session and explore more tasks and collections, six teams participated in TREC 9 working on the fact-finding task. Some teams experimented with different document presentation interfaces. In TREC10 and TREC11, six groups did their individual experiments on Web searching (Hersh & Over, 2002, 2003). The TREC Web-track collection was used as a common collection for the comparability of results in TREC11. The Interactive Track became a subtrack of the Web track in TREC12 (Craswell et al., 2004). At the same time, some teams took part in the HARD track.

Compared with other tracks, the Interactive Track is unique in dealing with users’ interaction with IR systems, and the TREC structure is not appropriate for research on interactive IR. Dumais and Belkin (2005) identified two reasons for this fundamental problem. The first is that the TREC protocol is designed for evaluating and comparing batch searching; it is not well suited to the interactive environment. Second, while TREC is designed to compare the performance of IR systems across sites, the performance of interactive IR is affected by searcher characteristics, and searchers are limited in the number of topics they can search in each experiment. Interaction effects among searcher, topic, and system further complicate cross-site comparison.

Types of Interactive Studies

Since the Interactive Track became a subtrack of the Web track and the HARD track beginning with TREC12, the author identified five main themes that emerged from studies of the Interactive Track from TREC3 to TREC11: (1) the impact of searchers’ knowledge vs. the impact of the dimensions of tasks; (2) query formulation and reformulation: relevance feedback and query length; (3) search tactics and strategies; (4) results organization structure and delivery mechanisms; and (5) the comparison of different retrieval models and evaluation methods. This section focuses on the different approaches applied by TREC participants and the associated results of the interactive studies performed in the Interactive Track. In addition, this section also covers research on interactive multilingual/cross-language information retrieval (CLIR), mainly in the interactive track of the Cross-Language Evaluation Forum (iCLEF).

The Impact of Searchers’ Knowledge vs. the Impact of the Dimensions of Tasks

Research has demonstrated that domain knowledge affects users’ information-seeking behavior/strategies in OPACs, online databases, Web search engines, and digital libraries. In TREC10, one finding that emerged from several research groups is that domain expertise influences search behavior/strategies (Dumais & Belkin, 2005). Bhavnani (2002) identified the cognitive components of domain-specific search knowledge and their impact on search behavior in the Interactive Track.

Five information retrieval experts performed tasks within and outside their domain of expertise. The results showed that searchers applied more effective declarative and procedural components of domain-specific search knowledge when searching tasks within their domains; they employed less effective general-purpose search methods when searching tasks outside their domains. The declarative components include three types of knowledge: classification knowledge of classes of Web sites, URL knowledge, and content knowledge. The procedural components consist of two types of knowledge: sequencing knowledge, which determines a search plan, and termination knowledge, which determines the exit point in accomplishing a search task. The findings of this study demonstrated that expert users were more effective when they were able to apply domain-specific search knowledge than when they could only employ domain-general knowledge. The results also indicated that general-purpose search engines could not effectively support domain-specific search tasks.

The major contribution of this study is the identification of the cognitive components of domain-specific search knowledge, but the study is limited by its small sample. More research is needed to test the generalizability of the results.

However, the results of two other studies in TREC10 demonstrated that the domain of the task, rather than searchers’ domain knowledge, affects searchers’ perception and behavior. After analyzing 48 nonexpert participants’ searching on shopping, medicine, travel, and research topics, Toms, Kopak, Bartlett, and Freund (2002) found that the domain of the task had little effect on search results, but it did affect users’ perception of difficulty and satisfaction with results. The shopping tasks were more difficult to accomplish and less satisfying than the other tasks. Hersh, Sacherek, and Olson (2002) observed 24 experienced searchers performing searches with Web tools of their choice. They found that the domain of the task affected the searchers’ behavior; for example, searchers took the most time and the most page views for shopping tasks among all the tasks. They also reported results similar to those of Toms et al. (2002): although the differences across tasks were small, the domain of the task influenced users’ perceptions. In both studies, shopping tasks affected searchers’ perceptions or behaviors, yet this type of task was not the one that the searchers were least familiar with. In other words, searchers’ domain knowledge of shopping tasks was not the lowest, but they still found shopping tasks the most difficult among all the tasks. The question is whether the domain knowledge of a searcher, or the nature of the task itself, or both, influence searchers’ perceptions or behaviors.

In addition to domain knowledge, searchers’ spatial visualization ability and its impact on the success of searches were also explored. Even though no significant difference was found, the results indicated that searcher differences in spatial visualization ability were predictive of search success (Hersh et al., 2001; Hersh, Moy, Kraemer, Sacherek, & Olson, 2003). The nature of the TREC experiment, with its short experimental cycle and especially its small sample sizes, makes it difficult to achieve the needed statistical power.

The dimensions of tasks have been regarded as essential components of interactive information retrieval, and they have been demonstrated to be influential factors for system performance and human behavior in a variety of digital environments. In addition to the domain of the task discussed above, the level of complexity of the task and the timeframe of the task, and their relationships with the effectiveness of different interactive features of IR systems and with system performance, were also investigated in the Interactive Track. In TREC8, Beaulieu, Fowkes, Alemayehu, and Sanderson (2000) found that the impact of query expansion depended on the nature of the task. While automatic query expansion improved the results for simple topics, complex questions required interactive query expansion and contributions from both the searcher and the system, because users had to examine the documents more carefully for complicated topics. At the same time, the effectiveness of features facilitating relevance judgments, such as displaying query term information in the retrieval results and highlighting best passages and query terms in documents, was also affected by the level of complexity of the task. These features were more helpful in assisting users in making relevance judgments for simple topics than for complicated topics.

In TREC 9, Beaulieu, Fowkes, and Joho (2001) focused on the characteristics of two types of tasks and their impact on searcher and system performance. While the first type of task required searchers to find as many different instances as possible, the second type required searchers to choose a single correct answer from two possible choices. Searchers were required to accomplish each search topic within 5 minutes. After comparing their results with the overall results of the Interactive Track, they found that time and type of task were two interdependent success factors in addition to searcher characteristics and behavior. More searchers indicated that they did not have enough time to accomplish type 1 tasks than type 2 tasks. Unlike in TREC8, the searchers’ engagement with the documents was not evident because of the time limitation. It seemed more demanding for searchers to find different instances than to find a single answer within 5 minutes. The short time allowed was deemed a more important possible success factor than the complexity of the topic. Time is thus another dimension of task, in addition to the level of complexity and the domain of the task.

Query Formulation and Reformulation: Relevance Feedback and Query Length

Query formulation and reformulation is a difficult task in the interactive information retrieval process. Relevance feedback is known as one of the effective approaches to support query formulation and reformulation, and it is a main topic of research in the Interactive Track. Interactive studies explored different approaches to providing relevance feedback, such as automatic query expansion, term selection, passage feedback, explicit feedback, and implicit feedback.

Relevance Feedback: Automatic Query Expansion

Automatic query expansion is a classical approach to relevance feedback. According to Robertson, Walker, and Beaulieu (2000), the Okapi interactive experiments focused on the user search process. The objectives were to (1) support user query expansion and (2) determine how and when users engage in the search process. Query expansion in an incremental format was used in TRECs 5 and 6: the system extracted terms and automatically added them to the working query, and all the terms were reweighted whenever a searcher made a positive relevance judgment. Interestingly, the interactive experimental system (Okapi) did not perform better than the control system (ZPRISE), mainly because query expansion is more useful for finding items that are similar to, rather than different from, those already identified as relevant. At the same time, users were more satisfied with the search outcomes from the experimental system than from the control system, partly because users liked the support offered by the experimental system.
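The kind of incremental query expansion described above can be sketched roughly as follows. This is only an illustration in the spirit of the Okapi experiments, not the actual Okapi implementation: a Robertson/Sparck Jones-style relevance weight is assumed as the term-scoring function, and all function names, parameters, and example data are hypothetical.

```python
import math
from collections import Counter

# Rough sketch of incremental query expansion with positive relevance feedback.
# Assumes a Robertson/Sparck Jones-style relevance weight for scoring terms;
# this is an illustration only, not the Okapi code used in TREC.

def relevance_weight(r, R, n, N):
    """Weight of a term that occurs in r of the R judged-relevant documents
    and in n of the N documents in the collection."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def expand_and_reweight(working_query, relevant_docs, doc_freq, N, top_k=5):
    """After a positive judgment, add the top_k best new terms drawn from the
    judged-relevant documents and reweight every term in the working query."""
    R = len(relevant_docs)
    in_relevant = Counter()                 # term -> number of relevant docs containing it
    for terms in relevant_docs:
        in_relevant.update(set(terms))
    weights = {t: relevance_weight(r, R, doc_freq.get(t, 1), N)
               for t, r in in_relevant.items()}
    new_terms = [t for t, _ in sorted(weights.items(), key=lambda kv: -kv[1])
                 if t not in working_query][:top_k]
    expanded = list(working_query) + new_terms
    # Terms never seen in a relevant document keep a weight of 0.0 for now.
    return {t: weights.get(t, 0.0) for t in expanded}

# Hypothetical example: two documents judged relevant so far.
query = ["ferry", "sinking"]
judged_relevant = [["ferry", "estonia", "baltic", "sinking"],
                   ["estonia", "disaster", "baltic"]]
doc_freq = {"ferry": 120, "sinking": 300, "estonia": 40,
            "baltic": 200, "disaster": 800}
print(expand_and_reweight(query, judged_relevant, doc_freq, N=500_000))
```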

In TREC 7, Robertson, Walker, and Beaulieu (1999) conducted a three-way comparison between two versions of Okapi (one with relevance feedback and one without) and a control system (ZPRISE). The findings of TREC 7 echoed the TREC 6 results. Okapi with relevance feedback outperformed Okapi without relevance feedback on both precision and recall. However, the control system (ZPRISE) achieved better results than Okapi with relevance feedback; even though recall was marginally better, the main difference was in precision. In TREC 8, a comparison of Okapi with and without relevance feedback revealed that Okapi with relevance feedback was marginally better in precision but worse in recall than Okapi without relevance feedback (Beaulieu et al., 2000). Interestingly, Beaulieu et al. (2000) found that the results depended on the complexity of the search topics; more specifically, automatic query expansion could improve the results for simple, straightforward topics, while interactive query expansion plus both system and user contributions were needed for complicated topics.
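For reference, the set-based precision and recall figures used in comparisons like these can be computed as in the short sketch below; the retrieved run and the relevance judgments shown are hypothetical.

```python
# Set-based precision and recall for one topic. Illustrative only;
# the retrieved set and relevance judgments below are hypothetical.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

run_with_feedback = ["d1", "d3", "d4", "d7"]
judged_relevant = ["d1", "d2", "d3", "d5", "d7"]
print(precision_recall(run_with_feedback, judged_relevant))  # (0.75, 0.6)
```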

Relevance Feedback: Term Selection

Belkin et al. (2001) suggested a new, revisionist model of relevance feedback that takes account of people’s information-seeking behaviors in interactive IR. Important terms that occur in negatively judged documents but not in positively judged documents are considered indicators of an inappropriate topic. While in TREC 5 relevance feedback was accomplished automatically, and the results showed that searchers