Organizational Structure: Clustering, Ranked List, and Integration of the Two
Because tasks normally drive information retrieval, it is important to know how to design interfaces that are compatible with task structure and what the different approaches are for designing interfaces that facilitate interaction between users and IR systems. Wu, Fuller, and Wilkinson (2001) conducted a series of experiments investigating whether applying clustering and classification to present information with respect to task structure facilitated interactive retrieval. Experiment I examined whether a clustering algorithm could group the retrieved documents and whether users could select relevant clusters. The findings suggested that the clustering algorithm could group topic-relevant documents but could not separate documents with instance relevance. Experiment II compared two implemented interfaces: one based on clusters and the other on a ranked list. There was no significant difference between the two systems in average instance recall; however, for five topics, the cluster organization outperformed the list organization. The results also showed that users preferred structured presentations of a retrieved result set to a list-based approach.
Experiment III examined whether clustering assisted users in performing instance retrieval tasks. There was no significant difference in instance recall between the two interfaces. At the same time, the results suggested that there were variations in mental maps between subjects and assessors. Experiment IV explored simple document classification as a replacement for unguided clustering in the instance retrieval task. Although there was no significant difference between the classification-based interface and the ranked-list interface, searchers saved more instances on average using the classification-based interface. The findings also suggested that the organization of retrieved documents affected searchers’ perception of the documents. Searchers were more satisfied with the classification-based interface in terms of its presentation form, the retrieved data, its ease of use, and the time available for searching.
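To make the clustering idea concrete, the following is a minimal sketch of grouping a retrieved result set into topical clusters; the TF-IDF representation, k-means algorithm, and cluster count are illustrative assumptions, not the method used by Wu et al.

```python
# Hypothetical sketch: cluster a retrieved result set into topical groups.
# TF-IDF + k-means are assumptions for illustration, not Wu et al.'s method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_results(docs, n_clusters=5):
    """Return {cluster_label: [documents]} for a list of retrieved texts."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(vectors)
    clusters = {}
    for doc, label in zip(docs, labels):
        clusters.setdefault(int(label), []).append(doc)
    return clusters
```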
Taking another approach to enhancing the power of the clustering technique, Osdin, Ounis, and White (2003) designed a system (HuddleSearch) that used hierarchical clustering and summarization to help users interact with the system. Users were able to judge a cluster’s relevance before viewing its content. The experiment compared the system against a baseline that used the classical list-based approach. Even though some of the results were not statistically significant, the findings clearly showed that the experimental system outperformed the baseline, with fewer incomplete tasks and less time to accomplish tasks on average. More important, 13 of 16 users preferred HuddleSearch to the baseline, and they were more satisfied with the results provided by the experimental system.
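A hierarchical variant with simple cluster summaries, in the spirit of HuddleSearch, might look like the following sketch; the Ward linkage, cluster count, and top-term summaries are assumptions, not the system’s actual algorithms.

```python
# Hypothetical sketch: hierarchical clustering with top-term "summaries"
# so a user can judge a cluster before opening it (HuddleSearch-style).
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_clusters(docs, n_clusters=4, n_terms=5):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs).toarray()
    # Cut the Ward dendrogram into a fixed number of clusters.
    labels = fcluster(linkage(X, method="ward"),
                      t=n_clusters, criterion="maxclust")
    terms = vec.get_feature_names_out()
    summaries = {}
    for c in set(labels):
        centroid = X[labels == c].mean(axis=0)
        top = centroid.argsort()[::-1][:n_terms]
        summaries[c] = [terms[i] for i in top]  # label cluster by top terms
    return labels, summaries
```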
Overall, users preferred clustering to a ranked list for presenting retrieved results, although no statistically significant difference was found between the two systems’ performance.
Considering the tradeoffs of clustering and ranked lists, Allan, Leuski, Swan, and Byrd (2001) combined clustering with the traditional ranked list to overcome the problems of providing either alone and to gain the benefits of both techniques. They first evaluated the effectiveness of two versions of their system in the TREC 6 Interactive Track: one with and one without a visualization that combines a ranked list with clustering. There was no significant advantage to using the visualization, although the researchers observed examples where it offered valuable help. According to Allan et al. (2001), the reasons for the results could not be determined in the Interactive Track environment because the value of visualization might be obscured by other variations in users and systems.
A new system was built to incorporate interdocument similarity visualization into the ranked list. Using the TREC collection and relevance judgments, they conducted a noninteractive study evaluating the performance of the ranked list, relevance feedback, and the combination of ranked list and clustering. The results showed that the combination outperformed the ranked list. The approach was as powerful as relevance feedback but much easier for searchers to understand.
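The integration Allan et al. describe can be approximated by keeping the ranked order while tagging each document with a group derived from interdocument similarity; in this sketch, the cosine-over-TF-IDF measure and the grouping threshold are illustrative assumptions.

```python
# Hypothetical sketch: annotate a ranked list with similarity-based groups,
# preserving rank order. Threshold and similarity measure are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def annotate_ranked_list(ranked_docs, threshold=0.3):
    X = TfidfVectorizer(stop_words="english").fit_transform(ranked_docs)
    sim = cosine_similarity(X)
    labels = [-1] * len(ranked_docs)
    next_label = 0
    for i in range(len(ranked_docs)):            # walk down the ranked list
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
        for j in range(i + 1, len(ranked_docs)):
            if labels[j] == -1 and sim[i, j] >= threshold:
                labels[j] = labels[i]            # join earlier document's group
    # (rank, group, document) triples keep the original ranked order
    return list(zip(range(1, len(ranked_docs) + 1), labels, ranked_docs))
```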
In TREC 10, Craswell, Hawking, Wilkinson, and Wu (2002) further investigated the relationship between three delivery mechanisms (a ranked-list interface, a clustering interface, and an integrated interface with a ranked list, clustering structure, and expert links) and two search tasks (searching for an individual document and for a set of documents). They conducted experiments with 24 subjects divided into three groups: Group 1 subjects were informed about the characteristics of each search mechanism; Group 2 subjects were informed about the advantages of each search mechanism related to the type of task; and Group 3 subjects used only two interfaces, the ranked-list interface and the clustering interface. The researchers found no significant difference among the groups in the number of documents read. Subjects from Group 3 spent the least time when using the ranked-list interface, probably because they concentrated on one interface without distraction. Overall, search tasks did not affect the use of delivery mechanisms, and searchers tended to use only one delivery mechanism.
In TREC 11, Craswell, Hawking, Wilkinson, and Wu (2003) continued working on the organization of retrieved documents. Based on 16 subjects’ searches on two types of interfaces, they compared the traditional ranked-list delivery method with a new organizational structure that applied second-level domain labels and their corresponding organization names to classify documents retrieved from a collection of U.S. government Web documents. The new organizational structure was based on the idea that, when accessing information from an organization’s Web site, people try to match their mental model of the organization with their information needs. The results showed that subjects read more documents with the category interface than with the ranked-list interface, indicating that the category interface promoted more browsing behavior. The category interface also gathered into one category relevant documents that were scattered throughout the ranked list. Although there was no significant difference between the two delivery methods during the first 5 and 10 minutes of searching, the category interface performed significantly better at the end of 15 minutes of searching.
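The category idea can be illustrated with a short sketch that groups retrieved .gov pages by second-level domain as a proxy for the publishing organization; the domain-to-name lookup shown here is hypothetical.

```python
# Hypothetical sketch: categorize retrieved U.S. government pages by
# second-level domain (e.g., "epa" in www.epa.gov). ORG_NAMES is a
# made-up lookup standing in for the real domain-to-organization mapping.
from urllib.parse import urlparse
from collections import defaultdict

ORG_NAMES = {"epa": "Environmental Protection Agency"}  # illustrative entry

def categorize_by_domain(results):
    """results: list of (url, title) pairs from the ranked list."""
    categories = defaultdict(list)
    for url, title in results:
        host = urlparse(url).hostname or ""
        parts = host.split(".")
        sld = parts[-2] if len(parts) >= 2 else host  # label left of ".gov"
        categories[ORG_NAMES.get(sld, sld)].append((url, title))
    return categories
```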
The above studies suggest that organizational structures need to be designed around users’ mental models and tasks. Further research on the design of organizational structures needs to improve not only users’ perceptions but also their search performance.
Display Methods and Their Relationships with Interaction
An organizational structure only offers searchers an overview of the retrieved results; the display method provides an opportunity for searchers to view documents or their surrogates. Belkin et al. (2001) compared two interfaces in terms of performance, effort, and user preference to test whether each was better at supporting one of two types of tasks: comparison-type tasks and list-type tasks. One offered a Single Document Display (SDD), presenting the top 10 document titles and the text of the first document; the other provided a Multiple Document Display (MDD), presenting the titles and text of the top six documents, with each text showing the “best passage” generated by the system. The analysis of 16 subjects’ experience with the two systems indicated that, on performance and effort measures, the MDD system did not support the comparison-type task better than the SDD system, nor did the SDD system support the list-type task better than the MDD system. Overall, the MDD system had a minor advantage over the SDD system in supporting the question-answering task.
In TREC 11, Belkin et al. (2003) continued their investigation of the relationship between the amount of interaction and the level of user satisfaction with search results and search performance. Specifically, they tested the hypothesis that a search interface that directly displays the text of the ranked retrieved documents will lead to less user-system interaction than one that displays only ranked titles. Two interfaces were implemented for the study: one, MDD, offered full text with information-problem elicitation; the other, SDD, offered a list of ranked titles with regular query elicitation. The results showed that MDD led to less user interaction, and that searchers were more satisfied with the search results and saved significantly more documents when searching with MDD than with SDD, even though the two interfaces did not differ significantly in the number of complete and correct answers. The results indicated that reducing a searcher’s interaction led to a better user experience. Once again, as is common in Interactive Track studies, no statistical significance was found, but the results did identify some interesting findings that warrant further research.
Display Methods for Relevance Judgments
How to assist users in effectively evaluating the relevance of retrieved documents is a critical research topic. Robertson, Walker, and Beaulieu (2000) found that highlighting the best passages of documents enabled searchers to make relevance judgments effectively; this was especially useful for long documents and documents covering multiple topics. In TREC 8, Beaulieu et al. (2000) further examined best-passage retrieval and other related features, finding that their usefulness depended on the nature of the topics. Highlighting best passages and query terms in documents, as well as displaying query-term information in the retrieved list, helped users make relevance judgments for simple topics, but it was less useful for more complicated topics because users had to examine the content of the documents more carefully.
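A simple way to approximate best-passage selection is to slide a fixed-size window over a document and keep the span with the most query-term matches; the window size and the raw term-count score below are simplifications of the passage-scoring models these systems actually used.

```python
# Hypothetical sketch: pick the window of words with the most query-term
# hits as the "best passage." Real systems scored passages more elaborately.
import re

def best_passage(document, query, window=50):
    words = document.split()
    terms = {t.lower() for t in query.split()}
    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        score = sum(1 for w in span
                    if re.sub(r"\W", "", w).lower() in terms)
        if score > best_score:
            best_start, best_score = start, score
    return " ".join(words[best_start:best_start + window])
```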
Instead of using existing passages from a document, Alexander et al. (2001) proposed applying query-biased summaries. In TREC 9, they tested the effectiveness of query-biased summaries for question-answering tasks. The experimental system offered searchers short summaries that consisted of the main points of the original documents with respect to the query expressed by a searcher.
The findings showed that subjects performed better using the experimental system.
Although subjects found the same number of unique documents that supported the answer to a query in both systems, they spent less effort discovering these supporting documents. All subjects favored the use of summaries; however, they disliked how long it took to generate the summaries, a process that needs to be improved. Taking another approach, D’Souza et al. (2001) compared two types of summaries in two experimental systems: one used the title and the first 20 words of a document (First20); the other used the document title and the best three Answer Indicative Sentences extracted from the document (AIS3). After analyzing 16 subjects’ transaction logs and questionnaires, they concluded that the summary with the best three Answer Indicative Sentences was significantly better than the summary with the first 20 words. The AIS3 system was more effective than the First20 system in the number and quality of saved answers. Even though there was little difference in learning effort between the two systems, users perceived the AIS3 system to be more useful than the First20 system.
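An AIS3-style, query-biased summary can be sketched by scoring each sentence on query-term overlap and keeping the top three in their original order; the naive sentence splitter and overlap score are simplifying assumptions, not D’Souza et al.’s extraction method.

```python
# Hypothetical sketch of an AIS3-style summary: rank sentences by
# query-term overlap, keep the best three in document order.
import re

def query_biased_summary(document, query, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", document)
    terms = {t.lower() for t in query.split()}

    def score(s):
        return sum(1 for w in re.findall(r"\w+", s.lower()) if w in terms)

    best = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return [s for s in sentences if s in best]  # preserve original order
```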
Delivery mechanisms are essential for assisting searchers in effectively evaluating the relevance of retrieved documents. Researchers in the TREC Interactive Track tested the effectiveness and usefulness of different approaches to organizing and displaying retrieved results. These studies shed light on how to design IR systems that support users in efficiently evaluating retrieved results. The limitations of the TREC setting, tasks, and sample sizes call for enhancing these studies to improve their statistical power.