
The Harvest architecture


Harvest was a research project in distributed searching, led by Michael Schwartz who was then at the University of Colorado. Although the project ended in 1996, the architectural ideas that it developed remain highly relevant. The underlying concept is to take the principal functions that are found in a centralized search system and divide them into separate subsystems. The project defined formats and protocols for communication among these subsystems, and implemented software to demonstrate their use.

A central concept of Harvest is a gatherer. This is a program that collects indexing information from digital library collections. Gatherers are most effective when they are installed on the same system as the collections. Each gatherer extracts indexing information from the collections and transmits it in a standard format and protocol to programs called brokers. A broker builds a combined index with information about many collections.
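The division of labor can be illustrated with a short sketch. The following Python fragment is a minimal, hypothetical model of the gatherer and broker roles, assuming an in-memory hand-off in place of Harvest's real formats and protocols; the class names and record fields are illustrative and are not part of the Harvest software.

# Minimal, hypothetical sketch of the gatherer/broker division of labor.
# The class names, record fields, and in-memory hand-off are illustrative
# assumptions; they are not the Harvest implementation or its protocols.

from collections import defaultdict

class Gatherer:
    """Runs alongside one collection and extracts indexing records."""

    def __init__(self, collection_id, documents):
        self.collection_id = collection_id
        self.documents = documents  # {url: full text}

    def gather(self):
        """Yield one summary record per document in the collection."""
        for url, text in self.documents.items():
            yield {"collection": self.collection_id,
                   "url": url,
                   "terms": set(text.lower().split())}

class Broker:
    """Builds a combined index from the records of many gatherers."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> {(collection, url)}

    def ingest(self, records):
        for record in records:
            for term in record["terms"]:
                self.index[term].add((record["collection"], record["url"]))

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))

broker = Broker()
broker.ingest(Gatherer("medicine", {"http://a/1": "clinical trial data"}).gather())
broker.ingest(Gatherer("physics", {"http://b/2": "particle collision data"}).gather())
print(broker.search("data"))  # one combined index answers for both collections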

The Harvest architecture is much more efficient in its use of network resources than indexing methods that rely on web crawlers, and the team developed caches and methods of replication for added efficiency, but the real benefit is better searching and information discovery. All gatherers transmit information in a specified format, the Summary Object Interchange Format (SOIF), but how they gather that information can be tailored to the individual collections. While web crawlers operate only on open access information, gatherers can be given access privileges to index restricted collections. They can be configured for specific databases and need not be restricted to web pages or any specific format. They can incorporate dictionaries or lexicons for specialized topic areas. In combination, these are major advantages.
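A SOIF record is a typed template: a header naming the template type and the object's URL, followed by attribute-value lines in which each value is preceded by its length in bytes. The small Python function below serializes one record in that general style; the attribute names and the document it describes are invented for illustration.

# Sketch of one indexing record in the general SOIF style: an @TYPE
# header carrying the object's URL, then attribute{byte-count}: value
# lines, then a closing brace. The attribute names are illustrative.

def soif_record(url, attributes):
    """Serialize one object summary as a SOIF-style template."""
    lines = ["@FILE { " + url]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))  # value length in bytes
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines)

print(soif_record("http://library.example.edu/report42.html",
                  {"Title": "Annual Report",
                   "Author": "J. Smith",
                   "Type": "HTML"}))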

Many benefits of the Harvest architecture are lost if the gatherer is not installed locally, with the digital library collections. For this reason, the Harvest architecture is particularly effective for federated digital libraries. In a federation, each library can run its own gatherer and transmit indexing information to brokers that build consolidated indexes for the entire library, combining the benefits of local indexing with a central index for users.

Another area of research is to develop methods for restricting searches to the most promising collections. Users rarely want to search every source of information on the Internet. They want to search specific categories, such as monograph catalogs, or indexes to medical research. Therefore, some means is needed for collections to provide summaries of their contents. This is particularly important where access is limited by authentication or payment mechanisms. If open access is provided to a source, an external program can, at least in theory, generate a statistical profile of the types of material and the vocabulary used. When an external user has access only through a search interface such analysis is not possible.
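As a rough illustration of how such summaries could drive source selection, the sketch below scores collections against a query using only published summary statistics. The profiles, the collections, and the simple additive scoring rule are all assumptions made for this example, not a published algorithm.

# Sketch of source selection from statistical collection profiles.
# Each collection publishes only summary statistics: its size and
# per-term document frequencies. A client scores collections against
# a query; the additive df/size score is a deliberately simple choice.

profiles = {
    "monograph-catalog": {"size": 500_000,
                          "df": {"novel": 42_000, "aspirin": 15}},
    "medical-index":     {"size": 9_000_000,
                          "df": {"novel": 3_100, "aspirin": 61_000}},
}

def score(profile, query_terms):
    """Estimate how promising a collection is for the query."""
    return sum(profile["df"].get(term, 0) / profile["size"]
               for term in query_terms)

query = ["aspirin"]
ranked = sorted(profiles, key=lambda c: score(profiles[c], query),
                reverse=True)
print(ranked)  # ['medical-index', 'monograph-catalog']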

Luis Gravano, while at Stanford University, studied how a client can combine results from separate search services. He developed a protocol, known as STARTS, for this purpose. This was a joint project between Stanford University and several leading Internet companies. The willingness with which the companies joined in the effort shows that they see the area as fundamentally important to broad searching across the Internet. A small amount of standardization would lead to greatly improved searching.

In his analysis, Gravano viewed the information on the Internet as a large number of collections of materials, each organized differently and each with its own search engine. The fundamental concept is to enable clients to discover broad characteristics of the search engines and the collections that they maintain. The challenge is that the search engines are different and the collections have different characteristics. The difficulty is not simply that the interfaces have varying syntaxes, so that a query has to be reformulated before being submitted to different systems. The underlying algorithms are fundamentally different. Some use Boolean methods; others have methods of ranking results. Search engines that return a ranked list give little indication of how the ranks were calculated. Indeed, the ranking algorithm is often a trade secret. As a result, it is impossible to merge ranked lists from several sources into a single, overall list with sensible ranking. The rankings are strongly affected by the words used in a collection, so that, even when two sources use the same ranking algorithm, merging their results is fraught with difficulty. The STARTS protocol enables the search engines to report characteristics of their collections and the ranks that they generate, so that a client program can attempt to combine results from many sources.
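The following sketch shows the kind of client-side merging that such reported characteristics make possible. Each source is assumed to report the score range behind its ranked list, and the client rescales scores onto a common [0, 1] interval before interleaving them; min-max rescaling is an illustrative choice, not something the STARTS specification mandates.

# Sketch of client-side merging of ranked lists from several sources.
# Each source reports, alongside its results, the score range it used;
# the client rescales raw scores into [0, 1] before interleaving.
# Min-max rescaling is an illustrative assumption, not part of STARTS.

def normalize(results, lo, hi):
    """Map a source's raw scores onto a common [0, 1] scale."""
    span = (hi - lo) or 1.0  # guard against a degenerate range
    return [(url, (score - lo) / span) for url, score in results]

source_a = {"range": (0.0, 10.0),
            "hits": [("http://a/1", 9.0), ("http://a/2", 4.0)]}
source_b = {"range": (0.0, 1.0),
            "hits": [("http://b/7", 0.95), ("http://b/3", 0.20)]}

merged = (normalize(source_a["hits"], *source_a["range"])
          + normalize(source_b["hits"], *source_b["range"]))
merged.sort(key=lambda hit: hit[1], reverse=True)
print(merged)  # b/7 (0.95) first, then a/1 (0.90), a/2, b/3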

Beyond searching

Information discovery is more than searching. Most individuals use some combination of browsing and systematic searching. Chapter 10 discussed the range of requirements that users have when looking for information and the difficulty of evaluating the effectiveness of information retrieval in an interactive session with the user in the loop. All these problems are aggravated in distributed digital libraries.

Browsing has always been an important way to discover information in libraries. It can be as simple as going to the library shelves to see what books are stored together.

A more systematic approach is to begin with one item and then move to the items that it refers to. Most journal articles and some other materials include lists of references to other materials. Following these citations is an essential part of research, but it is a tedious task when the materials are physical objects that must be retrieved one at a time. With hyperlinks, following references becomes straightforward. A gross generalization is that following links and references is easier in digital libraries, but the quality of catalogs and indexes is higher in traditional libraries. Therefore, browsing is likely to be relatively more important in digital libraries.
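In a digital library, following references amounts to a walk over a citation graph, as in the small sketch below; the graph is a hypothetical in-memory stand-in for hyperlinked library items.

# Sketch of browsing by following references: a breadth-first walk
# over a citation graph, starting from one item. The graph is a
# hypothetical stand-in for hyperlinked digital library items.

from collections import deque

references = {
    "paper-A": ["paper-B", "paper-C"],
    "paper-B": ["paper-D"],
    "paper-C": ["paper-D", "paper-E"],
}

def browse(start, depth=2):
    """Collect items reachable within `depth` reference hops."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        item, hops = queue.popleft()
        if hops == depth:
            continue
        for ref in references.get(item, []):
            if ref not in seen:
                seen.add(ref)
                queue.append((ref, hops + 1))
    return seen

print(sorted(browse("paper-A")))  # the start item plus everything two hops away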

If people follow a heuristic combination of browsing and searching, using a variety of sources and search engines, what confidence can they have in the results? This chapter has already described the difficulties of comparing results obtained from searching different sets of information and of deciding whether two items found in different sources are duplicates of the same information. For the serious user of a digital library there is a more subtle but potentially more serious problem. It is often difficult to know how comprehensive a search has been. A user who searches a central database, such as the National Library of Medicine's Medline system, can be confident of searching every record indexed in that system. Contrast this with a distributed search of a large number of datasets. What is the chance of missing important information because one dataset is behind the others in supplying indexing information, or fails to reply to a search request?

Overall, distributed searching epitomizes the current state of digital libraries. From one viewpoint, every technique has serious weaknesses, the technical standards have not emerged, the understanding of user needs is embryonic, and organizational difficulties are pervasive. Yet, at the same time, enormous volumes of material are accessible on the Internet, web search programs are freely available, and federations and commercial services are expanding rapidly. By an intelligent combination of searching and browsing, motivated users can usually find the information they seek.

Chapter 12

Object models, identifiers, and structural metadata
