
Distributed information discovery

Distributed computing and interoperability


This chapter is about distributed information retrieval: the search for information that is spread across many computer systems. This is part of the broad challenge of interoperability, which is an underlying theme of this entire book. Therefore, the chapter begins with a discussion of the general issues of interoperability.

The digital libraries of the world are managed by many organizations, with different management styles, and different attitudes to collections and technology. Few libraries have more than a small percentage of the materials that users might want. Hence, users need to draw from collections and services provided by many different sources.

How does a user discover and have access to information when it could be drawn from so many sources?

The technical task of coordinating separate computers so that they provide a coherent service is called distributed computing. Distributed computing requires that the various computers share some technical standards. With distributed searching, for example, a user might want to search many independent collections with a single query, compare the results, choose the most promising, and retrieve selected materials from the collections. Beyond the underlying networking standards, this requires some method of identifying the collections, conventions for formulating the query, techniques for submitting it, means of returning results so that they can be compared, and methods for obtaining the items that are discovered. The standards may be formal standards, blessed by official standards bodies, or they may be local standards developed by a small group of collaborators, or agreements to use specific commercial products. However, distributed computing is not possible without some shared standards.
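As a rough illustration of this pattern, the sketch below fans a single query out to two hypothetical collections and merges the results into one ranked list. The collection names, their search functions, and the relevance scores are invented for the example; a real system would exchange the query and the returned records over an agreed protocol.

```python
# Sketch: fan one query out to several independent collections and merge the results.
# The collections, their interfaces, and the scores are hypothetical.

from dataclasses import dataclass

@dataclass
class Hit:
    collection: str   # which collection returned the record
    identifier: str   # identifier used to retrieve the full item later
    title: str
    score: float      # collection-assigned relevance score

def search_collection_a(query: str) -> list[Hit]:
    # Stand-in for a remote search service; a real system would send the
    # query over a shared protocol and parse the records that come back.
    return [Hit("A", "a:101", "Interoperability for digital libraries", 0.9)]

def search_collection_b(query: str) -> list[Hit]:
    return [Hit("B", "b:77", "Distributed computing standards", 0.7)]

def distributed_search(query: str) -> list[Hit]:
    """Send the same query to every collection and merge the results."""
    results: list[Hit] = []
    for search in (search_collection_a, search_collection_b):
        results.extend(search(query))
    return sorted(results, key=lambda hit: hit.score, reverse=True)

if __name__ == "__main__":
    for hit in distributed_search("interoperability"):
        print(hit.collection, hit.identifier, hit.title)
```

The merge step quietly assumes that scores returned by different collections are comparable, which is exactly the kind of semantic agreement discussed above.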

An ideal approach would be to develop a comprehensive set of standards that all digital libraries would adopt. This concept fails to recognize the costs of adopting standards, especially during times of rapid change. Digital libraries are in a state of flux. Every library is striving to improve its collections, services, and systems, but no two libraries are the same. Altering part of a system to support a new standard is time-consuming. By the time the alteration is completed there may be a new version of the standard, or the community may be pursuing some other direction. Comprehensive standardization is a mirage.

The interoperability problem of digital libraries is to develop distributed computing systems in a world where the various computers are independently operated and technically dissimilar. Technically, this requires formats, protocols, and security systems so that messages can be exchanged. It also requires semantic agreements on the interpretation of the messages. These technical aspects are hard, but the central challenge is to find approaches that independent digital libraries have incentives to incorporate. Adoption of shared methods provides digital libraries with extra functionality, but shared methods also bring costs. Sometimes the costs are directly financial: the purchase of equipment and software, hiring and training staff. More often the major costs are organizational. Rarely can one aspect of a digital library be changed in isolation. Introducing a new standard requires inter-related changes to existing systems, altered work flow, changed relationships with suppliers, and so on.

Figure 11.1. Strategies for distributed searching: function versus cost of adoption

Figure 11.1 shows a conceptual model that is useful in thinking about interoperability; in this instance it is used to compare three methods of distributed searching. The horizontal axis of the figure indicates the functionality provided by various methods. The vertical axis indicates the costs of adopting them. The ideal methods would be at the bottom right of the graph: high functionality at low cost. The figure shows three particular methods of distributed searching, each of which is discussed later in this chapter. The web search programs have moderate functionality; they are widely used because they have low costs of adoption. Online catalogs based on MARC cataloguing and the Z39.50 protocol have much more function, but, because the standards are complex, they are less widely adopted. The NCSTRL system lies between them in both function and cost of adoption.

More generally, it is possible to distinguish three broad classes of methods that might be used for interoperability.

• Most of the methods that are in widespread use for interoperability today have moderate function and low cost of adoption. The main web standards, HTML, HTTP, and URLs, have these characteristics. Their simplicity has led to wide adoption, but limits the functions that they can provide.

• Some high-end services provide great functionality, but are costly to adopt. Z39.50 and SGML are examples. Such methods are popular in restricted communities, where the functionality is valued, but have difficulty penetrating broader communities, where the cost of adoption becomes a barrier.

• Many current developments in digital libraries are attempts to find the middle ground: substantial functionality with moderate costs of adoption. Examples include the Dublin Core, XML, and Unicode. In each instance, the designers have paid attention to providing a moderate-cost route for adoption. Dublin Core allows every field to be optional. Unicode provides UTF-8, which accepts existing ASCII data. XML reduces the cost of adoption by its close relationships with both HTML and SGML. (Two of these low-cost routes are sketched in the example that follows this list.)
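The sketch below makes two of those low-cost routes concrete: plain ASCII text is already valid UTF-8, and a Dublin Core description remains valid however many of its elements are omitted. The record shown is invented for illustration and is not drawn from any real collection.

```python
# UTF-8 accepts existing ASCII data: ASCII text encodes to exactly the same bytes.
ascii_text = "Digital Libraries"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Dublin Core allows every field to be optional: a record carrying only two of the
# fifteen elements is still a usable description. (Illustrative record only.)
record = {
    "title": "Distributed information discovery",
    "creator": "Unknown",
    # "subject", "date", "format", ... simply omitted
}
for element, value in record.items():
    print(f"dc:{element} = {value}")
```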

The figure has no scale and the dimensions are only conceptual, but it helps to convey the fundamental principle that the costs of adopting new technology are a factor in every aspect of interoperability. Technology can never be considered by itself, without studying the organizational impact. When the objective is to interoperate with others, the creators of a digital library are often faced with a choice between using the methods that are best for their particular community and adopting more generally accepted standards, even though these offer less functionality.

New versions of software illustrate this tension. A new version will often provide the digital library with more function, but fewer users will have access to it. The creator of a web site can use the most basic HTML tags, a few well-established formats, and the services provided by every version of HTTP; this results in a simple site that can be accessed by every browser in the world. Alternatively, the creator can choose the latest version of web technology, with Java applets, HTML frames, built-in security, style sheets, and audio and video inserts. This will provide superior service to those users who have high-speed networks and the latest browsers, but may be unusable by others.

Web search programs

The most widely used systems for distributed searching are the web search programs, such as Infoseek, Lycos, Altavista, and Excite. These are automated systems that provide an index to materials on the Internet. On the graph in Figure 11.1, they provide moderate function with low barriers to use: web sites need take no special action to be indexed by the search programs, and the only cost to the user is the tedium of looking at the advertisements. The combination of respectable function with almost no barriers to use makes the web search programs extremely popular.

Most web search programs have the same basic architecture, though with many differences in their details. The notable exception is Yahoo, which has its roots in a classification system. The other systems have two major parts: a web crawler, which builds an index of material on the Internet, and a retrieval engine, which allows users on the Internet to search the index.

Web crawlers

The basic way to discover information on the web is to follow hyperlinks from page to page. A web indexing program follows hyperlinks continuously and assembles a list of the pages that it finds. Because of the manner in which the indexing programs traverse the Internet, they are often called web crawlers.

A web crawler builds an ever-increasing index of web pages by repeating a few basic steps. Internally, the program maintains a list of the URLs known to the system, whether or not the corresponding pages have yet been indexed. From this list, the crawler selects the URL of an HTML page that has not been indexed. The program retrieves this page and brings it back to a central computer system for analysis. An automatic indexing program examines the page and creates an index record for it, which is added to the overall index. Hyperlinks from the page to other pages are extracted; those that are new are added to the list of URLs for future exploration.
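A minimal sketch of that cycle appears below, using only the Python standard library. The "index record" here is simply the page title, and the parsing is deliberately crude; a production crawler would be far more careful about robots.txt, politeness, and error handling.

```python
# Sketch of the basic crawling cycle: select an unindexed URL, fetch the page,
# add a record to the index, and queue any newly discovered hyperlinks.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and the targets of <a href=...> tags."""
    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(seed_urls: list[str], limit: int = 10) -> dict[str, str]:
    known = list(seed_urls)        # all URLs known to the system
    indexed: dict[str, str] = {}   # URL -> index record (here, just the title)
    while known and len(indexed) < limit:
        url = known.pop(0)                       # select an unindexed URL
        if url in indexed:
            continue
        try:                                     # retrieve the page
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                             # unreachable pages are skipped
        parser = LinkAndTitleParser()
        parser.feed(page)
        indexed[url] = parser.title.strip()      # add a record to the index
        for link in parser.links:                # extract hyperlinks
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in indexed:
                known.append(absolute)           # new URLs for future exploration
    return indexed

# Example: print(crawl(["http://example.org/"]))
```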

Behind this simple framework lie many variations and some deep technical problems. One problem is deciding which URL to visit next. At any given moment, the web crawler has millions of unexplored URLs, but little information on which to base its choice. Possible criteria might include currency, how many other URLs link to the page, whether it is a home page or a page deep within a hierarchy, whether it references a CGI script, and so on.
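One way such criteria might be combined is sketched below as a simple scoring function; the particular weights are invented for the example and are not taken from any actual crawler.

```python
from urllib.parse import urlparse

def selection_score(url: str, inlink_count: int) -> float:
    """Heuristic priority for an unexplored URL; higher scores are crawled sooner.
    The weights are illustrative only."""
    parsed = urlparse(url)
    depth = parsed.path.count("/")              # pages deep in a hierarchy score lower
    score = 1.0 * inlink_count - 0.5 * depth    # reward pages many others link to
    if parsed.path in ("", "/"):                # likely a home page
        score += 2.0
    if "cgi-bin" in parsed.path or "?" in url:  # references a CGI script
        score -= 3.0
    return score

# Example: pick the next URL from a small frontier with known in-link counts.
frontier = {"http://example.org/": 12,
            "http://example.org/cgi-bin/search?q=x": 3,
            "http://example.org/a/b/c/page.html": 1}
print(max(frontier, key=lambda u: selection_score(u, frontier[u])))
```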

The biggest challenges concern indexing. Web crawlers rely on automatic indexing methods to build their indexes and create records to present to users, a topic discussed in Chapter 10. The programs are faced with automatic indexing at its most basic: millions of pages, created by thousands of people, with different concepts of how information should be structured. Typical web pages provide meager clues for automatic indexing. Some creators and publishers are even deliberately misleading; they fill their pages with terms that are likely to be requested by users, hoping that their pages will be ranked highly against common search queries. Without better-structured pages or systematic metadata, the quality of the index records will never be high, but they are adequate for simple retrieval.
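In its most basic form, the indexing step might look something like the sketch below: extract whatever terms the page text offers and record which pages contain them. The tokenization and the tiny stop list are simplifications for illustration.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}   # tiny illustrative stop list

def extract_terms(page_text: str) -> list[str]:
    """Crude term extraction: lower-cased words, minus a few stop words.
    Typical pages offer little more than this to work with."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    return [word for word in words if word not in STOP_WORDS]

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: term -> the set of URLs whose pages contain that term."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for term in extract_terms(text):
            index[term].add(url)
    return index

# Invented pages, standing in for text retrieved by the crawler.
pages = {"http://example.org/1": "Digital libraries and distributed searching",
         "http://example.org/2": "Searching the web with a crawler"}
print(sorted(build_index(pages)["searching"]))
```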

Searching the index

The web search programs allow users to search the index, using information retrieval methods of the kind described in Chapter 10. The indexes are organized for efficient searching by large numbers of simultaneous users. Since the index records themselves are of low quality and the users are likely to be untrained, the search programs follow the strategy of identifying all records that vaguely match the query and supplying them to the user in some ranked order.
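That match-loosely-then-rank strategy is sketched below against a tiny invented index of the kind the previous sketch builds; the score, a simple count of matching query terms, stands in for the more elaborate ranking methods of Chapter 10.

```python
def rank_search(query: str, index: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Return every page that matches any query term at all, ranked by how many
    distinct query terms it contains."""
    scores: dict[str, int] = {}
    for term in query.lower().split():
        for url in index.get(term, set()):
            scores[url] = scores.get(url, 0) + 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# An invented index of the kind the previous sketch builds.
index = {"distributed": {"http://example.org/1"},
         "searching":   {"http://example.org/1", "http://example.org/2"},
         "crawler":     {"http://example.org/2"}}
print(rank_search("distributed searching", index))
# -> [('http://example.org/1', 2), ('http://example.org/2', 1)]
```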

Most users of web search programs would agree that they are remarkable programs, but that they have several significant difficulties. The ranking algorithms have little information on which to base their decisions. As a result, the programs may give high ranks to pages of marginal value; important materials may be far down the list and trivial items at the top. The index programs have difficulty recognizing items that are duplicates, though they attempt to group similar items; since similar items tend to rank together, the programs often return long lists of almost identical items. One interesting approach to ranking is to use link counts. Panel 11.1 describes Google, a search system that has used this approach. It is particularly effective in finding introductory or overview material on a topic.
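The sketch below shows the link-counting idea in a very simplified iterative form (a toy PageRank-style calculation, not Google's actual algorithm); the link graph is invented, but it illustrates why an overview page that many other pages point to tends to rise to the top.

```python
def link_rank(links: dict[str, list[str]], iterations: int = 20,
              damping: float = 0.85) -> dict[str, float]:
    """Toy iterative link-based ranking: each page repeatedly shares part of its
    weight with the pages it links to. The damping factor is illustrative."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# An invented link graph: several pages point to one overview page.
links = {"overview": ["detail1"],
         "detail1": ["overview"],
         "detail2": ["overview"],
         "detail3": ["overview"]}
for page, score in sorted(link_rank(links).items(), key=lambda x: -x[1]):
    print(f"{page}: {score:.3f}")   # the overview page ranks highest
```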

Panel 11.1
