the partners may have very different systems, so long as they support an agreed set of services. They will need to agree both on technical standards and on policies,
including financial agreements, intellectual property, security, and privacy.
Research at the University of Illinois, Urbana-Champaign provides a revealing example of the difficulties of interoperability. During 1994-98, as part of the Digital Libraries Initiative, a team based at the Grainger Engineering Library set out to build a federated library of journal articles from several leading science publishers. Since each publisher planned to make its journals available with SGML mark-up, this appeared to be an opportunity to build a federation; the university would provide central services, such as searching, while the collections would be maintained by the publishers. This turned out to be difficult. A basic problem was incompatibility in the way that the publishers used SGML. Each had its own Document Type Definition (DTD). The university was forced to go to enormous lengths to reconcile the semantics of the DTDs, both to extract indexing information and to build a coherent user interface.
This problem proved to be so complex that the university resorted to copying the information from the publishers' computers onto a single system and converting it to a common DTD. If a respected university research group encountered such difficulties with a relatively coherent body of information, it is not surprising that others face the same problems. Panel 11.2 describes this work in more detail.
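The conversion to a common DTD can be pictured with a small sketch. It is only an illustration under loud assumptions: XML stands in for SGML, the publisher names and element names (`art`, `ti`, `atl`, and so on) are invented, and renaming tags covers only the easiest part of the work. The hard part described above was reconciling what the elements mean, which no lookup table captures.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-publisher element names mapped to one common vocabulary.
TAG_MAPS = {
    "publisher_a": {"art": "article", "ti": "title", "au": "author"},
    "publisher_b": {"paper": "article", "atl": "title", "aut": "author"},
}


def to_common(xml_text, publisher):
    """Parse a document and rename its tags to the common vocabulary."""
    mapping = TAG_MAPS[publisher]
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags without a mapping are left unchanged.
        element.tag = mapping.get(element.tag, element.tag)
    return root


def index_fields(root):
    """Extract the indexing information a central search service needs."""
    return {
        "title": root.findtext("title"),
        "authors": [a.text for a in root.findall("author")],
    }


doc_a = "<art><ti>Plasma waves</ti><au>Lee</au><au>Ng</au></art>"
doc_b = "<paper><atl>Plasma waves</atl><aut>Lee</aut><aut>Ng</aut></paper>"

fields_a = index_fields(to_common(doc_a, "publisher_a"))
fields_b = index_fields(to_common(doc_b, "publisher_b"))
```

Once both documents are expressed in the common vocabulary, a single indexing routine serves every publisher, which is the benefit the university gained by converting the material on a single system.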
Panel 11.2
The University of Illinois federated library of
Because of technical difficulties, the first implementation loaded all the documents into a single repository at the University of Illinois. Future plans call for the federation to use repositories maintained by individual publishers. There is also interest in expanding the collections to embrace bibliographic databases, catalogs, and other indexes.
Even the first implementation proved to be a fertile ground for studying users and their wishes. Giving users more powerful methods of searching was welcomed, but it also stimulated further requests. Users pointed out that figures or mathematical expressions are often more revealing of content than abstracts or conclusions. The experiments have demonstrated, once again, that users have great difficulty finding the right words to include in search queries when there is no control over the vocabulary used in the papers, their abstracts, and the search system.
Online catalogs and Z39.50
Many libraries have online catalogs of their holdings that are openly accessible over the Internet. These catalogs can be considered to form a federation. As described in Chapter 3, the catalog records follow the Anglo-American Cataloguing Rules, using the MARC format, and libraries share records to reduce costs. The library community developed the Z39.50 protocol to meet its needs for sharing records and distributed searching; Z39.50 is described in Panel 11.3. In the United States, the Library of Congress, OCLC, and the Research Libraries Group have been active in developing and promulgating these standards; there have been numerous independent implementations at academic sites and by commercial vendors. The costs of belonging to this federation are high, but they have been absorbed over decades, and are balanced by the cost savings from shared cataloguing.
Panel 11.3 Z39.50
Z39.50 is a protocol, developed by the library community, that permits one computer, the client, to search and retrieve information on another, the database server. Z39.50 is important both technically and for its wide use in library systems. In concept, Z39.50 is not tied to any particular category of information or type of database, but much of the development has concentrated on bibliographic data. Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC catalog records and present them to the client.
Z39.50 is built around an abstract view of database searching. It assumes that the server stores a set of databases with searchable indexes. Interactions are based on the concept of a session. The client opens a connection with the server, carries out a sequence of interactions, and then closes the connection. During the course of the session, both the server and the client remember the state of their interaction. It is important to understand that the client is a computer. End-user applications of Z39.50 need a user interface for communication with the user. The protocol makes no statements about the form of that user interface or how it connects to the Z39.50 client.
A typical session begins with the client connecting to the server and exchanging initial information, using the init facility. This initial exchange establishes agreement on basics, such as the preferred message size; it can include authentication, but the actual form of the authentication is outside the scope of the standard. The client might then use the explain service to inquire of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options.
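The session model can be made concrete with a toy sketch. This is not an implementation of Z39.50 and uses no real Z39.50 library; the class names, the rule of agreeing on the smaller message size, and the shape of the explain response are all assumptions chosen for illustration. What it does show is the point made above: state is remembered between calls, and services are only available inside an open session.

```python
class ToyServer:
    """Stand-in for a database server; hypothetical, not a real Z39.50 server."""

    def __init__(self, databases, max_message_size):
        self.databases = databases              # database name -> searchable fields
        self.max_message_size = max_message_size


class ToySession:
    """Client side of a session that remembers state between interactions."""

    def __init__(self, server):
        self.server = server
        self.state = "closed"
        self.message_size = None

    def init(self, preferred_message_size):
        # Assumed negotiation rule: both sides settle on the smaller size.
        self.message_size = min(preferred_message_size,
                                self.server.max_message_size)
        self.state = "open"
        return self.message_size

    def explain(self):
        # Services are only meaningful inside an open session.
        if self.state != "open":
            raise RuntimeError("init must be called before explain")
        return {name: list(fields)
                for name, fields in self.server.databases.items()}

    def close(self):
        self.state = "closed"


server = ToyServer({"Books": ["title", "author"]}, max_message_size=32768)
session = ToySession(server)
agreed_size = session.init(preferred_message_size=65536)
available = session.explain()
session.close()
```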
The search service allows a client to present a query to a database, such as:
In the database named "Books" find all records for which the access point title contains the value "evangeline" and the access point author contains the value "longfellow."
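A Boolean query of this kind is naturally represented as a small tree of terms joined by operators. The sketch below is an illustration only: the `Term`, `And`, and `Or` classes and the sample records are invented, and "contains" is assumed to mean a case-insensitive substring match, which the standard does not dictate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    access_point: str   # e.g. "title" or "author"
    value: str


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


def matches(query, record):
    """Evaluate a query tree against one record (a dict of access points)."""
    if isinstance(query, Term):
        field = record.get(query.access_point, "")
        return query.value.lower() in field.lower()
    if isinstance(query, And):
        return matches(query.left, record) and matches(query.right, record)
    if isinstance(query, Or):
        return matches(query.left, record) or matches(query.right, record)
    raise TypeError(f"unknown query node: {query!r}")


def search(databases, name, query):
    """Return every record in the named database that satisfies the query."""
    return [record for record in databases[name] if matches(query, record)]


# Invented sample data for the illustration.
DATABASES = {
    "Books": [
        {"title": "Evangeline", "author": "Henry Wadsworth Longfellow"},
        {"title": "The Song of Hiawatha", "author": "Henry Wadsworth Longfellow"},
        {"title": "Evangeline and Other Stories", "author": "A. N. Other"},
    ]
}

query = And(Term("title", "evangeline"), Term("author", "longfellow"))
results = search(DATABASES, "Books", query)
```

Only the first sample record satisfies both terms; the other two each fail one side of the conjunction.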
The standard provides several choices of syntax for specifying searches, but only Boolean queries are widely implemented. The server carries out the search and builds a result set. A distinctive feature of Z39.50 is that the server saves the result set, so that a subsequent message from the client can refer to it. Thus the client can narrow a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.
Depending on parameters of the search request, one or more records may be returned to the client. The standard provides a variety of ways that clients can manipulate result sets, including services to sort or delete them. When the searching is complete, the next step is likely to be that the client sends a present request. This asks the server to send specified records from the result set to the client in a specified format. The present service has a wide range of options for controlling content and formats, and for managing large records or large result sets.
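The value of keeping the set on the server can be seen in a toy model. Everything here is hypothetical: the class, the method signatures, the use of Python predicates in place of queries, and the 1-based record positions are assumptions for illustration, not the protocol's actual messages. The counter `records_scanned` shows that refining a saved set touches only that set, not the whole database.

```python
class ToyDatabaseServer:
    """Toy server that keeps named result sets between requests."""

    def __init__(self, records):
        self.records = records
        self.result_sets = {}      # set name -> list of records
        self.records_scanned = 0   # how much work the searches required

    def search(self, set_name, predicate, within=None):
        """Search the database, or an earlier result set if `within` is given."""
        source = self.records if within is None else self.result_sets[within]
        self.records_scanned += len(source)
        self.result_sets[set_name] = [r for r in source if predicate(r)]
        return len(self.result_sets[set_name])

    def present(self, set_name, start, count, fields):
        """Return `count` records from position `start` (1-based), chosen fields only."""
        chosen = self.result_sets[set_name][start - 1:start - 1 + count]
        return [{f: r[f] for f in fields} for r in chosen]


# Invented sample data for the illustration.
records = [
    {"title": "Evangeline", "author": "Longfellow", "year": 1847},
    {"title": "The Song of Hiawatha", "author": "Longfellow", "year": 1855},
    {"title": "Leaves of Grass", "author": "Whitman", "year": 1855},
    {"title": "Walden", "author": "Thoreau", "year": 1854},
]

server = ToyDatabaseServer(records)
first_count = server.search("by_author", lambda r: r["author"] == "Longfellow")
refined_count = server.search("refined", lambda r: r["year"] > 1850,
                              within="by_author")
shown = server.present("refined", start=1, count=1, fields=["title"])
```

The first search scans all four records; the refinement scans only the two records saved in `by_author`.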
This is a large and flexible standard. In addition to these basic services, Z39.50 has facilities for browsing indexes, for access control and resource management, and supports extended services that allow a wide range of extensions.
One of the principal applications of Z39.50 is for communication between servers. A catalog system at a large library can use the protocol to search a group of peers to see if they have either a copy of a work or a catalog record. End users can use a single Z39.50 client to search several catalogs, sequentially or in parallel. Libraries and their patrons gain considerable benefits from sharing catalogs in these ways, yet interoperability among public access catalogs is still patchy. Some Z39.50 implementations have features that others lack, but the underlying cause is that the individual catalogs are maintained by people whose first loyalty is to their local communities. Support for other institutions is never the first priority. Even when catalogs share compatible versions of Z39.50, differences remain in how they are organized and presented to the outside world.
NCSTRL and Dienst
A union catalog is a single catalog that contains records about the materials in several libraries. Union catalogs were used by libraries long before computers. They solve the problem of distributed searching by consolidating the information to be searched into a single catalog. Web search services can be considered to be union catalogs for the web, albeit with crude catalog records. An alternative method of distributed searching is not to build a union catalog, but for each collection to have its own searchable index. A search program sends queries to these separate indexes and combines the results for presentation to the user.
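The alternative method, querying separate indexes and combining the results, might be sketched as follows. The collection names, the in-memory indexes, and the assumption that the same work carries the same record identifier in every collection are all invented for the example; real collections rarely agree so neatly, which is part of why combining results is hard.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-collection indexes: search term -> record identifiers.
INDEXES = {
    "library_a": {"evangeline": ["rec-1", "rec-7"], "walden": ["rec-3"]},
    "library_b": {"evangeline": ["rec-7", "rec-9"]},
    "library_c": {"walden": ["rec-3"]},
}


def search_one(collection, term):
    """Query a single collection's own index."""
    return [(rec, collection) for rec in INDEXES[collection].get(term, [])]


def distributed_search(term, collections):
    """Send the query to every collection in parallel, then merge the hits."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as `collections`.
        per_collection = list(
            pool.map(lambda c: search_one(c, term), collections))

    merged = {}  # record identifier -> collections that hold it
    for hits in per_collection:
        for rec, collection in hits:
            merged.setdefault(rec, []).append(collection)
    return merged


combined = distributed_search("evangeline",
                              ["library_a", "library_b", "library_c"])
```

Note that every collection is contacted for every query, even `library_c`, which holds nothing relevant. A union catalog avoids that cost by consolidating the records in advance.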
Panel 11.4 describes an interesting example. The Networked Computer Science Technical Reference Library (NCSTRL) is a federation of digital library collections that are important to computer science researchers. It uses a protocol called Dienst.
To minimize the costs of acceptance, Dienst builds on a variety of technical standards that are familiar to computer scientists, who are typically heavy users of Unix, the Internet, and the web. The first version of Dienst sent search requests to all servers. As the number of servers grew, this approach broke down; Dienst now uses a search strategy that makes use of a master index, which is a type of union catalog. For reasons of performance and reliability, this master index is replicated at regional centers.
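The difference between the two strategies can be illustrated with a toy comparison. This is not Dienst's actual design or message format: the site names, the report titles, and the word-level master index are assumptions made purely to show why consulting an index first reduces the number of servers contacted.

```python
# Hypothetical servers: site name -> {report identifier: title}.
SERVERS = {
    "site_a": {"tr-1": "lattice theory", "tr-2": "distributed search"},
    "site_b": {"tr-3": "compiler design"},
    "site_c": {"tr-4": "distributed file systems"},
}


def search_server(name, word):
    """Ask one server for reports whose title contains the word."""
    return [(name, report_id)
            for report_id, title in SERVERS[name].items()
            if word in title.split()]


def broadcast_search(word):
    """First strategy: send the request to every server."""
    contacted = list(SERVERS)
    hits = [hit for name in contacted for hit in search_server(name, word)]
    return hits, len(contacted)


def build_master_index():
    """Master index: word -> set of servers holding a report that uses it."""
    index = {}
    for name, reports in SERVERS.items():
        for title in reports.values():
            for word in title.split():
                index.setdefault(word, set()).add(name)
    return index


def indexed_search(word, master_index):
    """Second strategy: contact only the servers the master index names."""
    contacted = sorted(master_index.get(word, ()))
    hits = [hit for name in contacted for hit in search_server(name, word)]
    return hits, len(contacted)


master_index = build_master_index()
broadcast_hits, broadcast_contacts = broadcast_search("distributed")
indexed_hits, indexed_contacts = indexed_search("distributed", master_index)
```

Both strategies return the same hits, but the indexed search contacts two servers rather than three. With hundreds of servers the saving grows accordingly, at the price of keeping the master index current and replicated.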