Structural and administrative metadata is often stored with each digital object. Such metadata can even be embedded within the object itself.
Some metadata refers to a group of objects. Administrative metadata used for access management may apply to an entire repository or a collection within a repository. Finding aids apply to many objects.
Metadata may be stored as separate digital objects with links from the digital objects to which they apply. Some metadata is not stored explicitly but is generated when required.
One of the uses of metadata is for interoperability, yet every digital library has its own ideas about the selection and specification of metadata. The Warwick Framework, described in Panel 13.2, is a conceptual framework that offers some semblance of order to this potentially chaotic situation.
Panel 13.2
system, but the ideas are appealing. The approach of dividing information into well-defined packages simplifies the specification of digital objects and provides flexibility for interoperability.
Protocols for interoperability
Interoperability requires protocols that clients use to send messages to repositories and repositories use to return information to clients. At the most basic level, functions are needed that deposit information in a repository and provide access. Effective systems also require that a client be able to discover the structure of digital objects, since different types of objects require different access methods, and access management may require authentication or negotiation between client and repository. In addition, clients may wish to search indexes within the repository.
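The basic repository functions described above can be sketched as a small interface. This is an illustrative assumption, not any standard protocol: the class and method names (`deposit`, `structure`, `access`) are invented here, and a real repository would operate over a network and enforce access management.

```python
# A minimal, in-memory sketch of the repository functions described above:
# deposit, access, and structure discovery. All names are illustrative.

class Repository:
    """Stand-in for a networked repository; stores objects in a dict."""

    def __init__(self):
        self._store = {}  # identifier -> (type_name, data)

    def deposit(self, identifier, type_name, data):
        # Store a digital object under an identifier.
        self._store[identifier] = (type_name, data)

    def structure(self, identifier):
        # Let a client discover the object's type before choosing
        # an access method.
        return self._store[identifier][0]

    def access(self, identifier):
        # Return the stored data; a real repository would first apply
        # access management (authentication or negotiation).
        return self._store[identifier][1]

repo = Repository()
repo.deposit("obj-1", "text/plain", b"Hello, digital library")
print(repo.structure("obj-1"))  # text/plain
print(repo.access("obj-1"))
```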
Currently, the most commonly used protocol in digital libraries is HTTP, the access protocol of the web, which is discussed in Panel 13.3. Another widely used protocol is Z39.50; because of its importance in information retrieval, it was described in Chapter 11.
Panel 13.3
HTTP
Chapter 2 introduced the HTTP protocol and described the get message type. A get message is an instruction from the client to the server to return whatever information is identified by the URL included in the message. If the URL refers to a process that generates data, it is the data produced by the process that is returned.
The response to a get command has several parts. It begins with a status, which is a three-digit code. Some of these codes are familiar to users of the web because they are error conditions, such as 404, the error code returned when the resource addressed by the URL is not found. Successful status codes are followed by technical information, which is used primarily to support proxies and caches. This is followed by metadata about the body of the response. The metadata provides information to the client about the data type, its length, language and encoding, a hash, and date information. The client uses this metadata to process the final part of the message, the response body, which is usually the file referenced by the URL.
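The parts of the response can be seen by taking one apart. The sketch below parses a hand-written HTTP response (the headers and body are made-up examples) into the three parts described above: the status line, the metadata headers, and the body.

```python
# Parse a sample HTTP response into status line, headers, and body.
# The response text itself is an invented example.

raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 26\r\n"
    "Content-Language: en\r\n"
    "\r\n"
    "<html>Hello, world!</html>"
)

# The blank line separates the headers from the response body.
head, _, body = raw.partition("\r\n\r\n")
lines = head.split("\r\n")

# First line: protocol version, three-digit status code, reason phrase.
version, status, reason = lines[0].split(" ", 2)

# Remaining lines: metadata the client uses to process the body.
headers = dict(line.split(": ", 1) for line in lines[1:])

print(status)                   # 200 -- a successful three-digit code
print(headers["Content-Type"])  # tells the client how to interpret the body
print(body)
```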
Two other HTTP message types are closely related to get. A head message requests the same data as a get message except that the message body itself is not sent. This is useful for testing hypertext links for validity, accessibility, or recent modification without the need to transfer large files. The post message is used to extend the amount of information that a client sends to the server. A common use is to provide a block of data, such as for a client to submit an HTML form. This can then be processed by a CGI script or other application at the server.
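The two message types can be sketched as the raw requests a client would send. The host, path, and form fields below are made-up examples; a real client would also send further headers.

```python
from urllib.parse import urlencode

# Build a HEAD request: same as get, but the server omits the body,
# so a client can check a link without transferring a large file.
def head_request(host, path):
    return (f"HEAD {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "\r\n")

# Build a POST request carrying a block of data, encoded the way an
# HTML form submission would be.
def post_request(host, path, form):
    body = urlencode(form)
    return (f"POST {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Content-Type: application/x-www-form-urlencoded\r\n"
            f"Content-Length: {len(body)}\r\n"
            "\r\n"
            f"{body}")

print(head_request("example.org", "/report.pdf"))
print(post_request("example.org", "/cgi-bin/search", {"query": "metadata"}))
```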
The primary use of HTTP is to retrieve information from a server, but the protocol can also be used to change information on a server. A put message is used to store specified information at a given URL and a delete message is used to delete information. These are rarely used. The normal way to add information to a web server is by separate programs that manipulate data on the server directly, not by HTTP messages sent from outside.
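For completeness, the two update messages can be sketched in the same style. The host and path are again invented examples; as the text notes, these messages are rarely used in practice.

```python
# Build a PUT request that stores the given data at a URL.
def put_request(host, path, data):
    return (f"PUT {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Content-Length: {len(data)}\r\n"
            "\r\n"
            f"{data}")

# Build a DELETE request that removes the information at a URL.
def delete_request(host, path):
    return f"DELETE {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"

print(put_request("example.org", "/notes/draft.txt", "first draft"))
print(delete_request("example.org", "/notes/draft.txt"))
```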
Many of the changes that have been made to HTTP since its inception allow different versions to coexist and enhance performance over the Internet. HTTP recognizes that many messages are processed by proxies or by caches. Later versions include a variety of data and services to support such intermediaries. There are also special message types: options, which allows a client to request information about the communications options that are available, and trace, which is used for diagnostics and testing.
Over the years, HTTP has become more elaborate, but it is still a simple protocol.
The designers have done a good job in resisting pressures to add more and more features, while making some practical enhancements to improve its performance. No two people will agree exactly what services a protocol should provide, but HTTP is clearly one of the Internet's success stories.
Object-oriented programming and distributed objects
One line of research is to develop the simplest possible repository protocol that supports the necessary functions. If the repository protocol is simple, information about complex object types must be contained in the digital objects. (This has been called "SODA" for "smart object, dumb archives".)
Several advanced projects are developing architectures that use the computing concept of distributed objects. The word "object" in this context has a precise technical meaning, which is different from the terms "digital object" and "library object" used in this book. In modern computing, an object is an independent piece of computer code, with its own data, that can be used and reused in many contexts. The information within an object is encapsulated, so that the internals of the object are hidden. All that the outside world knows about a class of objects is a public interface, consisting of methods, which are operations on the object, and instance data. The effect of a particular method may vary from class to class; in a digital library a "render" method might have different interpretations for different classes of object.
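The "render" example above can be made concrete. In the sketch below the class names are invented for illustration: each class encapsulates its own data and exposes the same public method, but the effect of invoking it differs by class.

```python
# Two illustrative classes sharing a public "render" method whose
# effect varies from class to class. Class names are made up here.

class TextObject:
    def __init__(self, data):
        self._data = data  # encapsulated instance data

    def render(self):
        return f"[text] {self._data}"

class ImageObject:
    def __init__(self, data):
        self._data = data

    def render(self):
        return f"[image, {len(self._data)} bytes]"

# The caller uses only the public interface; the internals stay hidden.
for obj in (TextObject("A survey of metadata"), ImageObject(b"\x89PNG")):
    print(obj.render())
```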
After decades of development, object-oriented programming languages, such as C++
and Java, have become accepted as the most productive way to build computer systems. The driving force behind object-oriented programming is the complexity of modern computing. Object-oriented programming allows components to be developed and tested independently, and not need to be revised for subsequently versions of a system. Microsoft is a heavy user of object-oriented programming to develop its own software. Versions of its object-oriented environment are known variously as OLE, COM, DCOM, or Active-X. They are all variants of the same key concepts.
Distributed objects generalize the idea of objects to a networked environment. The basic concept is that an object executing on one computer should be able to interact with an object on another, through its published interface, defined in terms of methods and instance data. The leading computer software companies - with the notable exception of Microsoft - have developed a standard for distributed objects known as CORBA. CORBA provides the developers of distributed computing systems with many of the same programming amenities that object-oriented programming provides within a single computer.
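A much-simplified, in-process analogy can suggest how a client invokes a method on a remote object through its published interface. This is not CORBA: real systems generate proxies from interface definitions and carry calls across a network, and every name below is invented for illustration.

```python
# Toy broker: the client calls a method on a local proxy; the broker
# finds the object that implements it, passes the parameters, invokes
# the method, and returns the result. All names are illustrative.

class Broker:
    def __init__(self):
        self._objects = {}

    def register(self, name, obj):
        self._objects[name] = obj

    def invoke(self, name, method, *args):
        # Locate the implementing object and forward the call.
        return getattr(self._objects[name], method)(*args)

class Proxy:
    """Stands in for a remote object; forwards every call to the broker."""

    def __init__(self, broker, name):
        self._broker = broker
        self._name = name

    def __getattr__(self, method):
        return lambda *args: self._broker.invoke(self._name, method, *args)

class Catalog:
    def lookup(self, term):
        return f"records matching '{term}'"

broker = Broker()
broker.register("catalog", Catalog())
client = Proxy(broker, "catalog")
print(client.lookup("metadata"))  # invoked transparently through the broker
```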
The key notion in CORBA is an Object Request Broker (ORB). When an ORB is added to an application program, it establishes a client-server relationship between objects. Using an ORB, a client can transparently invoke a method on a server object, which might be on the same machine or across a network. The ORB intercepts the call; it finds an object that can implement the request, passes it the parameters, invokes its method, and returns the results. The client does not have to be aware of