Handles and Digital Object Identifiers

Handles are a naming system developed at CNRI as part of a general framework proposed by Robert Kahn of CNRI and Robert Wilensky of the University of California, Berkeley. Although developed independently from the ideas of URNs, the two concepts are compatible and handles can be considered the first URN system to be used in digital libraries. The handle system has three parts:

x A name scheme that allows independent authorities to create handle names with confidence that they are unique.

x A distributed computer system that stores handles along with data that they reference, e.g., the locations where material is stored. A handle is resolved by sending it to the computer system and receiving back the stored data.

x Administrative procedures to ensure high-quality information over long periods of time.

Syntax

Here are two typical handles:

hdl:cnri.dlib/magazine hdl:loc.music/musdi.139

These strings have three parts. The first indicates that the string is of type hdl:. The next, "cnri.dlib" or "loc.music", is a naming authority. The naming authority is assigned hierarchically. The first part of the name, "cnri" or "loc", is assigned by the central authority. Subsequent naming authorities, such as "cnri.dlib" are assigned locally. The final part of the handle, following the "/" separator, is any string of characters that are unique to the naming authority.

Computer system

The handle system offers a central computer system, known as the global handle registry, or permits an organization to set up a local handle service on its own computers, to maintain handles and provide resolution services. The only requirements are that the top-level naming authorities must be assigned centrally and that all naming authorities must be registered in the central service. For performance and reliability, each of these services can be spread over several computers and the data can be automatically replicated. A variety of caching services are provided as are plug-ins for web browsers, so that they can resolve handles.

Digital Object Identifiers

In 1996, an initiative of the Association of American Publishers adopted the handle system to identify materials that are published electronically. These handles are called Digital Object Identifiers (DOI). This led to the creation of an international foundation which is developing DOIs further. Numeric naming authorities are assigned to publishers, such as "10.1006" which is assigned to Academic Press. The following is the DOI of a book published by Academic Press:

doi:10.1006/0121585328

The use of numbers for naming authorities reflects a wish to minimize the semantic information. Publishers frequently reorganize, merge, or transfer works to other publishers. Since the DOIs persist through such changes, they should not contain the name of the original publisher in a manner that might be confusing.

Computer systems for resolving names

Whatever system of identifiers is used, there must be a fast and efficient method for a computer on the Internet to discover what the name refers to. This is known as resolving the name. Resolving a domain name provides the IP address or addresses of the computer system with that name. Resolving a URN provides the data associated with it.

Since almost every computer on the Internet has the software needed to resolve domain names and to manipulate URLs, several groups have attempted to build

systems for identifying materials in digital libraries that use these existing mechanisms. One approach is OCLC's PURL system. A PURL is a URL, such as:

http://purl.oclc.org/catalog/item1

In this identifier, "purl.oclc.org" is the domain name of a computer that is expected to be persistent. On this computer, the file "catalog/item1" holds a URL to the location where the item is currently stored. If the item is moved, this URL must be changed, but the PURL, which is the external name, is unaltered.

PURLs add an interesting twist to how names are managed. Other naming systems set out to have a single coordinated set of names for a large community, perhaps the entire world. This can be considered a top-down approach. PURLs are bottom-up.

Since each PURL server is separate, there is no need to coordinate the allocation of names among them. Names can be repeated or used with different semantics, depending entirely on the decisions made locally. This contrasts with the Digital Object Identifiers, where the publishers are building a single set of names that are guaranteed to be unique.

Structural metadata and object models

Data types

Data types are structural metadata which is used to describe the different types of object in a digital library. The web object model consist of hyperlinked files of data, each with a data type which tells the user interface how to render the file for presentation to the user. The standard method of rendering is to copy the entire object and render it on the user's computer. Chapter 2 introduced the concept of data type and discussed the importance of MIME as a standard for defining the type of files that are exchanged by electronic mail or used in the web. As described in Panel 12.7, MIME is a brilliant example of a standard that is flexible enough to cover a wide range of applications, yet simple enough to be easily incorporated into computer systems.

Panel 12.7. MIME

The name MIME was originally an abbreviation for "Multipurpose Internet Mail Extensions". It was developed by Nathaniel Borenstein and Ned Freed explicitly for electronic mail, but the approach that they developed has proved to be useful in a wide range of Internet applications. In particular, it is one of the simple but flexible building blocks that led to the success of the web.

The full MIME specification is complicated by the need to fit with a wide variety of electronic mail systems, but for digital libraries the core is the concept that MIME calls "Content-Type". MIME describes a data type as three parts, as in the following example:

Content-Type: text/plain; charset = "US-ASCII"

The structure of the data type is a type ("text"), a subtype ("plain"), and one or more optional parameters. This example defines plain text using the ASCII character set.

Here are some commonly used types:

text/plain text/html image/gif

image/jpeg image/tiff audio/basic audio/wav video/mpeg video/quickdraw

The application type provides a data type for information that is to be used with some application program. Here are the MIME types for files in PDF, Microsoft Word, and PowerPoint formats:

application/pdf application/msword application/ppt

Notice that application/msword is not considered to be a text format, since a Microsoft Word file may contain information other than text and requires a specific computer program to interpret it.

These examples are official MIME types, approval by the formal process for registering MIME types. It is also possible to create unofficial types and subtypes with names beginning "x-", such as "audio/x-pn-realaudio".

Lessons from MIME

MIME's success is a lesson on how to turn a good concept into a widely adopted system. The specification goes to great lengths to be compatible with the systems that preceded it. Existing Internet mail systems needed no alterations to handle MIME messages. The processes for checking MIME versions, for registering new types and subtypes, and for changing the system were designed to fit naturally within standard Internet procedures. Most importantly, MIME does not attempt to solve all problems of data types and is not tightly integrated into any particular applications. It provides a flexible set of services that can be used in many different contexts. Thus a specification that was designed for electronic mail has had its greatest triumph in the web, an application that did not exist when MIME was first introduced.

Complex objects

Materials in digital libraries are frequently more complex than files that can be represented by simple MIME types. They may be made of several elements with different data types, such as images within a text, or separate audio and video tracks;

they may be related to other materials by relationships such as part/whole, sequence, and so on. For example, a digitized text may consist of pages, chapters, front matter, an index, illustrations, and so on. An article in an online periodical might be stored on a computer system as several files containing text and images, with complex links among them. Because digital materials are easy to change, different versions are created continually.

A single item may be stored in several alternate digital formats. When existing material is converted to digital form, the same physical item may be converted several times. For example, a scanned photograph may have a high-resolution archival version, a medium quality version, and a thumbnail. Sometimes, these formats are exactly equivalent and it is possible to convert from one to the other (e.g., an uncompressed image and the same image stored with a lossless compression). At

other times, the different formats contain different information (e.g., differing representations of a page of text in SGML and PostScript formats).

Structural types

To the user, an item appears as a single entity and the internal representation is unimportant. A bibliography or index will normally refer to it as a single object in a digital library. Yet, the internal representation as stored within the digital library may be complex. Structural metadata is used to represent the various components and the relationships among them. The choice of structural metadata for a specific category of material creates an object model.

Different categories of object need different object models: e.g., text with SGML mark-up, web objects, computer programs, or digitized sound recordings. Within each category, rules and conventions describe how to organize the information as sets of digital objects. For example, specific rules describe how to represent a digitized sound recording. For each category, the rules describe how to represent material in the library, how the components are grouped as a set of digital objects, the internal structure of each, the associated metadata, and the conventions for naming the digital objects. The categories are distinguished by a structural type.

Object models for computer programs have been a standard part of computing for many years. A large computer program consists of many files of programs and data, with complex structure and inter-relations. This relationship is described in a separate data structure that is used to compile and build the computer system. For a Unix program, this is called a "make file".

Structural types need to be distinguished from genres. For information retrieval, it is convenient to provide descriptive metadata that describes the genre. This is the category of material considered as an intellectual work. Thus, genres of popular music include jazz, blues, rap, rock, and so on. Genre is a natural and useful way to describe materials for searching and other bibliographic purposes, but another object model is required for managing distributed digital libraries. While feature films, documentaries, and training videos are clearly different genres, their digitized equivalents may be encoded in precisely the same manner and processed identically; they are the same structural type. Conversely, two texts might be the same genre - perhaps they are both exhibition catalogs - but, if one is represented with SGML mark-up, and the other in PDF format, then they have different structural types and object models.

Panel 12.8 describes an object model for a scanned image. This model was developed to represent digitized photographs, but the same structural type can be used for any bit-mapped image, including maps, posters, playbills, technical diagrams, or even baseball cards. They represent different content, but are stored and manipulated in a computer with the same structure. Current thinking suggests that even complicated digital library collections can be represented by a small number of structural types.

Less than ten structural types have been suggested as adequate for all the categories of material being converted by the Library of Congress. They include: digitized image, a set of pages images, a set of page images with associated SGML text, digitized sound recording, and digitized video recording.

Panel 12.9

Dalam dokumen Digital Libraries (Halaman 187-192)