Digital library technology is developing rapidly and so are the financial, organizational and social frameworks. As the cost of the underlying technology continues to fall, digital libraries become steadily less expensive.
Two Pioneers of Digital Libraries
It is one of the few important documents about digital libraries that is not available on the Internet. Part of the answer is that digital library technology is still immature, but the challenge is much more than technology.
The Internet and the World Wide Web
The Internet
Most of the details are unimportant to users, but a basic understanding of the technology is helpful when designing and using digital libraries.
An introduction to TCP/IP
If the occasional packet doesn't arrive in time, the human ear would much rather lose small chunks of audio than wait for the missing packet to be retransmitted, which would be terribly jerky. Names of this format are known as domain names, and the system that associates domain names with IP addresses is known as the Domain Name System, or DNS.
The TCP/IP suite
The Internet tradition emphasizes collaboration, and the continued development of the Internet remains in the hands of engineers. An important characteristic of the Internet is that the engineers and computer scientists who develop and operate it are heavy users of their own technology.
NetNews or Usenet
Recently, efforts have been made to rewrite the history of the Internet to promote vested interests and to credit individuals with achievements that many shared. The process by which Internet drafts become RFCs is an intense form of peer review, but it takes place after a draft of the paper is officially posted.
The Internet Engineering Task Force and the RFC series
One of the articles of faith in scholarly publishing is that quality can only be achieved by peer review, the process by which each article is read by other specialists before publication. They include the formal specification of each version of the IP protocol, Internet mail, components of the World Wide Web, and many more.
The Los Alamos E-Print Archives
The Internet and its associated technology have been essential to the rapid growth of digital libraries. The web is a linked collection of information on many computers on the Internet around the world.
HTML
Another reason for the immediate success of the web was that the technology provided gateways to information not created specifically for the web. Each has importance that goes beyond the web in the general field of digital library interoperability.
An HTML example
This convention is easy for both the user and the creator of the web page. The second key component of the web is the Uniform Resource Locator, known as a URL.
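As a rough illustration of what a URL carries, the sketch below splits one into the protocol, the computer's name, and the file path. The function name `describe_url` and the address `www.example.org` are hypothetical, chosen only for this example; the parsing relies on Python's standard `urllib.parse` module.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Return the main pieces of information contained in a URL."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # e.g. "http"
        "host": parts.hostname,     # domain name of the computer
        "path": parts.path,         # location of the file on that computer
    }


# Hypothetical address, used only for illustration.
info = describe_url("http://www.example.org/docs/intro.html")
print(info)
```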
HTTP
On the web, and in a wide variety of Internet applications, the data type is specified by a scheme called MIME. The importance of MIME types on the web is that the data transferred by an HTTP GET command has a MIME type associated with it.
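The following minimal sketch shows the idea of associating a MIME type with data. It assumes the type can be guessed from a file name extension, which is only one of several ways a server might choose it; on the web the type normally travels in the Content-Type header of the HTTP response. The helper `mime_type_for` and the file names are hypothetical.

```python
import mimetypes


def mime_type_for(filename):
    """Guess a MIME type from a file name, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # "application/octet-stream" is the conventional type for unknown data.
    return guessed or "application/octet-stream"


for name in ["index.html", "photo.jpeg", "notes.zzzunknown"]:
    print(name, "->", mime_type_for(name))
```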
The World Wide Web Consortium
But the web is no detour to follow until the real digital libraries come along. We can expect digital libraries to be very different twenty-five years from now; it will be hard to remember the early days of the web.
Libraries and publishers
Computing in libraries
A MARC record
Members of the university could use their own computers to search the catalogs of other universities. It is one of the few protocols widely used for interoperation between different computer systems.
Mercury and CORE
It simply reflected the fact that none of the journal publishers were able to provide other formats. The advent of the Internet and the widespread availability of web browsers went a long way toward solving the problem of user interface development.
HighWire Press
The Association for Computing Machinery's Digital Library
American Memory and the National Digital Library Program
In addition to copyright, other reasons for restrictions include conditions required of donors of the original material to the library. A final but important aspect of American Memory is that people look to the Library of Congress for leadership.
JSTOR
The library is aware of the longstanding difficulties of maintaining large collections and has placed great emphasis on the way it organizes the items within its collections. These fees are set at less than the comparable cost to the libraries of storing paper copies of the journals.
Innovation and research
The research process
The structure of university libraries inhibits radical change, but university librarians know that computing is fundamental to the future of scholarly communication. Several projects mentioned in Chapter 3 are of this type: HighWire Press at Stanford University, which puts scientific journals online; the Tulip project, a collaboration between university libraries and Elsevier Science to explore digitized versions of scientific journals; and the contributions of the University of Michigan and Princeton University to the JSTOR project, which converts historical back runs of important journals.
The Coalition for Networked Information
The Digital Libraries Initiative
Another impact of the Digital Libraries Initiative has been to clarify the distinction between digital library research and implementation. The centerpiece of this chapter is a quick overview of the main areas of research in digital libraries.
People, organizations, and change
People and change
Panels in this section describe three others: the Netlib mathematical software library, the data archives of the Inter-university Consortium for Political and Social Research, and the Perseus collections of classic texts. Digital libraries that were created by user communities are particularly interesting because services are built to meet the needs of disciplines, without preconceived ideas about how collections are conventionally managed.
Netlib
They employ professionals, but the leadership and most of the staff come from the respective disciplines of physics, computing, applied mathematics, the social sciences and classics.
Inter-university Consortium for Political and Social Research
Perseus
From this early work came one of the most important digital libraries in the humanities. Many materials in libraries were created to record certain events or decisions.
The Ticer Summer School
Conversely, people who are not comfortable with technology may find that they are left behind. Modern libraries need people who are aware of the changes happening around them, curious and open to discovering new ideas.
The School of Information Management and Systems at the University of California at
MELVYL
The nine campuses of the University of California often function as if they were nine independent universities. Each of the nine campuses has its own library and each recognizes the need to provide digital library services.
The renovation of Harvard Law Library
Law school faculty are known for preferring to work in their offices rather than walking to the library. However, the attention given to reading spaces in Langdell implies a belief that lawyers and law school students will come to the library to do serious work for many years to come.
Economic and legal issues
Introduction
Some services tried to charge a monthly fee, but the creator of Lycos was determined to offer open access to everyone. Fortunately, librarians and publishers don't have to pay for one of the most expensive parts of digital libraries.
The economics of scientific journals
In the long run, electronic publications are cheaper to produce due to savings in printing, paper and distribution. Many legal issues are general Internet issues and not specific to digital libraries.
The Digital Millennium Copyright Act
Fair use is a legal right in United States law that allows certain uses of copyrighted information without permission from the copyright owner. This uncertainty was one of the reasons that led to a series of efforts to rewrite copyright law, both in the United States and internationally.
Events in the history of copyright
The court ruled that copyright protection in derivative works only applies to newly added material. The court ruled that copyright does not protect utilitarian or useful objects, in this case a sculptural lamp.
Digital library statistics and privacy
A few months later we met two other groups working on some of the same issues. The success of the Internet and the rapid expansion of digital libraries have been fueled by the open exchange of ideas.
Access management and security
Why control access?
Smart cards are one of the best systems of authentication; they are highly secure and quite convenient to use. With such documents, the exact wording is essential; if a document claims to be the text of the North American Free Trade Agreement, the reader must be confident that the text is accurate.
Electronic registration and deposit for copyright
A digital signature confirms to the copyright office that the submission was properly received and confirms the identity of the sender. Digital libraries may have policies that depend on the time since the publication date or physical characteristics such as the size of the material.
Access management policies for computer software
Cryptolopes
It can only be opened by recipients after they have met any access management requirements, such as paying for the use of the information. To view premium content, the user agrees to the terms of the Cryptolope container as stated in the summary.
The Data Encryption Standard (DES)
Private key encryption is only as secure as the procedures used to keep the key secret. Public key cryptography is one of the few areas where most computer scientists would agree that there were genuine inventions.
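To make the contrast with private key encryption concrete, here is a toy sketch of a public key scheme in the style of RSA. The primes are tiny and chosen only for illustration, so this offers no real security; the function names `make_keys` and `apply_key` are hypothetical. The point is that the key used to encrypt can be published, while only the holder of the other key can decrypt.

```python
from math import gcd


def make_keys(p, q, e):
    """Build a toy public/private key pair from two small primes."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must have no common factor with (p-1)*(q-1)")
    d = pow(e, -1, phi)  # modular inverse of e (needs Python 3.8 or later)
    return (e, n), (d, n)


def apply_key(key, value):
    """Encrypt or decrypt a number by modular exponentiation."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


public_key, private_key = make_keys(61, 53, 17)
ciphertext = apply_key(public_key, 65)          # anyone can do this
recovered = apply_key(private_key, ciphertext)  # only the key holder can
print(ciphertext, recovered)
```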
User interfaces and usability
The right side of Figure 8.1 shows the layers needed to implement any conceptual model. At the top is the design of the interface, the appearance on the screen and the actual manipulation by the user.
Aspects of a user interface: page turning
Java
A user who wants to run a new user interface must first find a version of the user interface for the specific type of computer. The usual process is then to compile the program into the machine language of the specific computer.
New conceptual models: DLITE and Pad++
Informedia
In combination, the selected words and images provide a video abstract that conveys the essence of the complete video segment. The image is automatically selected as representative of the video segment as it relates to the current query.
Text
By using separate style sheets, a single document, represented by structural mark-up, can be rendered in different ways for different purposes. Mark-up languages can represent almost any structure, but the variety of structural elements that can be part of a document is enormous, and the details of appearance that authors and designers can choose are equally diverse.
The Oxford English Dictionary
For important documents, conversion projects capture the appearance and also identify the structure of the original. A scanned page reproduces the appearance of the printed page, but represents text simply as an image.
ASCII
Therefore, the ninety-five printable ASCII characters are used in applications where interoperability is a high priority. Text materials use a much wider range of characters than the printable ASCII set, with its basis in English.
Scripts represented in Unicode
Unicode was not adopted simply because of the efforts of linguists to support a wide range of languages. Unicode is not the only method used to represent a wide range of characters on computers.
SGML
Therefore, there is a special representation of Unicode characters, known as UTF-8, that allows the gradual transformation of ASCII-based applications to the full range of Unicode scripts. They were using a wide range of alphabets long before the computer industry paid attention to the problem.
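The property that makes this gradual transition possible can be checked directly: text that uses only ASCII characters has exactly the same bytes in UTF-8, while other characters take two or more bytes. The sketch below uses Python's built-in string encoding; the helper name `utf8_length` and the sample characters are chosen only for illustration.

```python
def utf8_length(character):
    """Number of bytes UTF-8 uses to represent one character."""
    return len(character.encode("utf-8"))


# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "digital library"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Characters outside ASCII need more than one byte each.
for character in ["A", "é", "Ω", "中"]:
    print(character, utf8_length(character))
```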
Document type definitions (DTDs) for scholarship
Digital library projects such as JSTOR and American Memory use simple DTDs derived from the work of the Text Encoding Initiative. HTML, the markup language used by the web, can be considered an unorthodox DTD.
Features of HTML
The process requires that the structural tags in the mark-up be translated into formats that can be displayed either in print or on the screen. The image the user sees comes from a combination of mark-up provided by the designer of a website, formatting conventions built into a browser, and options selected by the user.
Cascading Style Sheets (CSS) and Extensible Style Language (XSL)
The h1 and h2 headings are HTML body elements, but they have an explicit rule; they will be displayed in a sans-serif font. Because multiple style sheets can be used for the same page, their rules can conflict.
Portable Document Format (PDF)
PDF is widely used in commercial document management systems, but some digital libraries have been reluctant to use PDF. In addition, some libraries and digital archives reject PDF because the format is owned by a single company.
Information retrieval and descriptive metadata
On the one hand, it is crucial to build on the investments and the expertise behind them. In digital libraries, the role of MARC and the related cataloging rules is a source of debate.
MeSH - medical subject headings
Medicine in the United States is especially fortunate to have a cadre of reference librarians who can support users. In digital libraries, the trend is to provide users with tools that allow them to find information directly without the help of a reference librarian.
The Art and Architecture Thesaurus
It also requires skilled users with help desk tools, as the terms used in the search query must be consistent with the terms assigned by the indexer.
Dublin Core elements
Much of the development that led to automatic indexing came from text analysis research. These are the meta tags from an HTML description of the Dublin Core element set.
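As a sketch of how Dublin Core metadata can appear as meta tags in an HTML page, the code below turns a small record into tag lines. It assumes the common convention of prefixing each element name with "DC."; the function name `dublin_core_meta_tags` and the sample record are hypothetical and do not reproduce any actual page.

```python
from html import escape

# The fifteen elements of the basic Dublin Core set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_meta_tags(record):
    """Render a dict of Dublin Core values as HTML meta tag lines."""
    lines = []
    for element, value in record.items():
        if element not in DUBLIN_CORE_ELEMENTS:
            raise ValueError("not a Dublin Core element: " + element)
        content = escape(value, quote=True)  # protect quotes and ampersands
        lines.append('<meta name="DC.%s" content="%s">' % (element, content))
    return lines


# Hypothetical record, used only for illustration.
sample = {"title": "Sample & Example Report", "language": "en"}
for line in dublin_core_meta_tags(sample):
    print(line)
```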
The Resource Description Framework (RDF)
Inverted files
An inverted file is a list of words in a set of documents and their locations within those documents. A feature of vector space and probabilistic information retrieval methods is that they are more effective with long queries.
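A minimal sketch of an inverted file follows. It assumes that words are runs of letters and digits, ignores case, and records each location as a document identifier plus a word position; production systems make more careful choices about tokenization and storage. The names `build_inverted_file`, `d1`, and `d2` are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_file(documents):
    """Map each word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].append((doc_id, position))
    return dict(index)


documents = {
    "d1": "Digital libraries",
    "d2": "Libraries and archives",
}
inverted = build_inverted_file(documents)
print(inverted["libraries"])
```

Looking up a query word is then a single dictionary access, which is why the structure suits searching large collections.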
Tipster and TREC
Criteria are needed to measure how effectively ranking gives high ranks to the most relevant documents. The effectiveness of information discovery depends on the goals of the users and how well the digital library meets them.
Distributed information discovery
Distributed computing and interoperability
Because of the way the indexing programs traverse the Internet, they are often called web crawlers. The web search programs allow users to search the index, using information retrieval methods of the kind described in Chapter 10.
Page ranks and Google
As the web has grown larger and the management of the search programs has become a commercial venture, it has become more extensive. Research at the University of Illinois at Urbana-Champaign provides a telling example of the difficulties of interoperability.
The University of Illinois federated library of scientific literature
In concept, Z39.50 is not tied to any particular category of information or type of database, but much of the development has concentrated on bibliographic data. The protocol makes no statements about the shape of that user interface or how it connects to the Z39.50 client.
NCSTRL and the Dienst model of distributed searching
Each information service makes some implicit assumptions about the scenarios it supports, the queries it accepts, and the types of responses it provides. This was one of the motivations behind Dublin Core and the Resource Description Framework (RDF), which were described in Chapter 10.
The Harvest architecture
The fundamental concept is to enable clients to discover broad features of the search engines and the collections they maintain. The challenge is that the search engines are different and the collections have different characteristics.
Object models, identifiers, and structural metadata
When many copies of a manifestation are made, each is a separate item, such as a specific copy of a book or computer file. They include the digital equivalents of familiar objects, such as maps, audio recordings and video, and other objects that provide the user with a direct representation of the stored form of a digital object.
Geospatial collections: the Alexandria library
Coverage specifies the geographic area covered, such as the city of Santa Barbara or the Pacific Ocean. Extent describes various kinds of information, such as topographical features, political boundaries or population density.
Informedia: multi-modal information retrieval
RealAudio
The first is that the user's browser must accept a stream of audio data in RealAudio format. When the user accesses this object, the information returned can be data, such as the temperature, precipitation, wind speed and direction, and humidity, or it can be a photo to show cloud cover.
Domain names
There are several commercial organizations called Apple, and the name gives no indication as to whether this website is managed by Apple Computer or another company. Thus, anyone could register the name "pittsburgh.net", without any connection to the city of Pittsburgh.
Information contained in a Uniform Resource Locator (URL)
The goal is to have names that can last longer than any software system that exists today, even longer than the Internet itself. Another application is to provide email addresses that do not need to be changed when a person changes jobs or moves to a different ISP.
Handles and Digital Object Identifiers
MIME
When existing material is converted to digital form, the same physical item can be converted multiple times. This model was developed to represent digitized photographs, but the same structural type can be used for any bit-mapped image, including maps, posters, playbills, engineering diagrams, or even baseball cards.
An object model for scanned images
In a digital library, the stored form of information is rarely the same as the form delivered to the user. One of the goals of object models is to provide the user with a variety of distribution options.
Repositories and archives
Repositories
Web servers
One of the requirements of web servers (and also web browsers) is to continue to support older versions of the HTTP protocol. They must be prepared for messages in any version of the protocol and handle them accordingly.
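One way a server can cope with several protocol versions is to read the version from the first line of each request and branch on it. The sketch below is a simplified assumption of how that first step might look: it treats a request line with no version token as the original HTTP/0.9 form and otherwise reads the "HTTP/x.y" suffix. The name `parse_request_line` is hypothetical, and real servers validate far more than this.

```python
import re

# Method, target, and an optional "HTTP/major.minor" version token.
REQUEST_LINE = re.compile(
    r"^(?P<method>[A-Z]+) (?P<target>\S+)"
    r"(?: HTTP/(?P<major>\d)\.(?P<minor>\d))?$"
)


def parse_request_line(line):
    """Return (method, target, (major, minor)) for an HTTP request line."""
    match = REQUEST_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a recognizable request line: %r" % line)
    if match.group("major") is None:
        version = (0, 9)  # early requests carried no version token
    else:
        version = (int(match.group("major")), int(match.group("minor")))
    return match.group("method"), match.group("target"), version


for line in ["GET /index.html", "GET /index.html HTTP/1.0", "GET / HTTP/1.1"]:
    print(parse_request_line(line))
```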
The Warwick Framework
The information within an object is encapsulated so that the inside of the object is hidden. CORBA provides developers on distributed computing systems with many of the same programming capabilities that object-oriented programming provides within a single computer.