



Panel 3.1. A MARC record

Consider a monograph, for which the conventional bibliographic citation is:

Caroline R. Arms, editor, Campus strategies for libraries and electronic information. Bedford, MA: Digital Press, 1990.

A search of the catalog at the Library of Congress, using one of the standard terminal-based interfaces, displays the catalog record in a form that shows the information in the underlying MARC format.

&001 89-16879 r93

&050 Z675.U5C16 1990

&082 027.7/0973 20

&245 Campus strategies for libraries and electronic information/Caroline Arms, editor.

&260 [Bedford, Mass.] : Digital Press, c1990.

&300 xi, 404 p. : ill. ; 24 cm.

&440 EDUCOM strategies series on information technology

&504 Includes bibliographical references (p. [373]-381).

&020 ISBN 1-55558-036-X : $34.95

&650 Academic libraries--United States--Automation.

&650 Libraries and electronic publishing--United States.

&650 Library information networks--United States.

&650 Information technology--United States.

&700 Arms, Caroline R. (Caroline Ruth)

&040 DLC DLC DLC

&043 n-us---

&955 CIP ver. br02 to SL 02-26-90

&985 APIF/MIG

The information is divided into fields, each with a three-digit code. For example, the 440 field is the title of a monograph series, and the 650 fields are Library of Congress subject headings. Complex rules specify to the cataloguer which fields should be used and how relationships between elements should be interpreted.

The actual coding is more complex than shown. The full MARC format consists of a pre-defined set of fields each identified by a tag. Within each field, subfields are permitted. Fields are identified by three-digit numeric tags and subfields by single letters. To get a glimpse of how information is encoded in this format, consider the 260 field, which begins "&260". In an actual MARC record, this is encoded as:

&2600#abc#[Bedford, Mass.] :#Digital Press,#c1990.%

This has information about publication, divided into three subfields. The string "abc" indicates that there are three subfields. The first, with tag "a", is the place of publication; the next, with tag "b", is the publisher; and the third, with tag "c", is the date.
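To make the field-and-subfield structure concrete, here is a minimal sketch in Python that splits the simplified 260 field shown above into its parts. The "&", "#", and "%" characters follow the illustration in this panel, not the byte-level MARC standard, which uses special control characters; the parsing function itself is invented for illustration.

```python
# Minimal sketch: splitting the simplified 260 field shown above into its
# subfields. The "&", "#", and "%" characters follow the illustration in
# this panel; real MARC records use special control characters instead.

def parse_field(field: str) -> dict:
    body = field.lstrip("&").rstrip("%")
    tag = body[:3]                        # three-digit field tag, e.g. "260"
    parts = body[3:].split("#")
    indicator, codes, values = parts[0], parts[1], parts[2:]
    # "abc" lists the subfield codes; the values follow in the same order.
    return {"tag": tag,
            "indicator": indicator,
            "subfields": dict(zip(codes, values))}

field_260 = "&2600#abc#[Bedford, Mass.] :#Digital Press,#c1990.%"
print(parse_field(field_260))
# {'tag': '260', 'indicator': '0',
#  'subfields': {'a': '[Bedford, Mass.] :', 'b': 'Digital Press,', 'c': 'c1990.'}}
```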

The development of MARC led to two important types of computer-based system.

The first was shared cataloguing; the pioneer was OCLC, created by Fred Kilgour in 1967. OCLC has a large computer system which has grown to more than 35 million catalog records in MARC format, including records received from the Library of Congress. When an OCLC member library acquires a book that it wishes to catalog, it begins by searching the OCLC database. If it finds a MARC record, it downloads the record to its own computer system and records the holding in the OCLC database. In the past it could also have ordered a printed catalog card. This is called "copy cataloguing". If the OCLC database does not contain the record, the library is encouraged to create a record and contribute it to OCLC. With copy cataloguing, each item is catalogued once and the intellectual effort is shared among all libraries. MARC cataloguing and OCLC's success in sharing catalog records have been emulated by similar services around the world.

The availability of MARC records stimulated a second development. Individual libraries were able to create online catalogs of their holdings. In most cases, the bulk of the records were obtained from copy cataloguing. Today, almost every substantial library in the United States has its own online catalog. Library jargon calls such a catalog an "OPAC", for "online public access catalog". Many libraries have gone to great efforts to convert their old card catalogs to MARC format, so that the online catalog is the record of their entire holdings, rather than having an online catalog for recent acquisitions but traditional card catalogs for older materials. The retrospective conversion of Harvard University's card catalog to MARC format has recently been completed. Five million cards were converted at a cost approaching $15 million.

A full discussion of MARC cataloguing and online public access catalogs is outside the scope of this book. MARC was an innovative format at a time when most computer systems represented text as fixed length fields with capital letters only. It remains a vital format for libraries, but it is showing its age. Speculation on the future of MARC is complicated by the enormous investment that libraries have made in it.

Whatever its future, MARC was a pioneering achievement in the history of both computing and libraries. It is a key format that must be accommodated by digital libraries.

Linking online catalogs and Z39.50

During the 1980s, university libraries began to connect their online catalogs to networks. As an early example, by 1984 there was a comprehensive campus network at Dartmouth College. Since the computer that held the library catalog was connected to the network, anybody with a terminal or personal computer on campus could search the catalog. Subsequently, when the campus network was connected to the Internet, the catalog became available to the whole world. People outside the university could search the catalog and discover what items were held in the libraries at Dartmouth. Members of the university could use their own computers to search the catalogs of other universities. This sharing of library catalogs was one of the first large-scale examples of cooperative sharing of information over the Internet.

In the late 1970s, several bibliographic utilities, including the Library of Congress, the Research Libraries Group, and the Washington Libraries Information Network, began a project known as the Linked Systems Project, which developed the protocol now known by the name of Z39.50. This protocol allows one computer to search for information on another. It is primarily used for searching records in MARC format, but the protocol is flexible and is not restricted to MARC. Technically, Z39.50 specifies rules that allow one computer to search a database on another and retrieve the records that are found by the search. Z39.50 and its role in fostering interoperability among digital libraries are discussed in Chapter 11. It is one of the few protocols to be widely used for interoperation among diverse computer systems.
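To give a flavor of what such a search looks like to a programmer, here is a rough sketch using the ZOOM style of interface that several Z39.50 toolkits provide (for example, the PyZ3950 package for Python). The server address, database name, and query are illustrative assumptions, not details of the Linked Systems Project, and the exact calls vary between toolkits.

```python
# Rough sketch of a Z39.50 search using a ZOOM-style interface, here the
# PyZ3950 package. The server address, database name, and query are
# illustrative assumptions; details vary between toolkits and servers.
from PyZ3950 import zoom

conn = zoom.Connection('z3950.example.org', 210)   # hypothetical Z39.50 server
conn.databaseName = 'catalog'                      # hypothetical database name
conn.preferredRecordSyntax = 'USMARC'              # ask for records in MARC format

query = zoom.Query('CCL', 'ti="campus strategies"')
results = conn.search(query)                       # the search runs on the remote computer
print(len(results), "records found")
for i in range(min(3, len(results))):
    print(results[i])                              # records are returned in MARC format
conn.close()
```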

Abstracts and indexes

Library catalogs are the primary source of information about monographs, but they are less useful for journals. Catalogs provide a single, brief record for an entire run of a journal. This is of little value to somebody who wants to discover individual articles in academic journals. Abstracting and indexing services developed to help researchers to find such information. Typical services are Medline for the biomedical literature, Chemical Abstracts for chemistry, and Inspec for the physical sciences including computing. The services differ in many details, but the basic structures are similar.

Professionals who are knowledgeable about a subject area read each article from a large number of journals and assign index terms or write abstracts. Sometimes services use index terms that are drawn from a carefully controlled vocabulary, such as the MeSH headings that the National Library of Medicine uses for its Medline service. Other services are less strict. Some generate all their own abstracts. Others, such as Inspec, will use an abstract supplied by the publisher.

Most of these services began as printed volumes that were sold to libraries, but computer searching of these indexes goes back to the days of batch processing and magnetic tape. Today, almost all searching is by computer. Some indexing services run computer systems on which users can search for a fee; others license their data to third parties who provide online services. Many large libraries license the data and mount it on their own computers. In addition, much of the data is available on CD-ROM.

Once the catalog was online, libraries began to mount other data, such as abstracts of articles, indexes, and reference works. These sources of information can be stored in a central computer and the retrieved records displayed on terminals or personal computers. Reference works consisting of short entries are particularly suited for this form of distribution, since users move rapidly from one entry to another and will accept a display that has text characters with simple formatting. Quick retrieval and flexible searching are more important than the aesthetics of the output on the computer screen.

As a typical example, here are some of the many information sources that the library at Carnegie Mellon University provided online during 1997/8.

Carnegie Mellon library catalog

Carnegie Mellon journal list

Bibliographic records of architectural pictures and drawings

Who's who at CMU

American Heritage Dictionary

Periodical Abstracts

ABI/Inform (business periodicals)

Inspec (physics, electronics, and computer science)

Research directory (Carnegie Mellon University)

Several of these online collections provide local information, such as Who's who at CMU, which is the university directory of faculty, students, and staff. Libraries do not provide their patrons only with formally published or academic materials. Public libraries, in particular, are a broad source of information, from tax forms to bus timetables. Full-text indexes and web browsers allow traditional and non-traditional library materials to be combined in a single system with a single user interface. This approach has become so standard that it is hard to realize that only a few years ago merging information from such diverse sources was rare.

Mounting large amounts of information online and keeping it current is expensive. Although hardware costs fall continually, they are still noticeable; the bigger costs, however, are in licensing the data and in the people who handle both the business aspects and the large data files. To reduce these costs, libraries have formed consortia in which one set of online data serves many libraries. The MELVYL system, which serves the campuses of the University of California, was one of the first. It is described in Chapter 5.

Information retrieval

Information retrieval is a central topic for libraries. A user, perhaps a scientist, doctor, or lawyer, is interested in information on some topic and wishes to find the objects in a collection that cover the topic. This requires specialized software. During the mid-1980s, libraries began to install computers with software that allowed full-text searching of large collections. Usually, the MARC records of a library's holdings were the first data to be loaded onto this computer, followed by standard reference works. Full-text searching meant that a user could search using any words that appeared in the record and did not need to be knowledgeable about the structures of the records or the rules used to create them.

Research in this field is at least thirty years old, but the basic approach has changed little. A user expresses a request as a query. This may be a single word, such as "cauliflower", a phrase, such as "digital libraries", or a longer query, such as, "In what year did Darwin travel on the Beagle?" The task of information retrieval is to find objects in the collection that match the query. Since a computer does not have time to go through the entire collection for each search, looking at every object separately, it must maintain an index of some sort, so that objects can be found by looking up entries in the index.

As computers have grown more powerful and the price of storage has declined, methods of information retrieval have moved from carefully controlled searching of short records, such as catalog records or those used by abstracting and indexing services, to searching the full text of every word in large collections. In the early days, expensive computer storage and processing power stimulated the development of compact methods of storage and efficient algorithms. More recently, web search programs have intensified research into methods for searching large amounts of information distributed across many computers. Information retrieval is a topic of Chapters 10 and 11.

Representations of text and SGML

Libraries and publishers share an interest in using computers to represent the full richness of textual materials. Textual documents are more than simple sequences of letters. They can contain special symbols, such as mathematics or musical notation, characters from any language in the world, embedded figures and tables, various fonts, and structural elements such as headings, footnotes, and indexes. A desirable way to store a document in a computer is to encode these features and store them with the text, figures, tables, and other content. Such an encoding is called a mark-up language. For several years, organizations with a serious interest in text have been developing a mark-up scheme known as SGML. (The name is an abbreviation for Standard Generalized Markup Language.) HTML, the format for text that is used by the web, is a simple derivative of SGML.
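As a small illustration of mark-up, the sketch below feeds an invented HTML fragment (HTML being the SGML derivative mentioned above) to the parser in Python's standard library and reports the structural elements separately from the text they enclose.

```python
# Small sketch: mark-up separates structure (tags) from content (text).
# The HTML fragment is an invented example; it is parsed with the
# HTMLParser class from Python's standard library.
from html.parser import HTMLParser

fragment = "<h1>Campus strategies</h1><p>Edited by <em>Caroline Arms</em>.</p>"

class StructureReporter(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("structure:", tag)
    def handle_data(self, data):
        if data.strip():
            print("content:  ", data)

StructureReporter().feed(fragment)
# structure: h1
# content:   Campus strategies
# structure: p
# content:   Edited by
# structure: em
# content:   Caroline Arms
# content:   .
```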

Since the representation of a document in SGML is independent of how it will be used, the same text, defined by its SGML mark-up, can be displayed in many forms and formats: paper, CD-ROM, online text, hypertext, and so on. This makes SGML attractive for publishers who may wish to produce several versions of the same underlying work. A pioneer application in using SGML in this way was the new Oxford English Dictionary. SGML has also been heavily used by scholars in the humanities who find in SGML a method to encode the structure of text that is independent of any specific computer system or method of display. SGML is one of the topics in Chapter 9.

Digital libraries of scientific journals

The early experiments

During the late 1980s several publishers and libraries became interested in building online collections of scientific journals. The technical barriers that had made such projects impossible earlier were disappearing, though still present to some extent. The cost of online storage was coming down, personal computers and networks were being deployed, and good database software was available. The major obstacles to building digital libraries were that academic literature was on paper, not in electronic formats, and that institutions were organized around physical media, not computer networks.

One of the first attempts to create a campus digital library was the Mercury Electronic Library, a project that we undertook at Carnegie Mellon University between 1987 and 1993. Mercury was able to build upon the advanced computing infrastructure at Carnegie Mellon, which included a high-performance network, a fine computer science department, and the tradition of innovation by the university libraries. A slightly later effort was the CORE project at Cornell University to mount images of chemistry journals. Both projects worked with scientific publishers to scan journals and establish collections of online page images. Whereas Mercury set out to build a production system, CORE also emphasized research into user interfaces and into how the system was used by chemists. The two projects are described in Panel 3.2.
