The following fifteen elements form the Dublin Core metadata set. All elements are optional and all can be repeated. The descriptions given below are condensed from the official Dublin Core definitions, with permission from the design team.
1. Title. The name given to the resource by the creator or publisher.
2. Creator. The person or organization primarily responsible for the intellectual content of the resource. Examples include authors in the case of written documents, and artists, photographers, or illustrators in the case of visual resources.
3. Subject. The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemes is encouraged.
4. Description. A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.
5. Publisher. The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.
6. Contributor. A person or organization not specified in a creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a creator element (for example, editor, transcriber, and illustrator).
7. Date. A date associated with the creation or availability of the resource.
8. Type. The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary.
9. Format. The data format of the resource, used to identify the software and possibly hardware that might be needed to display or operate the resource.
10. Identifier. A string or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs.
11. Source. Information about a second resource from which the present resource is derived.
12. Language. The language of the intellectual content of the resource.
13. Relation. An identifier of a second resource and its relationship to the present resource. This element permits links between related resources and resource descriptions to be indicated. Examples include an edition of a work (IsVersionOf), or a chapter of a book (IsPartOf).
14. Coverage. The spatial locations and temporal durations characteristic of the resource.
15. Rights. A rights management statement, an identifier that links to a rights management statement, or an identifier that links to a service providing information about rights management for the resource.
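The statement that all elements are optional and all can be repeated can be illustrated by modelling a record as a mapping from element name to a list of values. This is a minimal sketch in Python; the class name `DublinCoreRecord` and its methods are hypothetical and are not part of the Dublin Core itself.

```python
from collections import defaultdict

# The fifteen element names listed above, in lowercase.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


class DublinCoreRecord:
    """Hypothetical container: every element is optional and repeatable."""

    def __init__(self):
        # A missing element simply has no values; a repeated one has several.
        self._values = defaultdict(list)

    def add(self, element, value):
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element!r}")
        self._values[element].append(value)

    def get(self, element):
        # Return a copy so callers cannot modify the record by accident.
        return list(self._values.get(element, []))


record = DublinCoreRecord()
record.add("title", "Dublin Core Element Set Reference Page")
record.add("subject", "dublin core metadata element set")
record.add("subject", "networked object description")  # repeated element

print(record.get("subject"))  # two values
print(record.get("rights"))   # empty list: the element was omitted
```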
Simplicity is both the strength and the weakness of the Dublin Core. Whereas traditional cataloguing rules are long and complicated, requiring professional training to apply effectively, the Dublin Core can be described simply, but simplicity conflicts with precision. The team has struggled with this tension. Initially the aim was to create a single set of metadata elements, suitable for untrained people who publish electronic materials to describe their work. Some people continue to hold this minimalist view. They would like to see a simple set of rules that anybody can apply.
Other people prefer the benefits that come from more tightly controlled cataloguing rules and would accept the additional labor and cost. They point out that extra structure in the elements results in extra precision in the metadata records. For example, if entries in a subject field are drawn from the Dewey Decimal Classification, it is helpful to record that fact in the metadata. To further enhance the effectiveness of the metadata for information retrieval, several of the elements will have recommended lists of values. Thus, there might be a specified set of types, and indexers would be advised to select from that list.
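The structuralist idea can be sketched as a value that carries the name of the scheme it was drawn from, together with a check against a recommended list. The scheme label "DDC", the sample class number, and the short list of types are assumptions made for illustration; they are not values defined by the Dublin Core.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative list only, taken from the examples given for the Type element.
RECOMMENDED_TYPES = {"home page", "novel", "poem", "working paper",
                     "preprint", "technical report", "essay", "dictionary"}


@dataclass(frozen=True)
class QualifiedValue:
    """A value plus the (optional) scheme it was drawn from."""
    value: str
    scheme: Optional[str] = None  # e.g. "DDC" for Dewey Decimal Classification


def is_recommended_type(value: str) -> bool:
    # Case-insensitive membership test against the recommended list.
    return value.strip().lower() in RECOMMENDED_TYPES


minimalist_subject = QualifiedValue("digital libraries")      # free text, no scheme
structured_subject = QualifiedValue("025.04", scheme="DDC")   # assumed class number

print(structured_subject)
print(is_recommended_type("Technical Report"))  # True
print(is_recommended_type("blog post"))         # False
```

A retrieval system that sees the scheme can interpret the value precisely, whereas the minimalist value can only be matched as free text.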
The current strategy is to have two options, "minimalist" and "structuralist". The minimalist option will meet the original criterion of being usable by people who have no formal training. The structuralist option will be more complex, requiring fuller guidelines and trained staff to apply them.
Automatic indexing
Cataloguing and indexing are expensive when carried out by skilled professionals. A rule of thumb is that each record costs about fifty dollars to create and distribute. In certain fields, such as medicine and chemistry, the demand for information is great enough to justify the expense of comprehensive indexing, but these disciplines are the exceptions. Even monograph cataloguing is usually restricted to an overall record of the monograph rather than detailed cataloguing of individual topics within a book.
Most items in museums, archives, and library special collections are not catalogued or indexed individually.
In digital libraries, many items are worth collecting but the costs of cataloguing them individually cannot be justified. The numbers of items in the collections can be very large, and the manner in which digital library objects change continually inhibits long-term investments in catalogs. Each item may go through several versions in quick succession. A single object may be composed of many other objects, each changing independently. New categories of object are being continually devised, while others are discarded. Frequently, the user's perception of an object is the result of executing a computer program and is different with each interaction. These factors increase the complexity and cost of cataloguing digital library materials.
For all these reasons, professional cataloguing and indexing is likely to be less central to digital libraries than it is in traditional libraries. The alternative is to use computer programs to create index records automatically. Records created by automatic indexing are normally of poor quality, but they are inexpensive. A powerful search system will go a long way towards compensating for the low quality of individual records. The web search programs prove this point. They build their indexes automatically. The records are not very good, but the success of the search services shows that the indexes are useful. At least, they are better than the alternative, which is to have nothing. Panel 10.4 gives two examples of records that were created by automatic indexing.
Panel 10.4
Examples of automatic indexing
The two following records are typical of the indexing records that are created automatically by web search programs. They are lightly edited versions of records that were created by the Altavista system in 1997.
Digital library concepts. Key Concepts in the Architecture of the Digital Library. William Y. Arms Corporation for National Research Initiatives Reston, Virginia...
http://www.dlib.org/dlib/July95/07arms.html - size 16K - 7-Oct-96 - English
Repository References. Notice: HyperNews at union.ncsa.uiuc.edu will be moving to a new machine and domain very soon. Expect interruptions. Repository References. This is a page.
http://union.ncsa.uiuc.edu/HyperNews/get/www/repo/references.html - size 5K - 12-May-95 - English
The first of these examples shows automatic indexing at its best. It includes the author, title, date, and location of an article in an electronic journal. For many purposes, it is an adequate substitute for a record created by a professional indexer.
The second example shows some of the problems with automatic indexing. Nobody who understood the content would bother to index this web page. The information about location and date is probably all right, but the title is strange and the body of the record is simply the first few words of the page.
Much of the development that led to automatic indexing came out of research in text skimming. A typical problem in this field is how to organize electronic mail. A user has a large volume of electronic mail messages and wants to file them by subject. A computer program is expected to read through them and assign them to subject areas.
This is a difficult task for people to carry out consistently and a very difficult one for a computer program, but steady progress has been made. The programs look for clues within the document. These clues may be structural elements, such as the subject field of an electronic mail message; they may be linguistic clues; or the program may simply recognize key words.
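The use of clues can be sketched with a toy mail filer that tries the structural clue (the subject field) first and falls back to key words in the body. The subject areas and key words are invented for this illustration, linguistic clues are left out entirely, and real text-skimming programs are far more sophisticated.

```python
from email import message_from_string

# Hypothetical subject areas and the key words that suggest them.
KEYWORDS = {
    "budget": {"invoice", "budget", "expenses"},
    "meetings": {"agenda", "meeting", "schedule"},
}


def classify(raw_message: str) -> str:
    """Assign a message to a subject area, trying the strongest clue first."""
    msg = message_from_string(raw_message)
    # Structural clue: words in the Subject header.
    subject_words = set((msg.get("Subject") or "").lower().split())
    # Weaker clue: key words anywhere in the body (plain-text messages only).
    body_words = set(msg.get_payload().lower().split())
    for words in (subject_words, body_words):
        for area, keys in KEYWORDS.items():
            if words & keys:
                return area
    return "unfiled"


first = "From: a@example.org\nSubject: Agenda for Friday\n\nPlease send your expenses too.\n"
second = "From: b@example.org\nSubject: Hello\n\nThe invoice is attached.\n"

print(classify(first))   # meetings: the subject field outranks the body
print(classify(second))  # budget: found only by a key word in the body
```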
Automatic indexing also depends upon clues to be found in a document. The first of the examples in Panel 10.4 is a success, because the underlying web document provides useful clues. The Altavista indexing program was able to identify the title and author. For example, the page includes the tagged element:
<title>Digital library concepts</title>
The author inserted these tags to guide web browsers in displaying the article. They are equally useful in providing guidance to automatic indexing programs.
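How an indexing program might exploit such tags can be sketched with Python's standard `html.parser` module. This is not the method used by Altavista, whose internals are not described here; the class name, the record fields, and the sample page are assumptions made for illustration. The sketch takes the title from the `<title>` element and, as in the records of Panel 10.4, uses the first few words of the page as the body of the record.

```python
from html.parser import HTMLParser


class IndexRecordBuilder(HTMLParser):
    """Collects the <title> text and the visible body text of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._in_body = False
        self.title = ""
        self.body_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            self.body_words.extend(data.split())


def build_record(html: str, url: str, max_words: int = 12) -> dict:
    """Build a crude index record from the clues the page provides."""
    parser = IndexRecordBuilder()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "summary": " ".join(parser.body_words[:max_words]),
        "url": url,
        "size": len(html.encode("utf-8")),  # size in bytes
    }


# A simplified stand-in for the article, not its actual source.
page = """<html><head><title>Digital library concepts</title></head>
<body><h1>Key Concepts in the Architecture of the Digital Library</h1>
<p>William Y. Arms</p></body></html>"""

record = build_record(page, "http://www.dlib.org/dlib/July95/07arms.html")
print(record["title"])    # Digital library concepts
print(record["summary"])
```

A page without a `<title>` element would yield an empty title, which is one way that poor records arise.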
One of the potential uses of mark-up languages, such as SGML or XML, is that the structural tags can be used by automatic indexing programs to build records for information retrieval. Within the text of a document, the string "Marie Celeste" might be the name of a person, a book, a song, a ship, a publisher, a play, or might not even be a name. With structural mark-up, the string can be identified and labeled for what it is. Thus, information provided by the mark-up can be used to distinguish specific categories of information, such as author, title, or date.
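The point can be illustrated with a small XML fragment in which the same string appears under different tags. The element names are invented for this sketch and do not come from any particular DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up: the element names are invented for this sketch.
DOCUMENT = """
<catalog>
  <item><ship>Marie Celeste</ship></item>
  <item><book><title>Marie Celeste</title></book></item>
</catalog>
""".strip()


def categories_of(text: str, xml_source: str) -> list:
    """Return the tag of every element whose text is exactly `text`."""
    root = ET.fromstring(xml_source)
    return [el.tag for el in root.iter() if (el.text or "").strip() == text]


print(categories_of("Marie Celeste", DOCUMENT))  # ['ship', 'title']
```

Without the tags, an indexing program would see only two identical strings; with them, it can file one occurrence as a ship and the other as a title.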
Automatic indexing is fast and cheap. The exact costs are commercial secrets, but they are a tiny fraction of one cent per record. For the cost of a single record created by a professional cataloguer or indexer, computer programs can generate a hundred thousand or more records. It is economically feasible to index huge numbers of items on the Internet and even to index them again at frequent intervals.
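These figures are consistent with the rule of thumb quoted earlier. Taking fifty dollars as the cost of one professionally created record, and a hundred thousand as the number of automatic records that the same money buys, gives an upper bound on the cost of each automatic record. The variable names are illustrative only.

```python
MANUAL_COST_DOLLARS = 50.0          # rule of thumb for one professional record
RECORDS_PER_MANUAL_COST = 100_000   # "a hundred thousand or more"

cost_per_record_dollars = MANUAL_COST_DOLLARS / RECORDS_PER_MANUAL_COST
cost_per_record_cents = cost_per_record_dollars * 100

print(f"{cost_per_record_cents:.2f} cents per record")  # 0.05 cents per record
```

Since the text says "a hundred thousand or more", one twentieth of a cent is a ceiling rather than an estimate of the actual cost.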
Creators of catalogs and indexes can balance costs against perceived benefits. The most expensive forms of descriptive metadata are the traditional methods used for library catalogs, and by indexing and abstracting services; structuralist Dublin Core will be moderately expensive, keeping most of the benefits while saving some costs; minimalist Dublin Core will be cheaper, but not free; automatic indexing has the poorest quality at a tiny cost.
Attaching metadata to content
Descriptive metadata needs to be associated with the material that it describes. In the past, descriptive metadata has usually been stored separately, as an external catalog or index. This has many advantages, but requires links between the metadata and the object it references. Some digital libraries are moving in the other direction, storing the metadata and the data together, either by embedding the metadata in the object itself or by having two tightly linked objects. This approach is convenient in distributed systems and for long-term archiving, since it guarantees that computer programs have access to both the data and the metadata at the same time.
Mechanisms for associating metadata with web pages have been a subject of considerable debate. For an HTML page, a simple approach is to embed the metadata in the page, using the special HTML tag <meta>, as in Table 10.1. These are the meta tags from an HTML description of the Dublin Core Element Set. Note that the choice of tags is a system design decision. The Dublin Core itself does not specify how the metadata is associated with the material.
Table 10.1
Metadata represented with HTML <meta> tags
<meta name="DC.subject"
content="dublin core metadata element set">
<meta name="DC.subject"
content="networked object description">
<meta name="DC.publisher"
content="OCLC Online Computer Library Center, Inc.">
<meta name="DC.creator"
content="Weibel, Stuart L., [email protected].">
<meta name="DC.creator"
content="Miller, Eric J., [email protected].">
<meta name="DC.title"
content="Dublin Core Element Set Reference Page">
<meta name="DC.date"
content="1996-05-28">
<meta name="DC.form" scheme="IMT"
content="text/html">
<meta name="DC.language" scheme="ISO639"
content="en">
<meta name="DC.identifier" scheme="URL"
content="http://purl.oclc.org/metadata/dublin_core">
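Metadata embedded in this way can be read back by a program. The following is a minimal sketch using Python's standard `html.parser` module; the class name is hypothetical, and the convention of keeping each value together with its optional scheme is a design choice made for this example, not something the Dublin Core prescribes.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collects <meta name="DC.*"> tags; elements may repeat."""

    def __init__(self):
        super().__init__()
        self.elements = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            element = name[3:].lower()  # drop the "DC." prefix
            # Keep the scheme (if any) alongside the content.
            self.elements[element].append(
                (attributes.get("content"), attributes.get("scheme"))
            )


# Three of the tags from Table 10.1.
html = """
<meta name="DC.subject" content="dublin core metadata element set">
<meta name="DC.subject" content="networked object description">
<meta name="DC.language" scheme="ISO639" content="en">
"""

parser = DublinCoreMetaParser()
parser.feed(html)
print(dict(parser.elements))
```

The repeated subject element yields two values, and the language value keeps its scheme so that a program can tell which code list "en" was drawn from.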
Since meta tags cannot be used with file types other than HTML and rapidly become cumbersome, a number of organizations working through the World Wide Web Consortium have developed a more general structure known as the Resource Description Framework (RDF). RDF is described in Panel 10.5.