SHIRLEY HARPHAM
Department of Anthropology, University of Alberta, Edmonton, Alberta T6G 2H4, Canada
Abstract. Documentation standards are the actual rules of structure, content and value that allow the design and implementation of a museum’s information systems. Clean, useable data is the product of well designed standards and well maintained documentation systems.
Because upgrades to automated documentation systems are inevitable, and each upgrade creates a need to update, reorganize and clean both systems and existing data, a set of procedures to facilitate this process is desirable. The following is a strategy of three rules and six steps for the review of documentation standards in museum databases. The steps include studying standards documents and choosing several as primary sources; setting data structure and formatting rules; adopting or creating controlled vocabularies; producing data dictionaries, prompts and online helps; and finally cleaning the data. The international museum community needs to work toward using common documentation standards by promoting the use of the same established standards.
Improved data dictionary headings are needed for standards, and more universal thesauri, particularly for locality information, are needed to avoid duplication of labor and to set future standards.
INTRODUCTION
This project developed from a need to manage the aftermath of a database upgrade. After the upgrade and migration of data, there was a need to clean, tidy, straighten, or put in order all parts of the documentation system, both the data and the database structure. The aim of this paper is to present a useful set of procedures for electronic database design and data cleaning following a database upgrade. Three rules and six stages or steps are described. Using some international standards documents as a guide to this work brought to light some suggestions for future international data standards initiatives. These are described at the end of the paper.
Museums do not just curate objects. They are also responsible for the information associated with those objects. This information, or documentation, is considered an integral part of the object and is important in its own right. Providing access to this information is a large part of a museum’s function. Museums define and strive to follow professional standards of practice in all their endeavors. Information management, as part of collections management, is one of these areas of standard practice (Museums Alberta 2001). Although information management includes paper as well as automated systems, the focus of this paper is computer databases.
There are many standards documents and resources available in published form and on the web. Most of these contain descriptions of theoretical minimum or best practices for information management: the procedural standards. Many also contain a list of actual rules of structure, content and value that set out the requirements to put the standards into practice for recording and entering data: the documentation standards (Bower et al. 2001). There is consensus among these standards documents that documentation standards are important for providing consistency and efficiency for cataloguing and searching, and they allow sharing
of data (CHIN 2004a). For more details on the role and purpose of documentation standards see the introductory section of the International Guidelines for Museum Object Information: the CIDOC Information Categories (ICOM-CIDOC 1995).
In this paper, three standards sources will be used as examples. The author has used these standards documents extensively as resources in designing and implementing documentation standards and in cleaning and reorganizing databases and data.
1. International Council of Museums (ICOM), especially the International Committee for Documentation (CIDOC).
a) ICOM-CIDOC’s Museum Information Standards (Roberts and Will 2002).
b) ICOM’s Guidelines and Standards for Museums (ICOM-CIDOC 2004).
c) International Guidelines for Museum Object Information: the CIDOC Information Categories (ICOM-CIDOC 1995).
2. The Canadian Heritage Information Network (CHIN 2004a).
a) Standards section.
b) Collections Management section including Humanities and Natural Sciences data dictionaries that provide some documentation standards.
3. MDA (previously the Museum Documentation Association) in the UK (MDA 2005).
a) Standards and Fact Sheet pages.
b) SPECTRUM, MDA’s documentation standard.
These sources recognize three different types of documentation standards: value, format, and structure. Value standards are authority lists and classification schemes or hierarchies, also called vocabulary standards or terminology control. Format standards are the rules for structure and syntax and are also called cataloguing standards or content standards. The third, data structure standards, are the content of the fields and are termed metadata standards by CHIN (2004a). In this paper I will use metadata as defined by Gilliland-Swetland (2000) and used by CHIN (2004a). Gilliland-Swetland defines metadata as “data about data” and then describes how this can produce a very broad conception of metadata that can be seen to have many types, functions, attributes and characteristics. A similar definition is used by CHIN: “museum collections management records (whether paper based or automated) would be considered by some to be ‘metadata’ about the collection” (CHIN 2004b:1). Caution is needed here to avoid confusion about the use of the word “standard.” It can be used to describe a particular national or international standard or as a general term, as in “all museum documentation should follow a written set of standards.” In this paper, the expression “standard document” will describe the former and the term “standard” will be used to describe a general item or a user- or institution-defined set of standards for a particular database or information system.
UPGRADE ISSUES
The University of Alberta, Department of Anthropology, Archaeology Program uses the Oracle-based MIMSY database by Willoughby Associates. We recently upgraded to a new version of MIMSY, and as the upgrade progressed it became clear that several issues would need attention. The first issue is legacy data. Most museum collection documents begin as paper records with
194 COLLECTION FORUM Vol. 21(1–2)
minimal format and other standards. These records were often written in natural language with normal grammar and punctuation, so converting to indexed computer fields is not always simple. Our data has been on a computer system since the 1980s and has followed standards and had defined data dictionaries since the start. However, each time a database is changed there is often considerable review and cleaning required. In the last few decades, many museums have used a variety of computer databases starting with flat records and evolving through several relational databases. Having previous databases necessitates accommodating legacy data from earlier systems. The second issue is that each upgrade or new system spawns its own new metadata fields (CHIN 2004a, Gilliland-Swetland 2000). Our early systems had thirty or forty fields but our most recent system has several hundred individual fields. The third issue is that the default upgrade version will probably have a general data structure that will not only be different from the earlier version but will have to be tailored to individual user needs. Since there will be changes in structure, format and value, user defined documentation standards will be different from the older database version. This will necessitate rethinking standards, rewriting data dictionaries, and moving and cleaning data.
Information Technology (IT) personnel will claim that data can be “easily migrated” from one system to the next, but this does not take into account all the reorganization that is required to bring everything up to a single set of standards and formatting rules. Berendsohn et al. (1999) and Morris (2005) both refer to the fact that the IT industry is continually changing, so the need to move or update systems will continue. Morris (2005) estimates that a database lifecycle will repeat every 10 years. As upgrading and this periodic review and cleaning are inevitable, the following set of procedures may be helpful.
Three simple rules are applicable to these procedures. Morris (2005) describes the first two as core concepts in information modeling. The first, “atomization,” is the idea that more than one concept should never be put in a field. As well as being a problem for individual databases, fields where concepts were combined and need to be split can cause difficulties for distributed systems (Heather Dunn pers. comm. 24 February 2005). The second core concept (Morris 2005) is the “reduction of redundant information.” This idea is fundamental to relational databases and involves adding another table instead of repeating data in more than one row. A good example of this is locality data. If you have many specimens that come from Alberta, Canada, you would want to link the locality information to a second table that acts as the Place Authority instead of continually retyping “Alberta, Canada” in your object table. The third rule is “never overwrite original data,” even if you are correcting mistakes. Metadata fields, including fields for date and attributor, will need to be added in order to properly document this process. This separation reduces the risk of artificial accuracy, preventing inferences made during cleanup or other research from being mistaken for original facts (Morris 2005, Murphey et al. 2004). This partly explains why the number of fields tends to grow each time a system is changed or upgraded.
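The three rules can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample records are hypothetical and are not taken from MIMSY or from any standards document. Province and country are held in separate fields (atomization), locality is stored once in a place authority table (reduction of redundancy), and a correction is logged with its date and attributor rather than written over the original value.

```python
import sqlite3

# Hypothetical schema for illustration only; not the MIMSY structure.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Rule 2: each place is stored once and referenced by objects.
    CREATE TABLE place_authority (
        place_id INTEGER PRIMARY KEY,
        province TEXT NOT NULL,   -- Rule 1: one concept per field
        country  TEXT NOT NULL
    );
    CREATE TABLE object (
        object_id   INTEGER PRIMARY KEY,
        object_name TEXT NOT NULL,   -- original value, never updated
        place_id    INTEGER NOT NULL REFERENCES place_authority(place_id)
    );
    -- Rule 3: corrections sit beside the original, with date and attributor.
    CREATE TABLE correction (
        correction_id   INTEGER PRIMARY KEY,
        object_id       INTEGER NOT NULL REFERENCES object(object_id),
        field_name      TEXT NOT NULL,
        corrected_value TEXT NOT NULL,
        corrected_on    TEXT NOT NULL,   -- ISO 8601 date
        attributor      TEXT NOT NULL
    );
    """
)

conn.execute("INSERT INTO place_authority VALUES (1, 'Alberta', 'Canada')")
conn.executemany(
    "INSERT INTO object VALUES (?, ?, ?)",
    [(1, "projectile piont", 1), (2, "scraper", 1)],  # first name is misspelled
)
conn.execute(
    "INSERT INTO correction "
    "(object_id, field_name, corrected_value, corrected_on, attributor) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "object_name", "projectile point", "2005-02-24", "cataloguer A"),
)


def current_name(object_id):
    """Return the latest corrected object name, or the original if uncorrected."""
    row = conn.execute(
        "SELECT corrected_value FROM correction "
        "WHERE object_id = ? AND field_name = 'object_name' "
        "ORDER BY correction_id DESC LIMIT 1",
        (object_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    return conn.execute(
        "SELECT object_name FROM object WHERE object_id = ?", (object_id,)
    ).fetchone()[0]
```

Calling current_name(1) returns the corrected spelling while the object table still holds the value as first entered, so the history of the record remains visible.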
PROCEDURES
This set of procedures applies across multiple fields: field groups need to be tackled first, and the work then proceeds on a field-by-field basis. All work should be done after consulting the existing data in each field. This makes database design after an upgrade different from design when creating a new database. The appendix provides a concise list of these procedures.
1. Study Standards Documents and Choose Several as Primary Sources
There is no single authoritative reference for documentation standards. Each museum needs to choose several that will work for their purposes. The documents listed earlier in this paper, by ICOM-CIDOC, CHIN, and MDA are good examples for general sources. Each of these will have strengths and weaknesses, so it is advisable to consult at least three. Secondly, choose one or more discipline specific documents. Many examples of discipline specific sources are listed on the ICOM, CHIN and MDA web pages. The International Union of Biological Sciences Taxonomic Database Working Group (TDWG) has a good website, and many of their standards resources have a botanical focus (IUBS-TDWG 2004).
Another example is Cataloguing Cultural Objects (CCO): A Guide to Describing Cultural Works and Their Images (VRA 2003). It is a good idea to reflect what others in your institution and area are doing by including a reference on local standards. This can increase consistency locally and can be useful for topics like legal issues. Finally, you will need documents that are specific to individual fields in your database. Examples of this might include geographic references for locality fields or date standards for fields handling dates.
2. Set Data Structure Standards
By considering field groups it is possible to identify the fields needed to adequately document collections. The minimum number of fields possible should be used while still following the three rules. For each field, define the field and its relationships to other fields. Produce a list of related fields including rules for entry. For example, describe the criteria for deciding whether a term should be recorded in Object Name, or a related field like Object Type.
3. Set Value Standards
Review each field to determine if it should have a controlled terminology. If so, determine if it should be a simple authority or hierarchical thesaurus and whether it should be self-maintained or an existing, established terminology resource. A simple, self-maintained authority can often take the form of a field pop-up or pick list. Examples of existing sources include the Getty Art and Architecture Thesaurus (AAT) (Getty Research Institute 2004) or the Integrated Taxonomic Information System (ITIS 2004). Consult Harvey and Young (1994) for an extensive list of existing vocabularies and classification schemes. Most standards documents recommend using an existing source, but this can be difficult or impractical for some fields.
4. Set Formatting Rules
CHIN, CIDOC, and SPECTRUM are only moderately helpful for providing formatting rules. Some formatting rules are defined, others can be deduced by looking at examples in their data dictionaries, but often no explanation is given.
Formatting rules are fairly straightforward and include whether a field can be written in natural language with normal grammar and punctuation or whether it
should be a single term, the number of occurrences, singular or plural form, capitalization rules, punctuation, abbreviations, language and formats for translation if necessary, whether a field is required, and how to deal with empty fields (e.g., write “unknown” or leave blank). Be very specific in descriptions of all format rules. Use existing formats wherever possible. Examples of available existing formats are date and name formats. For dates, the only standards document that names the International Standard ISO 8601 (Kuhn 2004) as its date standard is ICOM-CIDOC (1995). Several others use a standard that looks like ISO 8601, but do not state a source. It is proposed that all museums should be using ISO 8601 for standard date formats if their individual computer systems will support it. Dates will often require additional fields to handle all required information. An ISO 8601 full date formatted field may need to be supplemented with a date text field for date qualifiers (prior to, later than, uncertain), incomplete dates, ranges, or seasons. This will depend partly on the computer system capabilities and is something to be considered at stage 2 (set data structure standards) as well as this step (Berendsohn et al. 1999, CHIN 2004a, ICOM-CIDOC 1995). Like dates, personal and corporate names often need special consideration. Both CHIN’s Standards (2004a) and ICOM-CIDOC’s Information Categories (ICOM-CIDOC 1995) suggest using the Anglo-American Cataloguing Rules (AACR) for formatting all types of names. This established format gives rules for personal names and corporate bodies as well as geographic names (Gorman 1999). As with a single date format, using a single established format for personal and corporate names across many institutions is considered a worthwhile goal.
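The pairing of a full-date field with a date text field might be sketched as follows in Python. The names DateEntry and make_date_entry are hypothetical, and the sketch assumes that only complete calendar dates in the ISO 8601 extended form YYYY-MM-DD belong in the formatted field; everything else, such as qualifiers, ranges, seasons, and incomplete dates, is kept verbatim in the text field rather than being forced into a false precision.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Full calendar date in ISO 8601 extended format: YYYY-MM-DD.
ISO_FULL_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


@dataclass
class DateEntry:
    iso_date: Optional[str]   # complete ISO 8601 date, or None
    date_text: Optional[str]  # qualifiers, ranges, seasons, incomplete dates


def make_date_entry(raw: str) -> DateEntry:
    """Route a raw date string to the formatted field or the text field."""
    raw = raw.strip()
    match = ISO_FULL_DATE.match(raw)
    if match:
        year, month, day = (int(part) for part in match.groups())
        try:
            date(year, month, day)  # rejects impossible dates such as 2004-02-30
            return DateEntry(iso_date=raw, date_text=None)
        except ValueError:
            pass
    # Anything that is not a valid full date is preserved as entered.
    return DateEntry(iso_date=None, date_text=raw or None)
```

For instance, make_date_entry("2004-03-12") fills the formatted field, while make_date_entry("summer 1987") leaves it empty and keeps the original wording in the text field.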
5. Produce Data Dictionaries, Data Entry Manuals, Prompts and Online Helps
Once the data dictionary is complete for each field, the other documents can be produced quite easily. CHIN (2004a), ICOM-CIDOC’s Information Categories (ICOM-CIDOC 1995), and SPECTRUM (MDA 2005) all contain data dictionaries that can be used as models for design. Anyone who has used CHIN is familiar with the term data dictionary. It is a description of the units of information, usually one for each field in the database. SPECTRUM (MDA 2005) and ICOM-CIDOC (1995) have their descriptions in alphabetical order and call them Information Requirements and Information Groups and Categories, respectively. It is beneficial to have a data dictionary in place from the beginning of the process, but this will need to be amendable as review and cleaning progress. Ensure that all these documents are in place before any new data entry begins.
Seven different fields or categories are suggested for inclusion in data dictionaries. The dictionary should include fields for definition and relationships, which are both data structure standards. It should include rules of entry, covering both format standards and value standards. The value standards in existing data dictionaries are often very general, for example “maintain a list of standard terms” (MDA 2005). It would be possible to be more assertive and add to this first statement something like “enter terms from the Site Authority only” or “use terms from the Getty AAT only.” The data dictionary should include a field for data type. This is a format standard, for example “alpha-numeric string.” Including an Examples category can be very useful. Examples should be taken from the museum’s own data. A Source or Other Standards category will give a reference to where information for the field was found, or a list of other standards that use this same field.
CHIN (2004a) has a data dictionary category similar to this, and Bisby (1994) has an “Other Standards” category used in exactly this way. The final data dictionary category should be Logic or Rationale. For every single decision in each field, state the logic or reasons for the standards. The logic behind this category is that well thought out decisions should be documented to avoid loss or duplication of labor.
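One way to keep the seven suggested categories together for each field is a small record type, sketched here in Python. The class name, attribute names, and the sample entry are hypothetical, and the sample values are invented for illustration; in practice the examples should come from the museum's own data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataDictionaryEntry:
    field_name: str            # the database field being described
    definition: str            # data structure standard
    relationships: List[str]   # related fields (data structure standard)
    rules_of_entry: str        # format and value standards
    data_type: str             # format standard
    examples: List[str]        # ideally drawn from the museum's own data
    sources: List[str]         # source or other standards using this field
    rationale: str             # logic behind every decision for the field

    def missing_categories(self) -> List[str]:
        """List the categories that are still empty for this field."""
        return [name for name, value in vars(self).items() if not value]


# Invented draft entry; the rationale has not been written yet.
object_name_entry = DataDictionaryEntry(
    field_name="Object Name",
    definition="The common name of the object.",
    relationships=["Object Type"],
    rules_of_entry="Single term, singular form; use terms from the Getty AAT only.",
    data_type="alpha-numeric string",
    examples=["scraper"],
    sources=["local data entry manual"],
    rationale="",
)
```

Here object_name_entry.missing_categories() reports that the rationale is still empty, which gives a simple check that every field is fully documented before new data entry begins.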
6. Data Cleaning
The simplest way to clean data is to create a list of distinct field values with a case sensitive alphabetic sort. Scanning such a list will reveal spelling and other format errors. A less experienced worker can do this, or in-house IT people can design scripted tools (see Morris 2005 for a description of some of these automated tools). In addition, someone with expert knowledge or someone who is very familiar with the collections will have to look over the lists to spot errors that casual workers or automated tools may miss. In this description, the dichotomy of data cleaning becomes apparent. This division of cleaning types, simple/expert or IT/curatorial, may account for differences in the upgrade or review time anticipated by these two groups. When IT personnel say that data can be “easily migrated” they may only be taking into account simple data cleaning. In reality, this review or cleanup process must also include the type of extensive research that can only be done by more knowledgeable staff or by consulting experts. This cleanup could include researching taxonomic records, geographic references, field notes, or other original documents.
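The distinct-value list described above can be produced with a few lines of Python. This is only a sketch with invented sample values, and it is not one of the tools described by Morris (2005). Python's default string ordering compares character codes, so upper-case variants sort before lower-case ones, which serves here as a case-sensitive sort; the count beside each value helps show which spellings are the odd ones out.

```python
from collections import Counter


def distinct_values(values):
    """Return (value, count) pairs sorted case-sensitively by value."""
    counts = Counter(values)
    # Values are unique keys, so the sort is decided by the strings alone.
    return sorted(counts.items())


# Invented sample of a province field containing typical entry errors.
province_values = ["Alberta", "alberta", "Alberta ", "Albreta", "Alberta"]

for value, count in distinct_values(province_values):
    # repr() makes trailing spaces and other invisible differences visible.
    print(f"{value!r}: {count}")
```

Scanning the printed list shows the trailing space, the transposed letters, and the lower-case variant as separate entries, each of which can then be checked against the whole record.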
When a record with incorrect data is spotted, the entire record should be examined to clarify the error. It may be that data was simply entered in the wrong field, and requires complementary changes. Rule number 3 applies here: never overwrite original data even if you are correcting mistakes. Of course there are drawbacks to following this rule too closely, like increasing the number of fields.
When the data is clean, it can be used to create pop-up or pick lists for those fields that need a simple, self-maintained authority list. Data entry to these fields can be set to validate against these lists, or filling from the list can be made mandatory. These are examples of data quality control that can help ensure clean data through the rest of the database lifecycle. Examples of other quality controls can be found in Morris (2005).
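Validation against a pick list built from cleaned data might look like the following Python sketch. The function names and terms are hypothetical; a collections management system would normally provide this control itself, so the code only illustrates the idea.

```python
def build_pick_list(clean_values):
    """Build a simple, self-maintained authority list from cleaned data."""
    return frozenset(clean_values)


def validate_entry(value, pick_list):
    """Accept a value only if it appears in the pick list."""
    if value not in pick_list:
        raise ValueError(f"{value!r} is not in the authority list")
    return value


# Invented terms standing in for a cleaned Object Name field.
object_name_pick_list = build_pick_list(["scraper", "projectile point", "core"])
```

With this list, validate_entry("scraper", object_name_pick_list) passes the value through, while a misspelling such as "scrapper" raises an error at data entry instead of reaching the database.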
This may all seem very mundane, but it has been a help to have a written set of procedures to manage the upgrade and review process. Available attention to the process may be sporadic. For example, casual employees, students, or volunteers who work a few hours a week for a term or two may do most of the actual work in a database. With this type of work force, it is difficult to have any sort of continuity in a project. A set of procedures eases this problem by giving step-by-step guidance that can be left and then picked up again at a later date by another worker.
All aspects of database design are necessary for this review. Work on the data, such as data cleaning, cannot proceed until a review of the documentation standards is complete. It will usually be appropriate to follow these procedures in order, but some field groups may be more complicated than others and then it may be necessary to move between steps repeatedly to perfect the standards. The documentation of geographic locality information is an example of a complicated field group. Structure and formats for proximity, certainty, uncertainty, and other