How much data have we lost?

Can we quantify the extent of the preservation problem for digital materials?

Why would we want to do this? We need to document some specific examples to supplement the handful of case studies (many of them anecdotal) that are currently all that can be found. Some of these examples are probably present in unpublished reports or business cases, but the publicly-accessible literature contains little. Beagrie suggests that ‘statistics on current losses are difficult to compile. Wider overviews are rare’ (Beagrie, 2004, pp.1–2). If we cannot produce specific examples, we lay ourselves open to the charge of being alarmist. A useful question to pose, then, is this: Is the problem of digital preservation as great as we have assumed?

One commonly-applied criterion for measuring the quality of a piece of research or scientific investigation is to ascertain that it has referred to (and preferably has built on) what has already been investigated; a literature review is the usual way in which the results of this investigation are presented. Another criterion is the presence of a clear statement of the problem to be investigated.

In other words, we expect a clear understanding of the nature and extent of the problem to be demonstrated. Applying these criteria to the preservation of digital information would call for, at the least, a literature search to ascertain, as accurately as possible, just how large the problem is. We need answers to these questions: How much digital information has been lost and how much has been compromised? To what extent has the data been compromised? Is the problem of digital preservation as great as we have assumed? We assume that there is a problem, and many statements, including those in this book, have been based on the assumption that there is a problem. Phrases such as ‘data losses through poor management of digital data’ (Beagrie and Greenstein, 1998, Preface) and ‘there is general consensus that digital preservation is both an imperative and a challenge’ (Lavoie, 2003, p.2) are readily found – but little else.

It is valid, therefore, to ask how large the problem is – it may even be crucial to the future of digital preservation, for instance to secure resources. An inability to answer this question could lead to scepticism about whether the problem is as great as claimed and to conjecture that we may have been needlessly alarmed into believing the issue is more urgent than it really is. An attempt to answer these questions is especially valid in the context of the history of the preservation of documentary heritage. Two examples – brittle books and nitrate film – come to mind. In both of these the initial concerns were undoubtedly well-meaning and were based on the best information available at that time, but the assumptions on which action was taken have been established as faulty to varying extents, with serious consequences. The preceding section in this chapter noted the extent of the preservation problem for paper-based material. This concern precipitated expenditure of large amounts of money by the US National Endowment for the Humanities on its Brittle Books Program, which has microfilmed about one million books since it began in 1988. But was this money well spent? There is now some evidence to suggest that the rate at which paper becomes brittle is not as high as the earlier studies, on which the Brittle Books Program was based, suggested. This has led to some strident critiques (the most notable among them that of Baker (2001)). Similarly, early projections about the rate at which cellulose nitrate film would deteriorate resulted in large-scale reformatting of these films to cellulose acetate stock, but there is now considerable concern about the stability of the new ‘archival’ carriers (ANICA, 2002).

The examples of brittle books and nitrate film might suggest that we urge caution in the case of digital information. This is an excellent reason to put some effort at an early stage into attempting to quantify the extent of digital information loss or compromise, or, at the very least, to document some specific examples to supplement the few studies. The desirability of more documented examples and case studies has been recognized for some years. For example, Ross and Gow concluded that ‘information about data loss, recovery, and risk is very difficult to acquire . . . more case studies about data loss and rescue need to be collected’ (Ross and Gow, 1999, p.vi) and James noted that ‘documented examples of historically significant data loss are rare’ (James, 2001).

The term data loss used here also includes data that is compromised, that is, degraded to the extent that its quality is affected. (The phrases loss of data integrity or loss of data authenticity might also be used.) The data may still be accessible, but we have no clear idea of what they mean, what software was used to create them, and so on.

The question ‘How much digital information has been lost and how much has been compromised?’ is difficult to answer – maybe impossible to answer.

Although studies of the kind of data which should be preserved are readily available, such as scientific data (National Research Council, 1995), historical data (Higgs, 1998) and statistical data (Royal Statistical Society and UK Data Archive, 2002), no general estimates of quantity based on solid evidence (as distinct from conjecture) have been located, and few specific examples or carefully documented case studies appear to exist. The same examples are trotted out, even where they are no longer in the ‘lost or compromised’ category: the BBC’s Domesday Project, NASA data, the Viking Mars mission, the Combat Area Casualty file containing prisoner of war and missing in action information for the Vietnam war, the first e-mail, the first web site, as described in more detail below.

It is doubtful that we will be able to accurately quantify the extent of loss.

If we could, the answer would be inextricably bound up with the issue of selection for preservation. Are we assuming that we need to preserve most, perhaps even all, digital materials, rather than be selective? Do we have unrealistic expectations about the quantity of information (and not just digital information) that we want to preserve? As a crude (and unlikely) example, if we assume that all e-mails need to be kept, then the loss of any becomes a problem. One argument is that anything significant is likely to be maintained anyway, so should we be concerned about the rest? Some of the examples which follow assume that the first (e-mail, web site, and so on) is worth preserving – but is this necessarily the case? It is, often, a view developed in hindsight. Betts tells us that Ray Tomlinson, principal engineer at BBN Technologies in Cambridge, Massachusetts, did not save the first network e-mail ever sent in 1972 because ‘it just didn’t seem worth saving . . . Even if backup tapes did exist, they might not be readable. They were just mag tapes, and after seven or eight years, the oxide starts falling off, especially from tapes of that era’ (Betts, 1999). Selection is noted in more detail in Chapter 4.

Specific examples, although few in number, provide an indication of how great the problem of loss or compromise of digital materials may be. The most often quoted, indeed overused, examples are those cited in the 1996 report of the Task Force on Archiving of Digital Information. Because they have been reported very widely since, they warrant quoting at some length. The report notes the case of the US Census of 1960.

As it compiled the decennial census in the early sixties, the Census Bureau retained records for its own use in what it regarded as “permanent” storage. In 1976, the National Archives identified seven series of aggregated data from the 1960 Census as having long-term historical value. A large portion of the selected records, however, resided on tapes that the Bureau could read only with a UNIVAC type-II-A tape drive. By the mid-seventies, that particular tape drive was long obsolete, and the Census Bureau faced a significant engineering challenge in preserving the data from the UNIVAC type II-A tapes. By 1979, the Bureau had successfully copied onto industry-standard tapes nearly all the data judged then to have long-term value.

The report notes the effect of this well-publicized ‘loss’:

the data rescue effort was a signal event that helped move the Committee on the Records of Government six years later to proclaim that “the United States is in danger of losing its memory.” The Committee did not bother to describe the actual details of the migration of the 1960 census records. Nor did it analyze the effects on the integrity of the constitutionally-mandated census of the nearly 10 000 (of approximately 1.5 million) records of aggregated data that the rescue effort did not successfully recover. Instead, it chose to register its warning on the dangers of machine obsolescence in apocryphal terms.

With more than a little hyperbole, it wrote that “when the computer tapes containing the raw data from the 1960 federal census came to the attention of NARS [the National Archives and Records Service], there were only two machines in the world capable of reading those tapes: one in Japan, and the other already deposited in the Smithsonian as a relic . . . Other examples . . . equally illustrate how readily we can lose our heritage in electronic form when the custodian makes no plan for long-term retention in a changing technical environment. In 1964, the first electronic email message was sent either from the Massachusetts Institute of Technology, the Carnegie Institute of Technology or Cambridge University. The message does not survive, however, and so there is no documentary record to determine which group sent the pathbreaking message. Satellite observations of Brazil in the 1970s, crucial for establishing a time-line of changes in the Amazon basin, are also lost on the now obsolete tapes to which they were written (Task Force on Archiving of Digital Information, 1996, pp.2–3).

Rothenberg reminds us of some examples noted in a 1990 US House of Representatives report:

hundreds of reels of tape from the Department of Health and Human Services; files from the National Commission on Marijuana and Drug Abuse, the Public Land Law Review Commission, the President’s Commission on School Finance, and the National Commission on Consumer Finance; the Combat Area Casualty file containing POW and MIA information for the Vietnam war; herbicide information needed to analyze the impact of Agent Orange; and many others (Rothenberg, 1999b, pp.1–2).

He reiterates the paucity of specific examples and offers a reason:

To date there appear to be few documented cases of unequivocal loss, but this may simply reflect the fact that documents or data that are recognized as important while they are still retrievable are the ones most likely to be preserved (Rothenberg, 1999b, p.2).

Another example frequently cited is the BBC Domesday Project. This project captured the national imagination in the UK and resulted in a multi-media version of the Domesday Book on videodisc, produced to mark the 900th anniversary of the original. It became inaccessible in the late 1980s as the hardware platform for which it was developed, the BBC microcomputer, became obsolete. The data has since been restored using emulation techniques (Digital Domesday Book Unlocked, 2002; CAMiLEON, 2002?; Mellor, 2003). This project is also noted in Chapter 7.

Cook’s 1995 call to action gives some Canadian examples of data loss where recordkeeping practices were ignored in the move to online recordkeeping. Cook noted that ‘the National Archives of Canada . . . found not only that 30 out of 100 randomly chosen policy documents could not be found in the government’s paper records, but also that no system was in place to safeguard the contents of the electronic system’. Ontario Hydro’s nuclear power plant failed to keep adequate electronic or paper records of its construction and operation (Cook, 1995).

Some disciplines and industries have investigated their own practices and here the evidence is somewhat firmer. One of these is archaeology in the United Kingdom. The growth in computer use and the increase in digital data sets used by archaeologists resulted in the establishment of the Archaeology Data Service, part of the AHDS (Arts and Humanities Data Service), whose web site provides studies. One such study is of the Newham Museum Archaeological Service archives, described here as an indication both of the extent of data loss and of the reasons why data is lost (Dunning, 2001). The Newham Museum Archaeological Service closed suddenly in 1998. Data on the hard disks of its computers were transferred onto floppy disks and were presented to the Archaeology Data Service (ADS) in the hope that the data could be retrieved. There were 239 disks with a total of 6350 files containing graphics, text and data in a variety of formats. Twenty-five files were corrupted and 12 were unrecoverable. However, the lack of documentation made much that was recovered unusable for archaeological purposes because the data made no sense. This case study teaches us that

The loss of data from the Newham Archive is as much to do with poor project planning as it is to do with preservation . . . Preserving and documenting data should not be an additional chore at the end of a project, but an on-going process that is integrated into the creation of the data (Dunning, 2001).

Scientific data have been the focus of many studies. One is Preserving Scientific Data on Our Physical Universe (National Research Council, 1995), which indicates what scientific data were then available from United States scientific observation and what they have been and might be used for. It includes some comments about what has survived. Space physics data are among these. This area of research has generated about 50 gigabytes of data per year over the last 30 years and much of this was ‘archived’ by sending the tapes, and sometimes also relevant documentation, to the NSSDC (National Space Science Data Center). However, ‘there are many data at the NSSDC that most scientists would find difficult to use with only the information originally supplied’ (National Research Council, 1995, p.21). This report also notes the Landsat data, a large part of which resided ‘on tapes that cannot be read by any existing hardware. Recent data-rescue efforts have been successful in getting older data into accessible form, but these efforts are time-consuming and costly’ (National Research Council, 1995).

Humphrey gives examples of the data generated by research funded by the Social Sciences and Humanities Research Council of Canada. Of a set of 150 studies from 1977 to 1980, only datasets from three could be located in 1998 (Humphrey, 2003).

Other examples come from a range of areas. For electronic journals, Warner notes ‘the irony of the demise’, due to lack of funding, of an archive established to archive electronic journals. The CICNet Journal Archive archived electronic journals from 1991 to 1997 but, according to Warner, has now vanished (Warner, 2002, quoting Wiggins, 2001). Warner also notes ‘the response to a query from Science Direct [Elsevier’s web platform] was that at least 2% of its electronic journal content is missing’ (Warner, 2002). Commercial data recovery companies provide snapshots. The web site of Authentec International provides six case studies of loss of data, much of which was subsequently recovered: for example, 15 per cent of the data on two diskettes chewed by a dog were not recoverable, and for a laptop computer damaged in the cargo hold of a plane 36 out of 112 bad sectors of its hard drive were unrecoverable (Authentec International, 2003). dataVault, citing research from CBL, Gartner Research, Meta Group, and SunGard Data Systems, reports that 94 per cent of all PC users surveyed experienced ‘at least one significant data loss, on average, once a year’, that fewer than one per cent of companies perform daily data backups, and only 20 per cent of those that back up their data store backups offsite (Global Data Vault Inc., 2005). Although the web site of Ontrack Services does not provide specific examples of data loss, it does indicate the high costs of data recovery and the causes of data loss (Ontrack Data Recovery, 2004). While studies of document persistence on the web (such as Koehler, 2004) are primarily studies of access to data, they also throw some light on how much data is lost. Smith examined the longevity of a number of Australian winery web sites and concluded that, although a high percentage of them were still accessible, there was a significant loss of data within the sites (Smith, 2004). That the Electronic Literature Organization has mounted a project to identify threatened and endangered electronic literature and promote its protection (Electronic Literature Organization, 2005) is another example of action motivated by the assumption that digital materials have been lost, or are, at the very least, threatened with loss.

Current state of awareness of the digital

