
Internet Preservation Consortium, JSTOR, DSpace, LOCKSS, and OceanStore, and the international collaborations noted are those of UNESCO, RLG, PADI, OCLC, and CAMiLEON.

International services

The Internet Archive (www.archive.org)

The Internet Archive, established in 1996 by Brewster Kahle, is a non-profit organization based in San Francisco whose aim is to provide permanent access to digital materials, primarily those on the web. Its web site (on 4 January 2005) describes its aims as ‘working to prevent the Internet – a new medium with major historical significance – and other “born-digital” materials from disappearing into the past’ and promoting the ideals of ‘open and free access to literature and other writings’ which have ‘long been considered essential to education and to the maintenance of an open society.’ The Internet Archive is funded by donations from individuals and philanthropic organizations (among them the Hewlett Foundation, the Sloan Foundation, the Kahle/Austin Foundation, and the National Science Foundation) and by contract work it undertakes for bodies such as the Library of Congress, the national archives of the US and Britain, and the Bibliothèque nationale de France.

International

• Services: The Internet Archive; International Internet Preservation Consortium; JSTOR; DSpace; LOCKSS; OceanStore

• Alliances: UNESCO; RLG; PADI; OCLC; CAMiLEON

Regional

• Services: NEDLIB

• Alliances: ERPANET; Digital Recordkeeping Initiative

National

• Services: Koninklijke Bibliotheek’s e-Depot; National Archives (UK) Digital Archive; Digital Curation Centre; AHDS; UK Web Archiving Consortium

• Alliances: Digital Preservation Coalition; NDIIPP; Digital Preservation Testbed

Sectoral

• Services: PARADISEC

• Alliances: JISC; CEDARS

Figure 9.1 Initiatives and Collaborations Described in this Chapter

The Internet Archive contained over 400 terabytes at November 2004, the result of web crawls of all publicly accessible material on the web every two months plus, from 1999, targeted web crawls, some commissioned by specific organizations. This has resulted in several collections of web sites:

• The UK Central Government Web Archive: selected UK Government websites from 2003, collected for the National Archives (UK)

• Election 2002 Web Archive: almost 4000 web sites relating to the 2002 US elections, collected for the Library of Congress

• September 11th: archived web sites relating to the events of 11 September 2001 in the US, collected for the Library of Congress

• Election 2000: web sites relating to the US elections held in 2000, commissioned by the Library of Congress

• Web Pioneers: web sites illustrating the early years of the internet.

The Internet Archive is not, as is popularly believed, a complete depiction of the web. It does not capture password-protected sites, dynamically-generated content, and other material. It justifies its inclusive approach of archiving the entire publicly-available web by arguing that the cost of selection is greater and riskier than capturing all. The thorny question of intellectual property rights is addressed by collecting all publicly available material, but responding to requests for privacy and blocking access if a site’s owner requests this.

Technically, the Internet Archive is based on a large number of personal computers with IDE hard disks. The data is stored on DLT tapes and hard drives in ARC file format, a format being promoted as a standard for archiving internet material by Alexa Internet, which supplies the Archive with its web crawls. (Specifications for the ARC file format are available at pages.alexa.com/company/arcformat.html.) Copies of the collections are maintained at several sites, currently in Egypt and Holland. The data is migrated as appropriate. To counter data format obsolescence the Archive intends to collect appropriate software and emulators. The Internet Archive has until recently been primarily interested in capturing and storing web sites, but is now turning its attention to providing access. The Wayback Machine, launched in 2001, provides access to archived versions of web pages.

The Internet Archive is active in research and development, for example to develop better web crawlers and better user middleware. It actively seeks collaboration, such as in the International Internet Preservation Consortium (netpreserve.org), formed in 2003 (described below), and in the Open Access Text Archive, a collaboration with libraries from many countries, which is putting digitized books into open-access archives. In December 2004 over 27 000 books were available, with another 50 000 expected in the first quarter of 2005; the goal is one million books.

The Internet Archive has influenced digital preservation by demonstrating that large quantities of web materials can be archived over time. (This section is based on Lyman and Kahle (1998), Smith (2003) and the Internet Archive’s web site (www.archive.org).)

The International Internet Preservation Consortium (netpreserve.org)

The International Internet Preservation Consortium (IIPC) was formed in 2003, initially for three years. Its members are the national libraries of Australia, Canada, Denmark, Finland, Iceland, Italy, Norway, and Sweden, the British Library, the Library of Congress, and the Internet Archive, with overall coordination by the Bibliothèque nationale de France. The goals of the consortium are:

• To enable the collection of a rich body of Internet content from around the world to be preserved in a way that it can be archived, secured and accessed over time

• To foster the development and use of common tools, techniques and standards that enable the creation of international archives

• To encourage and support national libraries everywhere to address Internet archiving and preservation (netpreserve.org).

It has six working groups: Framework, Researchers Requirements, Access Tools, Metrics and Testbed, Deep Web, and Content Management. To date it has published two reports, both available on the Consortium’s web site: one that presents a taxonomy of the challenges that web crawlers encounter when trying to copy content for web archiving, and another that classifies the conditions encountered on web sites (such as static HTML documents, forms, JavaScript) and describes the issues that arise in harvesting these. The IIPC is developing software tools for web archiving and this has resulted in products such as Heritrix (an archive-quality web crawler) and DeepArc (software that extracts database content to XML flat files). The aim is to develop a toolkit of web archiving software that is open source and easy to install. If it achieves this aim, our ability to archive web content will be considerably enhanced. (This section is based on information from the International Internet Preservation Consortium’s web site (netpreserve.org) and from comments made during the Archiving Web Resources International Conference, Canberra, 9–11 November 2004.)

JSTOR (www.jstor.org)

JSTOR, a non-profit organization based in the US, was established in 1994, initially to address the problems libraries faced in providing storage space for long runs of scholarly journals. It now has two aims: to develop a trusted archive of important scholarly journals, and to provide access to these journals as widely as possible. Today, users of JSTOR from many countries can search and retrieve high resolution images of journal issues and pages: 2 224 participants in November 2004, of whom 909 were located outside the US in 86 countries.

JSTOR’s content covers scholarly journals with a focus on the humanities and social sciences: 457 titles and 16.6 million pages in November 2004 (statistics at 23 November 2004 come from the JSTOR web site).

Each journal title is digitized from its first issue as a 600dpi TIFF image and a text file for searching purposes. JSTOR’s preservation strategies focus on stable standard formats, data backup, and redundancy (Heterick, 2002, pp.114–116).

JSTOR is extending its activities into born-digital materials as one partner in the Electronic-Archiving Initiative (or E-Archive). This seeks to preserve the publishers’ versions (the source files) of e-journals, because these often contain material that is not presented in the online versions, for example high resolution image files. To date planning and discussions with publishers and libraries have occurred, a working prototype archive has been developed, a production-level archive is now in development, and verification and normalization procedures are being developed.

JSTOR’s significance for digital preservation lies mainly in its business model, which provides benefits to all of its stakeholders – publishers, libraries, and users. It is subscription-based, and charges an initial one-off fee to all subscribers, which supports the costs incurred in digitizing journals and managing the resulting files. This business model, Smith suggests, ‘promises to be sustainable over time’ (Smith, 2003, p.21). (This section is based on Heterick (2002), Smith (2003), a presentation by Eileen Gifford Fenton at the DPC Forum, London, 23 June 2004, and the JSTOR web site (www.jstor.org).)

DSpace (www.dspace.org)

DSpace began life as a sectoral initiative, as an institutional repository for the material produced by staff of MIT (Massachusetts Institute of Technology). At an early stage in its development DSpace was made available to other institutions and has been adopted around the world. DSpace is an institutional repository ‘designed to enable the capture, distribution and preservation of the intellectual output of MIT’ but always ‘with a view to its adoption by, and federation with, other institutions’ (Greenan, 2003, p.8). Developed jointly by MIT Libraries and Hewlett-Packard, it was trialled within MIT from February 2002 and launched in September 2002. The benefits of using DSpace, according to its publicity, include its making content quickly and widely available, and its indexing capabilities; preservation benefits include

• ‘Long-term preservation for a variety of digital formats including text, audio, video, images, datasets, and more

• A long-term stable URL that can be used in a citation to link to items in DSpace’ (DSpace Communications Kit, 2003).

A significant characteristic of DSpace is that it is an open source system. The source code for DSpace was made available under an Open Source licence in November 2002. These licences are promoted by the Open Archives Initiative, whose aims are to develop protocols, standards, and tools that promote interoperability among multiple repositories (see www.openarchives.org). Since November 2002 DSpace has been downloaded by several thousand institutions, more than 1500 in 2003 alone (Greenan, 2003, p.8).

DSpace distinguishes between ‘bit preservation’ and ‘functional preservation’.

Bit preservation ensures that a file remains exactly the same over time – not a single bit is changed – while the physical media evolve around it. Functional preservation goes further: the file does change over time so that the material continues to be immediately usable in the same way it was originally while the digital formats (and physical media) evolve over time (DSpace FAQ, 2003).
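One common way to check that ‘not a single bit is changed’ is to record a checksum when a file is deposited and to recompute it later. The sketch below illustrates that idea only; the source does not say how DSpace itself verifies files, and the function names, the choice of SHA-256, and the file name are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bits_unchanged(path, recorded_digest):
    """True if the file's current digest equals the digest recorded at deposit."""
    return sha256_of(path) == recorded_digest


def demo():
    """Deposit a file, verify it, overwrite it with altered bytes, verify again."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "deposit.bin")  # hypothetical file name
        with open(path, "wb") as handle:
            handle.write(b"archived content")
        recorded = sha256_of(path)  # would be stored alongside the item
        before = bits_unchanged(path, recorded)
        with open(path, "wb") as handle:
            handle.write(b"archived c0ntent")  # simulate a corrupted copy
        after = bits_unchanged(path, recorded)
    return before, after
```

Calling demo() returns a pair of booleans: the result of the check before the simulated corruption and the result after it. A check of this kind detects change but cannot repair it, which is why bit preservation is normally paired with backup copies.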

DSpace recognizes that faculty will create digital materials in a wide variety of formats to suit their own aims, and that repositories must, therefore, handle these formats. It accepts all forms of digital materials and defines three levels of preservation for file formats – supported, known, or unsupported. Bit-level preservation is carried out for all three levels. Supported file formats (those that are open and ‘archival’, such as TIFF, SGML, XML, AIFF, and PDF) are functionally preserved using format migration or emulation techniques. Known file formats (popular proprietary formats such as Microsoft Word and PowerPoint, Lotus 1-2-3, and WordPerfect) will rely on the likelihood that third party format migration tools will be developed for them. Unsupported file formats (those about which little is known, such as unique software programs) will not have any functional preservation applied. DSpace developers are working on procedures for uploading processes which will convert unsupported or known formats to supported ones, and are also developing its capability to carry out regular format migrations. DSpace assigns to material submitted ‘a unique identifier, stores provenance information, maintains an auditable history and record of changes to the archive [and] provides persistent storage’ (Sullivan et al., 2004).
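The three levels can be pictured as a lookup from format to support level, with the preservation actions following from the level. This is a minimal sketch of the policy as described in the text, not DSpace’s own code or registry; the names SupportLevel, FORMAT_LEVELS, and preservation_plan are hypothetical, and the format labels are simply the examples given above.

```python
from enum import Enum


class SupportLevel(Enum):
    SUPPORTED = "supported"
    KNOWN = "known"
    UNSUPPORTED = "unsupported"


# Example formats named in the text; the string keys are illustrative labels.
FORMAT_LEVELS = {
    "TIFF": SupportLevel.SUPPORTED,
    "SGML": SupportLevel.SUPPORTED,
    "XML": SupportLevel.SUPPORTED,
    "AIFF": SupportLevel.SUPPORTED,
    "PDF": SupportLevel.SUPPORTED,
    "Microsoft Word": SupportLevel.KNOWN,
    "PowerPoint": SupportLevel.KNOWN,
    "Lotus 1-2-3": SupportLevel.KNOWN,
    "WordPerfect": SupportLevel.KNOWN,
}


def preservation_plan(format_name):
    """Return (level, actions) for a format; unlisted formats are unsupported."""
    level = FORMAT_LEVELS.get(format_name, SupportLevel.UNSUPPORTED)
    actions = ["bit preservation"]  # applied at all three levels
    if level is SupportLevel.SUPPORTED:
        actions.append("functional preservation by migration or emulation")
    elif level is SupportLevel.KNOWN:
        actions.append("functional preservation if third-party tools appear")
    return level, actions
```

For instance, preservation_plan("TIFF") yields the supported level with two actions, while a format that is not listed falls through to the unsupported level and receives bit preservation only.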

MIT also encourages collaboration among DSpace participants. An early collaboration was the DSpace@Cambridge Project (www.lib.cam.ac.uk/dspace) from 2003. As well as implementing a repository at Cambridge University, this collaboration is intended to act as a UK model. In 2003 the DSpace Federation was established, of which all institutions that have implemented DSpace are members. The intention of the DSpace Federation is to share ‘technical innovation, content, and services’ and ‘to promote interoperability among institutional repositories to support distributed services, virtual communities, virtual collections, and cataloging’. This will happen through activities such as sharing in the development and maintenance of the DSpace source code, and promoting the DSpace service and interoperability of archival repositories.

In Australia, DSpace is implemented at the ANU (Australian National University), Canberra (dspace.anu.edu.au). In 2004 ANU tested DSpace as a platform for its digital repository, and was trialling a Google DSpace search facility. DSpace is also being trialled at the University of Melbourne to ascertain its suitability for maintaining and accessing resources located in the University of Melbourne’s Percy Grainger Museum (Sullivan et al., 2004).

DSpace’s influence in digital preservation is growing because it provides a framework in which academic libraries and archives can develop strategies and practices in a collaborative international environment. (This section is based on Barton and Walker (2003), Greenan (2003), Smith (2003), Sullivan et al. (2004) and the DSpace web site (www.dspace.org).)

LOCKSS (lockss.stanford.edu)

The LOCKSS (Lots Of Copies Keep Stuff Safe) project is noted in detail in Chapter 8. It is based on the well-established preservation principle of redundancy (keeping multiple copies as a safeguard against loss). LOCKSS is significant in digital preservation terms because it established the feasibility of replication and peer-to-peer polling using standard personal computers.
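The combination of replication and polling can be illustrated with a toy majority vote: each peer reports a digest of its copy, the digest held by most peers is taken as correct, and dissenting copies are replaced. This is a deliberately simplified sketch and not the LOCKSS protocol, whose details are not given here; the function names, the use of SHA-256, and the in-memory dictionary standing in for networked peers are all assumptions.

```python
import hashlib
from collections import Counter


def digest(content):
    """SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(content).hexdigest()


def poll(replicas):
    """Given {peer_name: bytes}, return (majority_digest, damaged_peer_names)."""
    votes = {peer: digest(content) for peer, content in replicas.items()}
    winner, count = Counter(votes.values()).most_common(1)[0]
    if count <= len(replicas) / 2:
        raise ValueError("no majority among replicas")
    damaged = sorted(peer for peer, vote in votes.items() if vote != winner)
    return winner, damaged


def repair(replicas):
    """Overwrite damaged copies with a majority copy; return repaired peers."""
    winner, damaged = poll(replicas)
    good_copy = next(c for c in replicas.values() if digest(c) == winner)
    for peer in damaged:
        replicas[peer] = good_copy
    return damaged
```

With three peers holding b"x", b"x", and b"y", repair identifies the third peer as damaged and restores its copy from the majority. The sketch refuses to act when no digest has a strict majority, since in that case it cannot tell which copy is sound.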

OceanStore (oceanstore.cs.berkeley.edu)

The peer-to-peer concept demonstrated in LOCKSS is also the basis of OceanStore, whose web site describes it as a

global persistent data store designed to scale to billions of users. It provides a consistent, highly-available, and durable storage utility atop an infrastructure comprised of untrusted servers.

OceanStore’s features include data protection through redundancy and through its cryptographic techniques. Any computer can join the OceanStore infrastructure, and users subscribing only to a single OceanStore service provider can access other OceanStore servers. Data is cached in OceanStore ‘promiscuously: any server may create a local replica of any data object’. This offers benefits such as faster access. OceanStore claims that it can offer ‘durability which exceeds today’s best by orders of magnitude’ because digital material is stored on ‘hundreds or thousands of servers’ and so can be readily reconstructed. ‘Only a global-scale disaster’, they claim, ‘could disable enough machines to destroy the archived object’. Pond, a prototype of OceanStore, is being developed. (This section is based on information available on the OceanStore web site (oceanstore.cs.berkeley.edu).)
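The intuition behind the durability claim can be shown with simple arithmetic. If each server holding a full copy fails independently with probability p over some period, the chance that all n copies are lost is p to the power n, which shrinks rapidly as n grows. The sketch below rests on exactly those assumptions (full replicas, independent failures), which are simplifications introduced for illustration and are not a description of how OceanStore actually stores or reconstructs data; the function name is hypothetical.

```python
def loss_probability(per_server_failure, replicas):
    """Probability that every one of `replicas` independent copies is lost."""
    if not 0.0 <= per_server_failure <= 1.0:
        raise ValueError("per_server_failure must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return per_server_failure ** replicas
```

Under these assumptions, raising the number of copies from 2 to 10 at a per-server failure probability of 0.5 lowers the loss probability from 0.25 to under 0.001. The independence assumption is what a ‘global-scale disaster’ would violate, since such an event would disable many servers at once.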
