John Tuck
13.1 Introduction
The consumer review appendix of Review for JISC of Core Resource Discovery Services (Grayson and Gomersall, 2006, 52) shows that researchers want immediate and easy access to information.
“Many potential users have extremely limited awareness of the bibliographic services that are available to them... Reliance on Google is widespread... Almost all users have very limited searching skills... Even when using Google, many have never got beyond entering a single term... Many users give up easily... Academic users tend to expect immediate access to any useful documents that they find”.
This need for immediate information gratification was not the case when I first met Reg Carr in 1975. Googlization was far from my thoughts when, as a naive first-year research student, I entered the cataloguing room of the John Rylands University Library of Manchester, and asked to speak to the expert in French studies. This was Reg, who spent considerable time guiding me in the direction of relevant research tools for the study of French verse romances of the thirteenth century.
Three years later, this time in possession of a job as an assistant librarian, I walked into the same cataloguing room. Reg had moved on. I was able to build on his work in developing the collections and making them accessible to the next cohort of University of Manchester students of French, encouraging them in their early days of library catalogues and research. Tools of the trade at that time were not web browsers, pdfs or HTML but razor blades (used by assistant librarians to erase and allow corrections to be made to individual letters typed on to five- by three-inch catalogue cards), multilith masters and typewriter ribbons.
Twenty-eight years later, I was struck by an extract from a letter to the editor of the Oxford Magazine (Evans, 2006).
“The repeatedly-delayed Internal Audit Report on the Library Depository Project at Osney Mead: Review of the Business Case was made available on the internal web in August 2006, but no URL has been published in the Gazette, nor has the document itself. This does not really seem like reporting back to Congregation, as Congregation expected when it gave conditional permission for its land to be used for the new Depository under Statute XVI, A,4: “Recognizing the importance of the proposed book depository for the future delivery of library services in the University, and noting that there will be further reports to Congregation as later stages of the libraries’ estates strategy are planned in detail, Congregation approves the allocation of the site at Osney Mead for the purpose of developing an automated depository for Oxford University Library Services.” (http://www.ox.ac.uk/gazette/2005–6/weekly/220905/agen.htm#15Ref) (6 October 2005)”.
Beautifully crafted letters of this type, frequently published in the Oxford Magazine, and often relating to library provision, reminded me of the four and three quarter years (from 1998 to 2002) I spent supporting Reg, as he guided Oxford libraries through the choppy waters to integration and the establishment of Oxford University Library Services (OULS) in February 2000. For a full account of integration and the process leading to it, see Moran (2005, 262–94).
During this eventful period, Reg taught me much more than a grasp of research tools for French studies. He conveyed the essential qualities of patience, perseverance, determination and resilience in the face of any challenge. French literature was never far away, however. In amongst the papers and reports on the desk of Bodley’s Librarian were odd copies of Cahiers Octave Mirbeau and Reg Carr, Anarchism in France: the case of Octave Mirbeau. A year or two after leaving Oxford, I opened up a copy of Sac au dos, a short story by Mirbeau, which, through its title, took me back to an observation Reg made to me just before I left. “John, when you move on, you will put down one heavy rucksack, and pick up another”. And this is what happened as I moved from integration at Oxford to web archiving at the British Library.
13.2 Web Archiving
Wikipedia (15 November 2006) states that:
“Web archiving is the process of collecting the Web or particular portions of the Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. Due to the massive size of the Web, web archivists typically employ web crawlers for automated collection. The largest web archiving organization is the Internet Archive which strives to maintain an archive of the entire Web. National libraries, national archives and various consortia of organizations are also involved in archiving culturally important Web content”.
The history of web archiving in the British Library goes back to 2001 and a proof of concept pilot called domain.uk. Here an attempt was made to test the feasibility of harvesting a small number of UK websites (up to 100) on the basis of permissions having been granted by the website owners. The pilot proved successful and led to the establishment of a full web archiving programme, now firmly embedded in the British Library’s strategy.
The Library’s long-term vision for web archiving is that by 2016:
• there should be an extensive and accurate archive of the UK domain and the archive should be periodically augmented by further snapshots of the entire domain
• websites gathered automatically should be supplemented by more focused collection of areas of specific interest to the British Library
• the archive should form part of a much larger archive of the entire web
• the Library will work with partners throughout the world to ensure that researchers can access the world’s web archives in a consistent and seamless way
• in accordance with legal deposit legislation, material gathered from the web should be stored and be accessible for future generations of researchers.
As a step in this direction the Library’s current web archiving programme has three main strands: an underpinning collection development policy, work with the UK Web Archiving Consortium (UKWAC) and international developments managed through the International Internet Preservation Consortium (IIPC).
As a legal deposit library with a collecting remit for the UK, the British Library has a responsibility to acquire, preserve and make accessible the UK national published archive. Its British collections represent the collective memory of the nation by retaining for posterity the intellectual and creative output of British publishing. In addition to material acquired through legal deposit, the Library purchases and provides access to research-level material from around the world and appropriate unpublished material in different formats. The collecting responsibility extends to UK non-print material, currently collected on a voluntary basis, but in the future to be covered by the provisions of the Legal Deposit Libraries Act 2003.
13.3 Web Archiving Policy
It is within this context that the British Library has defined its web archiving collection development policy, which can be summarized as follows, and full details of which can be found at http://www.bl.uk/collections/britirish/britishandirish.html:
“The British Library will collect sites selectively from the UK web space by prioritizing the archiving of sites of research value across the spectrum of knowledge. In addition, the Library will archive a selection of sites which are representative of British cultural heritage in all its diversity across the regions. It will also archive a small number of sites which demonstrate web innovation. Once effective methodology is determined, the Library will also adopt a comprehensive strategy by taking periodic snapshots of the entire UK web presence”.
Included within the sites of cultural value category is a thematic strand where the focus is on collecting websites around key events of national life including topical events where there is likely to be frequent change of content and a significant risk of transient appearance on the web.
13.4 Challenges and Benefits
The complexity and scale of web archiving present challenges which need to be shared and worked through collaboratively. It is for this reason that UKWAC was established in October 2003. UKWAC comprises the British Library (lead partner), JISC (Joint Information Systems Committee), The National Archives, the National Library of Scotland, the National Library of Wales, and the Wellcome Library. It was set up with the objective of archiving and making accessible selected websites within the framework of a two-year pilot project and to share the costs and experiences of achieving that objective. These initial aims have been achieved and an open and freely accessible single portal to the web archive collections of the six partner institutions can now be accessed at www.webarchive.org.uk.
The archive went live on 9 May 2005 and makes available more than 1600 different UK web sites, collected on a rights-cleared basis and all accessible free of charge to anyone on the web.
The content reflects the different collection development policies of the six partners and includes a number of thematic collections which reflect either events considered to be of topical importance or cultural significance. Among these are collections relating to the 2005 UK general election (93 sites), the 2004 Tsunami disaster (23 sites), the 7 July 2005 Terrorist attacks on London (49 sites) and sites relating to Women’s issues (117 sites).
Some brief words about two of these collections help illustrate the different challenges and benefits which derive from collaborative working. Three UKWAC partners (the British Library, the National Library of Wales and the National Library of Scotland) were involved in collecting the 2005 UK General Election sites. The very brief window of opportunity for capturing sites during and shortly before the election campaign meant that only a representative sample could be archived. The challenge was to identify, evaluate and select sites as early as possible in the campaign to allow sufficient time to obtain permission to archive and to gather the sites before they disappeared after the election.
The labour-intensive nature of the permissions process resulted in a success rate of about 30% of the sample of 300 sites.
A further complexity was provided by the technical challenges facing the libraries. The web archiving teams had to put in place new procedures and workarounds as the high level of collaborative activity exposed some technical limitations of the PANDAS software which underpins the UKWAC web archiving system.
13.5 Joint Initiatives
The archiving of women’s issues sites was a joint initiative between the British Library and the Women’s Library, where the British Library took responsibility for the technical issues and the selection of sites was shared between the two libraries. The aim was a simple one: to create a new resource for future researchers and to ensure that valuable information currently on the web about women is not lost. The project started in autumn 2005 with the goal of archiving 100 relevant sites over a six-month period. Selection of sites was coordinated, each library focusing on particular categories such as: sites with research content, women’s organizations and campaigns, personal sites of women, women’s e-zines and blogs, etc. The permissions success rate was again low, approximately 30%, but by spring 2006 the target of 100 sites had been reached.
The sites are visited every six months and a further harvest made.
Work continues on other collection strands including Avian and pandemic influenza, the Olympics 2012 and Countryside issues. It is felt that the building up of collection-based resources will be particularly significant for researchers in the future, in particular in view of the ephemeral nature of much of the content gathered.
The work of UKWAC has demonstrated a low level of successful permissions to archive from website owners and has thereby raised real concerns about the viability and sustainability of web archiving on a voluntary rights-cleared basis.
The percentage of permissions granted as a proportion of permissions sought ranges from just above 26% for the British Library, to approximately 39% for both the National Library of Wales and the National Library of Scotland, and 45% for the Wellcome Trust Library. Outright refusals have been less than 1%, the remainder comprising non-responses.
13.6 Regulation
Some analysis has been carried out on why web sites may choose not to respond.
This revealed the following possible factors: the legalistic nature of the permissions request; multiple and collective rightsholders of website material and the understandable reluctance to give clearance for third-party content on a site; the perception by rightsholders that their site may not be relevant or even worthy of archiving; and a range of more administrative reasons, such as requests sent to the wrong address (as rightsholders are sometimes difficult to identify), lack of resource to respond, low priority given to the request, deletion of unfamiliar requests as spam, etc.
The situation demonstrates the need for early Regulation to enable secondary legislation to be put in place to allow legal deposit libraries to harvest relevant UK sites without having to seek permissions from website owners. The work on preparing for Regulations is being carried out by the Legal Deposit Advisory Panel, an advisory non-departmental public body set up to advise the Secretary of State on the necessary requirements and wording of Regulations which will make this possible. The Panel has been looking closely at the work of UKWAC and is fully aware of the labour-intensive nature of the permissions work and low returns as well as of the risk of loss of material on the web as sites edit, delete or revise their content. It is hoped that there will be a rapid path for Regulation, but even if this process is accelerated the statutory impact assessment and consultation processes suggest that it is unlikely that any secondary legislation will come into force before 2008.
13.7 National Web Archiving Strategy
UKWAC has laid the foundations of a national web archiving strategy and a shared technical infrastructure for the UK. As well as continuing the work of collecting and making available UK web sites, the pilot project has been extended to allow UKWAC to focus on two key areas of strategic development: the future coordination of UK web archiving at national level, taking into consideration the requirements of legal deposit legislation; and the evaluation and appraisal of software, platform and tools to ensure a future shared technical infrastructure for web archiving.
The technical development work is being carried out largely under the auspices of IIPC. (Current IIPC membership comprises: Bibliothèque nationale de France, National Library of Italy – Florence, Helsinki University Library – the National Library of Finland, Royal Library – National Library of Sweden, National and University Library – Iceland, Library and Archives Canada, National Library of Norway, National Library of Australia, the British Library and the Library of Congress.)
13.8 Global Exchange and Collaboration
IIPC’s stated mission is to preserve and make accessible knowledge and information from the internet for generations everywhere, promoting global exchange and international relations. More specifically its goals, as defined on http://www.netpreserve.org/ are:
“to enable the collection of a rich body of Internet content from around the world to be preserved in a way that it can be archived, secured and accessed over time; to foster the development and use of common tools, techniques and standards to enable the creation of international archives; to encourage and support national libraries everywhere to address Internet archiving and preservation”.
It is within the context of the development and use of common tools, techniques and standards, that the British Library is involved in two important developments: the Web Curator Tool (WCT) and the automated smart crawler.
The WCT is a joint project undertaken by the National Library of New Zealand and the British Library. It is a tool for managing the selective web harvesting process and is designed for use in libraries by non-technical users, allowing them easily to capture web content for archival purposes. It was developed during the first half of 2006, tested, and then released at http://webcurator.sourceforge.net.
The WCT supports harvest authorization, i.e. getting permission to harvest web material and make it available; selection, scoping and scheduling; description; downloading the material at the appropriate time with the Heritrix web harvester; quality review to ensure the harvest worked as expected; and submitting the harvest results to a digital archive.
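The functions listed above can be pictured as a sequence of stages through which each harvest target passes. The following is a minimal sketch only, and not the WCT's actual data model: the names Stage and Target are hypothetical, and the strict ordering of stages (for example, that nothing is harvested before authorization is recorded) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    """Hypothetical stage names, following the functions listed in the text."""
    AUTHORIZATION = 1     # permission to harvest and make available
    SELECTION = 2         # selection, scoping and scheduling
    DESCRIPTION = 3
    HARVEST = 4           # download with the web harvester
    QUALITY_REVIEW = 5    # check the harvest worked as expected
    SUBMISSION = 6        # hand the result to a digital archive


@dataclass
class Target:
    """One website being managed through the workflow (illustrative only)."""
    url: str
    completed: List[Stage] = field(default_factory=list)

    def next_stage(self) -> Optional[Stage]:
        """Return the first stage not yet completed, or None when finished."""
        for stage in Stage:
            if stage not in self.completed:
                return stage
        return None

    def complete(self, stage: Stage) -> None:
        """Record a stage as done; assumed rule: stages cannot be skipped."""
        if stage is not self.next_stage():
            raise ValueError(f"{stage.name} attempted before {self.next_stage()}")
        self.completed.append(stage)
```

Under these assumptions, an attempt to mark a target as harvested before authorization has been recorded raises an error, which mirrors the rights-cleared basis on which the sites are collected.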
The WCT is not a digital archive, an access tool, a cataloguing system or a document management system. What it does is to manage the range of functions performed by the web archivist by providing a consistent managed approach.
The tool is currently being tested within the British Library infrastructure and linked to an access tool. This work is being carried out as a precursor to the UKWAC evaluation of tools and software to replace the existing PANDAS software.
The need for greater automation of large-scale processes, linked to avoidance of duplication and redundant effort, lies behind the smart crawler tool project being undertaken by the British Library, the Library of Congress, the Bibliothèque nationale de France and the Internet Archive. A tool is needed that will be able to collect large numbers of websites (running into millions) and which can be set up in such a way that it can prioritize the collection of certain websites over others, according to an agreed collection policy. The scope of the project, to be led by the British Library, is the development of the Heritrix crawler to give it the ability to: recognize when resources have not changed since the previous crawl; prioritize the order in which resources are visited and gathered from the internet; recognize when resources on the internet are changing more frequently and visit and harvest those resources more frequently; perform testing to ensure that the new capabilities are scalable up to at least 100 million resources; and ensure that the web archiving community can take advantage of the new capabilities. Currently at the shaping stage, the project is planned for completion in late 2008.
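The first three capabilities described above can be illustrated with a small scheduling sketch. It is not Heritrix code and does not reflect how the project will implement them: the class SmartScheduler, the use of a SHA-256 content hash to detect change, and the rule of halving or doubling the revisit interval are all assumptions chosen for illustration. The fetch function is supplied by the caller, so the sketch makes no network requests itself.

```python
import hashlib
import heapq
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ResourceState:
    """What the scheduler remembers about one URL between visits."""
    priority: int            # lower number = more important under the policy
    interval: float = 24.0   # hours until the next visit (assumed default)
    digest: str = ""         # hash of the content seen on the last visit


class SmartScheduler:
    MIN_INTERVAL = 1.0           # hours; assumed bounds
    MAX_INTERVAL = 24.0 * 30

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch
        self.state: Dict[str, ResourceState] = {}
        # Heap of (due_time, priority, url): earliest due first,
        # ties broken by collection-policy priority.
        self.queue: List[Tuple[float, int, str]] = []

    def add(self, url: str, priority: int, now: float = 0.0) -> None:
        self.state[url] = ResourceState(priority=priority)
        heapq.heappush(self.queue, (now, priority, url))

    def visit_next(self) -> Tuple[str, bool]:
        """Visit the most urgent URL; return it and whether it changed."""
        due, priority, url = heapq.heappop(self.queue)
        st = self.state[url]
        digest = hashlib.sha256(self.fetch(url)).hexdigest()
        first_visit = st.digest == ""
        changed = digest != st.digest
        if not first_visit:
            if changed:
                # Changing resources are revisited more often.
                st.interval = max(self.MIN_INTERVAL, st.interval / 2)
            else:
                # Unchanged resources are revisited less often.
                st.interval = min(self.MAX_INTERVAL, st.interval * 2)
        st.digest = digest
        heapq.heappush(self.queue, (due + st.interval, priority, url))
        return url, changed
```

In this sketch a resource that is found unchanged need not be stored again, and its next visit is pushed further into the future, which is one simple way of avoiding the duplication and redundant effort mentioned above.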
A third initiative in which the British Library is involved is the Automated Content Access Protocol (ACAP) pilot project led by Rightscom working with a number of international publishing bodies. The aim of the project, as defined by Rightscom, is to develop and implement a technical specification which will allow the publisher of a website or any piece of content to attach extra data, in a standardized form, to specify what uses of that piece of content or of the website are permissible. This may have particular relevance to web archiving by facilitating the means of identifying the eligibility of sites to be harvested through legal deposit. It has the added advantage of being a project where libraries, through the British Library, are working together with publishers to find an agreed technical solution to a rights management issue.
13.9 Further Issues
As these collaborative projects develop in the complex world of web archiving, the web continues to grow both in scale and technical sophistication. There are issues of territoriality to be resolved – how can the nature of the UK domain be determined from a legal deposit perspective? There are issues of size – how many UK web sites are there and how many will there be in five or ten years