Web Archive Legal Deposit: A Double-Edged Sword

I couldn’t take pictures of the web archive, but here’s the lineup to enter the archive at 9:30am!

I’ve heard so much about legal deposit in the context of web archiving, and have been enthralled with what it represents: a recognition that born-digital sources are today’s documentary record, that we need to do more to preserve them, and an institutional and legal commitment to make sure that happens. If we’d had non-print legal deposit in 2008, historians might today be studying AOL Hometown, one of the early mass deletions on the Web.

But I knew that legal deposit came with some restrictions. In return for the legal authority for collecting libraries to collect all of this information, they were bound by many of the restrictions placed on print books: on-site consultation only, limitations on reproduction, and a maximum of one person at a time viewing a website.

I wondered how this would all work out, so on my way back from the Web Archives as Scholarly Sources conference in Denmark, I decided to make a quick two-day stop in London. There, I had the opportunity to stop by the UK Web Archive at the British Library. Helen Hockx-Yu, the Head of Web Archiving there, gave me a guided tour of the Web Archive and an opportunity to see both the user-facing interface as well as their back end. It helped complicate some of my views.

The User Experience: A Mixed Bag

If you want to view the UK’s legal deposit web archive, you need to physically go to one of their six legal deposit libraries: the British Library at King’s Cross in London, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries at Oxford University, the University Library of Cambridge University, or the Library of Trinity College, Dublin. Armed with a reader pass, you can go into one of the reading rooms and sit down at one of their reference terminals.

First, you’ll note that I have no pictures in this post – these computers are emblazoned with a very strict “no photography” rule.

In short, however, you load up the legal deposit web archive on the terminal, and are confronted with a search portal that allows you to do faceted keyword searches on the collection. You enter your search terms, and the results are sorted by crawl date: you can decide to view the oldest or the newest first. You can then refine by a number of automatically or manually generated metadata fields: content type (HTML, PDF, etc.), crawl year (1996, 1997, etc.), domain (.uk, .com, etc.), domain suffix (e.g. bl.uk, very handy in the UK web sphere, which has these, unlike Canada), or author (automatically generated).

Even with these facets, however, users will often be confronted with thousands or tens of thousands of results.
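
To make the faceting concrete, here is a minimal Python sketch of how this kind of search behaves from the user's side. It is not the British Library's implementation: the Record fields, the search and facet_counts helpers, and the sample records are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One archived page. The field names are assumptions, not the BL's schema."""
    url: str
    content_type: str   # e.g. "HTML", "PDF"
    crawl_year: int     # e.g. 1996
    domain: str         # e.g. ".uk", ".com"
    text: str


def search(records, term, newest_first=False, **facets):
    """Keyword match, narrow by any facet values given, then sort by crawl year."""
    hits = [r for r in records if term.lower() in r.text.lower()]
    for name, value in facets.items():
        hits = [r for r in hits if getattr(r, name) == value]
    return sorted(hits, key=lambda r: r.crawl_year, reverse=newest_first)


def facet_counts(hits, name):
    """Count results per facet value, as shown beside a list of search results."""
    return Counter(getattr(r, name) for r in hits)


# Made-up sample data, purely for illustration.
RECORDS = [
    Record("http://www.bl.uk/a", "HTML", 1996, ".uk", "web archive launch"),
    Record("http://www.bl.uk/b.pdf", "PDF", 1997, ".uk", "archive policy"),
    Record("http://example.com/", "HTML", 1997, ".com", "an archive of recipes"),
    Record("http://example.co.uk/", "HTML", 1998, ".uk", "nothing relevant"),
]

if __name__ == "__main__":
    hits = search(RECORDS, "archive", newest_first=True, content_type="HTML")
    for r in hits:
        print(r.crawl_year, r.url)
    print(facet_counts(search(RECORDS, "archive"), "domain"))
```

Each facet only ever narrows the result set, which is why a broad keyword on a collection this size can still leave you with thousands of hits.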

This is a good thing. Many of these results may not be held by the Internet Archive, they’re done in a regimented fashion, and they have the legal and institutional backing of the British Library. In this respect, legal deposit is doing its job: it has shocking amounts of information, hopefully perpetually stored. In decades, just having this material will be critical. If you don’t collect it, it dies.

It was also all incorporated within the British Library’s broader PRIMO search cataloguing, giving these websites institutional heft and aiding their discovery. For highly-targeted queries, this is fantastic.

Let’s quickly recap what’s fantastic about this:

They have the content! Seriously, I don’t want anything I write to understate this critical point. Access is a problem today, as you’ll see, but if you don’t have the material you’ll never have any access.
The content is discoverable: You can find it in PRIMO, or you can find it in their Wayback Machine instance. You have full-text search, something which you don’t currently have on the Internet Archive collections. The faceted search options are robust.
They’ve been able to do some other neat things: While a bit separate, the team is amazing: they’re leaders within the IIPC, the Shine search interface is fantastic – I’m using it right now – and they’re moving the yardsticks dramatically forward in this field.

But, after noting that we have the material now, there are some downsides.

Harder to Research than in a Traditional Archive

As I started navigating this archive, a few things began to pop out at me. Note that these aren’t endemic to all legal deposit collections – others, such as the collection at the Bibliothèque nationale de France, require only an onsite visit. But the UK interpretation of non-print legal deposit has been very restrictive:

It runs in a virtual machine: You’re running a virtual machine within a standard web browser. You can’t copy and paste anything outside of that virtual machine, and things are a bit slower than they would be. Most importantly, every time you fire up the archive, you’re confronted with a copyright warning message. It only takes a few seconds to dismiss, but it would get old pretty quickly.
The pages are statically rendered: I gather that this is because of copyright issues, but the pages are not rendered in a manner akin to the traditional Wayback Machine. Instead, they’re almost generated images with hyperlinks. Dynamic content is all disabled.
You can’t see hyperlinks that you’re clicking on: When you run your mouse over a link, the hyperlink doesn’t appear. So if you click on a resource and get a “404 Not Found” error, you don’t know what you’re missing! It would be nice to be able to pop over to the Internet Archive to compare things.
You can’t take photos! This is the most critical issue. When I go to a traditional archive, I take literally thousands of digital photographs which I organize when I go home. I’m not alone: archives are now full of historians in poor postures. If I want to take notes in this web archive, I either have to have my laptop propped on my own lap (an ergonomic nightmare) or pay 26 pence/page to print content. That’s 52 Canadian cents a page. My research process would be slower using this archive than any other traditional archive that allowed digital photography.
No two users can view a page at the same time: We tested this, and it’s true! Given the sheer size of these collections, and the amount of effort that must have taken, all I can say is that I’m sorry to the folks who had to implement this.
Can’t view the source code of the page: I do tons of work with the source code of pages, especially in early collections. You can’t view it here.
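
To put the printing cost in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the 2:1 exchange rate implied by the figures above (26 pence is roughly 52 Canadian cents); the function name and the page counts are illustrative only.

```python
PENCE_PER_PAGE = 26        # printing price quoted above
CAD_CENTS_PER_PENNY = 2    # assumption: the rate implied by "26 pence = 52 Canadian cents"


def print_cost_cad_cents(pages):
    """Cost of printing `pages` pages, in Canadian cents (integers avoid float rounding)."""
    return pages * PENCE_PER_PAGE * CAD_CENTS_PER_PENNY


if __name__ == "__main__":
    for pages in (2, 100, 1000):
        cents = print_cost_cad_cents(pages)
        print(f"{pages:>5} pages: ${cents // 100}.{cents % 100:02d} CAD")
```

At that rate, the thousand or so captures I would normally photograph for free on a single archive trip would run to several hundred dollars in printing.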

I’d always thought it was a bit funny that you’d have to physically go somewhere and use your own laptop to do research at a legal deposit library, but visiting really underscored this. I could research hundreds of times quicker if these websites were printed off and in banker’s boxes than if I was using this interface.

That’s wrong!

I left the reading room wondering about whether legal deposit, at least in the near future, is really such a good thing. They’ve got a great team collecting an unparalleled collection of data, but can’t really provide access to it in a meaningful way.

They’re governed by print metaphor, and – essentially – that doesn’t work.

We then went into Helen’s office to take a look at the backend of the legal deposit archive, and my morale rose.

The Backend: w3act in action

After a quick opportunity to meet part of the UK Web Archive team (part of the team is also based at the West Yorkshire branch), Helen showed me the backend. They use the w3act annotation and curation tool, which is also available online for anyone to build. Researchers at other universities can even get w3act accounts with the British Library, letting them build collections and add metadata, but they understandably do not have off-site access to the legal deposit collection!

In a nutshell, while the entirety of the UK web domain is crawled once a year, many curators, researchers, and other staff may be interested in collecting material more frequently. They also may want to take sets of websites and group them into collections, to facilitate discoverability and also to capture pivotal events.

We did a few things!

Building a Collection: Anybody with a w3act account can propose a collection. For example, there is a Nelson Mandela collection: websites relating to his life and death can be added to the collection. They then appear in the main reading room search engine as such, and appear as an additional facet.
Override Crawl Settings: You can manually add websites, or tweak their crawl settings. If there’s a site that you know should be in the collection scope and it’s not, you can fill out a form and insert it into scope.
Provide Robust User-Generated Metadata: With these overrides, you can decide to add a website – it does an automated check to make sure it can be crawled – and then add subject, collection, and other metadata. You can decide to make it a higher priority, flag quality assurance issues, or provide a short description. You hit a big green button to archive the site now, which I liked.

If a crisis broke out, you could quickly decide to scrape a bunch of relevant sites and build a collection around it. Indeed, there are a few hundred “key sites” that are crawled on a very frequent basis, such as government webpages, society and culture ones, or pivotal cultural sites.

Conclusions

No question about it: the BL is doing amazing things. They’re collecting epic amounts of information, are working at organizing it in a sensible way, and have provided reasonable access for people with targeted queries.

But the print metaphor, unfortunately, simply doesn’t work. Viewing one page at a time, constantly agreeing to copyright terms, and paying one Canadian dollar every time I want to print two pages mean that a simple research question could take weeks.

If a doctoral student working under me wanted to do their PhD on a topic involving this collection, I would warn them against it: it would be extremely difficult to pull off a time-limited project with this collection. Print sources are much quicker to use.

Which, if you know me, you know hurts my heart to say.

All hope is not lost, though: the back-end is robust, the collections are there. Maybe in a few decades this material can be unleashed and researchers can use them to their full potential?

My sincerest thanks to Helen Hockx-Yu for taking the time to show me around this archive. It was a fantastically rewarding experience.

12 thoughts on “Web Archive Legal Deposit: A Double-Edged Sword”

Daniel Tobias says:

14 July 2015 at 9:38 am

It must have taken great effort on the part of their technical people to make digital resources even more cumbersome to use than print.

1. Ian Milligan says:
  
  14 July 2015 at 11:12 am
  
  I gather that it did – it’s too bad that such talented people had to spend time on that, rather than their core issues of preserving and providing access to these great collections!
  
Niels Brügger (@NielsBr) says:

14 July 2015 at 11:04 am

Thanks for the summary, Ian. However, the issues you raise are not inherent in legal deposit as such. National differences apply: the Danish Netarkivet (also legal deposit) has online access for several users simultaneously, you can paste out of the archive, webpages are not statically rendered, and you can view the source code. The big advantage of legal deposit is that the web archive is funded and taken care of ‘for eternity’.

1. Ian Milligan says:
  
  14 July 2015 at 11:11 am
  
  Thanks Niels – I tried to sneak a proviso in there above (“these aren’t endemic to all legal deposit collections”) and keep it focused on the BL, but the title could certainly be more focused on the British example itself.
  
  If anything, I hope that the BL can use the example of the Netarkivet when reviewing their practices! I was unclear when exploring their website: one needs to be a registered researcher to have online access to the collection? Is it difficult to get such status or is it granted for most/all reasonable requests?
  
Brewster Kahle says:

15 July 2015 at 7:39 am

How would you compare BL’s or other national library systems to using the Internet Archive’s Archive-it or Wayback Machine? I am curious because I work with the Internet Archive.

1. Ian Milligan says:
  
  15 July 2015 at 9:09 am
  
  Hi Brewster – thanks for the comments and great questions.
  
  The browsing experience for both – via the Wayback Machine – is great for closely reading websites: dynamic content is sometimes preserved (depending on a number of factors on both ends), temporal coherence is largely preserved during the browsing experience, and most importantly because it’s web accessible users can use it as they see fit.
  
  When I’ve done research with the Wayback Machine, I’ve sometimes used it as a traditional historian might – clicking through pages and taking notes, saving a copy when appropriate – as well as interacting programmatically with the collections.
  
  The downside of archive.org’s Wayback Machine, compared with the BL’s faceted search, is that to discover a site you largely need to know the URL. In periods where the Yahoo! directory is strong, for example, I often use that as my main avenue to find a website (say a site relating to concerned parents in 1998). But as the Yahoo! directory declines, I have to find the URL elsewhere and that gets challenging. Apart from very specific queries, I don’t actually think a full-text search function would magically fix this.
  
  For Archive-It, they offer a pretty similar default experience as the BL, especially with full-text search, minus the restrictions posited above (i.e. I can research using these collections, save copies/PDF/screenshots/etc. at my pleasure). The most promising part of Archive-It to my biased self is the Archive-It Research Services: the WAT files give me the ability to use metadata to discover hubs of activity or important files, and the WARCs themselves have a ton of opportunity.
  
  This of course requires a great relationship with the host institution (I’ve been fruitfully working with Archive-It and the University of Toronto Libraries). But when that’s possible, Archive-It really shines through.
  
  For what it’s worth, we’ve been having a lot of success working with the WATs/WARCs – using warcbase (https://github.com/lintool/warcbase/wiki). I think the Wayback Machine is a great, accessible tool for the majority of endusers, but historians undertaking big studies will need access to more metadata. In a dream world, I think the Wayback Machine coupled with robust derivative metadata – i.e. a network graph/key terms/topics/entities/etc. – would help a historian find better needles in the haystack?
  
Francesca Ortolano (@OrtFrancesca) says:

15 July 2015 at 10:08 am

In May in Italy, the National Italian Association of Archivists (ANAI) organised the annual workshop “Digital Records. Beyond the Norm to Share Good Practice”, focused on “Web Archiving. On Web as Universitas Rerum: Selecting, Describing, Preserving”. Helen Hockx-Yu from the BL and Sara Aubry from the Bibliothèque nationale de France were two of the speakers. It made for an interesting comparison between the two models. Papers are available here: http://www.documento-elettronico.it/workshop/workshop-2015/atti-della-giornata.
Hope it’s of interest.
Francesca Ortolano – Italian Archivist

1. Ian Milligan says:
  
  15 July 2015 at 3:09 pm
  
  These are fascinating presentations: thanks so much for sharing them, Francesca, it’s really appreciated (and I’m sure other readers will find them interesting too).
  
Fiona Laing says:

20 July 2015 at 11:38 am

Reblogged this on SWOP Forum.

Trevor Thomson says:

30 September 2015 at 4:38 am

Hi Ian, what a very interesting piece. As a student of librarianship I did a lot of work on the legislation leading up to the implementation of non-print legal deposit in the UK and Ireland – and you touched on the issue in your article, collecting the material was, and is, key.
The original legislation was passed in 2003 (after many years of wrangling and argument about non-print material) and the enabling legislation took a further ten years to be passed before anything could actually be collected under the legal deposit privilege. The drawbacks you outline, which I agree are drawbacks, are a consequence of getting the ability to collect over the legal hurdles. You are absolutely right, remote and multi-user access are desirable but without the restrictions of the ‘print-metaphor’ (an excellent phrase) I doubt the UK and Ireland would have the privilege they have now. It’s frustrating but I recall publishers of digital material making comparison to digital music which was so easily pirated because of lack of control – free access in a legal deposit library was a major worry!
