I’ve heard so much about legal deposit in the context of web archiving, and have been enthralled with what it represents: a recognition that born-digital sources are today’s documentary record, that we need to preserve more of it, and an institutional and legal commitment to make sure that happens. If we’d had non-print legal deposit in 2008, historians might today be studying AOL Hometown, one of the early mass deletions on the Web.
But I knew that legal deposit came with some restrictions. In return for the legal authority for collecting libraries to collect all of this information, they were bound by many of the restrictions placed on print books: on-site consultation only, limitations on reproduction, and a maximum of one person at a time viewing a website.
I wondered how this would all work out, so on my way back from the Web Archives as Scholarly Sources conference in Denmark, I decided to make a quick two-day stop in London. There, I had the opportunity to stop by the UK Web Archive at the British Library. Helen Hockx-Yu, the Head of Web Archiving there, gave me a guided tour of the Web Archive and an opportunity to see both the user-facing interface as well as their back end. It helped complicate some of my views.
The User Experience: A Mixed Bag
If you want to view the UK’s legal deposit web archive, you need to physically go to one of their six legal deposit libraries: the British Library at King’s Cross in London, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries at Oxford University, the University Library of Cambridge University, or the Library of Trinity College, Dublin. Armed with a reader pass, you can go into one of the reading rooms and sit down at one of their reference terminals.
First, you’ll note that I have no pictures in this post – these computers are emblazoned with a very strict “no photography” rule.
In short, however, you load up the legal deposit web archive on the terminal, and are confronted with a search portal that allows you to do faceted keyword searches on the collection. You enter your search terms, and the results are sorted by crawl date: you can decide to view the oldest or the newest first. You can then refine by a number of automatically or manually generated metadata fields: content type (HTML, PDF, etc.), crawl year (1996, 1997, etc.), domain (.uk, .com, etc.), domain suffix (e.g. bl.uk, very handy in the UK web sphere, which has these, unlike Canada), or author (automatically generated).
Even with these facets, however, users will often be confronted with thousands or tens of thousands of results.
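To make the idea of faceted refinement concrete, here is a minimal sketch in Python. It is not the British Library’s implementation: the records, field names, and function names are all made up for illustration, and a real system would sit on top of a full-text index rather than a list in memory.

```python
from collections import Counter

# Hypothetical records; the field names loosely mirror the facets described above.
RECORDS = [
    {"url": "http://www.bl.uk/a", "content_type": "HTML", "crawl_year": 2013,
     "domain_suffix": "bl.uk", "text": "web archive launch"},
    {"url": "http://www.bl.uk/b.pdf", "content_type": "PDF", "crawl_year": 2014,
     "domain_suffix": "bl.uk", "text": "web archive annual report"},
    {"url": "http://www.example.co.uk/", "content_type": "HTML", "crawl_year": 2014,
     "domain_suffix": "example.co.uk", "text": "archive of local news"},
    {"url": "http://www.example.co.uk/about", "content_type": "HTML", "crawl_year": 2013,
     "domain_suffix": "example.co.uk", "text": "about us"},
]


def search(records, keyword, **facets):
    """Keep records whose text contains the keyword and that match every facet."""
    hits = [r for r in records if keyword.lower() in r["text"].lower()]
    for field, value in facets.items():
        hits = [r for r in hits if r[field] == value]
    return hits


def facet_counts(records, field):
    """Count how many records fall under each value of one facet."""
    return Counter(r[field] for r in records)


def sort_by_crawl_year(records, newest_first=True):
    """Order results by crawl year, newest or oldest first."""
    return sorted(records, key=lambda r: r["crawl_year"], reverse=newest_first)


hits = search(RECORDS, "archive")
print(facet_counts(hits, "content_type"))
print(search(RECORDS, "archive", content_type="HTML", crawl_year=2014))
```

Each facet simply narrows the previous result set, which is why a broad keyword can still leave you with thousands of hits even after a few refinements.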
This is a good thing. Many of these results may not be held by the Internet Archive, they’re done in a regimented fashion, and they have the legal and institutional backing of the British Library. In this respect, legal deposit is doing its job: it has shocking amounts of information, hopefully perpetually stored. In decades, just having this material will be critical. If you don’t collect it, it dies.
It was also all incorporated within the British Library’s broader PRIMO search cataloguing, giving these websites institutional heft and aiding their discovery. For highly-targeted queries, this is fantastic.
Let’s quickly recap what’s fantastic about this:
- They have the content! Seriously, I don’t want anything I write to understate this critical point. Access is a problem today, as you’ll see, but if you don’t have the material you’ll never have any access.
- The content is discoverable: You can find it in PRIMO, or you can find it in their Wayback Machine instance. You have full-text search, something which you don’t currently have on the Internet Archive collections. The faceted search options are robust.
- They’ve been able to do some other neat things: While a bit separate, the team is amazing: they’re leaders within the IIPC, the Shine search interface is fantastic – I’m using it right now – and they’re moving the yardsticks dramatically forward in this field.
But, after noting that we have the material now, there are some downsides.
Harder to Research than in a Traditional Archive
As I started navigating this archive, a few things began to pop out at me. Note that these aren’t endemic to all legal deposit collections – others, such as the collection at the Bibliothèque nationale de France, require only an on-site visit. But the UK interpretation of non-print legal deposit has been very restrictive:
- It runs in a virtual machine: You’re running a virtual machine within a standard web browser. You can’t copy and paste anything outside of that virtual machine, and things are a bit slower than they would be. Most importantly, every time you fire up the archive, you’re confronted with a copyright warning message. It only takes a few seconds to dismiss, but it would get old pretty quickly.
- The pages are statically rendered: I gather that this is because of copyright issues, but the pages are not rendered in a manner akin to the traditional Wayback Machine. Instead, they’re almost generated images with hyperlinks. Dynamic content is all disabled.
- You can’t see hyperlinks that you’re clicking on: When you run your mouse over a link, the hyperlink doesn’t appear. So if you click on a resource and get a “404 Not Found” error, you don’t know what you’re missing! It would be nice to be able to pop over to the Internet Archive to compare things.
- You can’t take photos! This is the most critical issue. When I go to a traditional archive, I take literally thousands of digital photographs which I organize when I go home. I’m not alone: archives are now full of historians in poor postures. If I want to take notes in this web archive, I either have to have my laptop propped on my own lap (an ergonomic nightmare) or pay 26 pence/page to print content. That’s 52 Canadian cents a page. My research process would be slower using this archive than any other traditional archive that allowed digital photography.
- No two users can view a page at the same time: We tested this, and it’s true! Given the sheer size of these collections, and the amount of effort that must have taken, all I can say is that I’m sorry to the folks who had to implement this.
- Can’t view the source code of the page: I do tons of work with the source code of pages, especially in early collections. You can’t view it here.
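I don’t know how the one-viewer-at-a-time rule from the list above is actually implemented, but the basic idea can be sketched as a per-page lock. Everything here (the class name, the method names, the URLs, and the reader names) is hypothetical and only meant to show the behaviour we saw when we tested it.

```python
class SingleViewerLock:
    """Toy model: each archived page can be open for one reader at a time."""

    def __init__(self):
        # Maps a page URL to the reader who currently has it open.
        self._viewers = {}

    def open_page(self, url, reader):
        """Return True if the reader may view the page, False if someone else has it."""
        holder = self._viewers.get(url)
        if holder is not None and holder != reader:
            return False
        self._viewers[url] = reader
        return True

    def close_page(self, url, reader):
        """Release the page, but only if this reader is the one holding it."""
        if self._viewers.get(url) == reader:
            del self._viewers[url]


lock = SingleViewerLock()
print(lock.open_page("http://www.bl.uk/", "reader_a"))  # first reader gets in
print(lock.open_page("http://www.bl.uk/", "reader_b"))  # second reader is refused
```

The logic is trivial for one page on one machine; doing it across six libraries and a collection of this size is presumably where the real effort went.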
I’d always thought it was a bit funny that you’d have to physically go somewhere and use your own laptop to do research at a legal deposit library, but visiting really underscored this. I could research hundreds of times quicker if these websites were printed off and in banker’s boxes than if I were using this interface.
I left the reading room wondering about whether legal deposit, at least in the near future, is really such a good thing. They’ve got a great team collecting an unparalleled collection of data, but can’t really provide access to it in a meaningful way.
They’re governed by print metaphor, and – essentially – that doesn’t work.
We then went into Helen’s office to take a look at the backend of the legal deposit archive, and my morale rose.
The Backend: w3act in action
After a quick opportunity to meet part of the UK Web Archive team (part of the team is also based at the West Yorkshire branch), Helen showed me the backend. They use the w3act annotation and curation tool, which is also available online for anyone to build themselves. Researchers at other universities can even get w3act accounts with the British Library, letting them build collections and add metadata, but they understandably do not have off-site access to the legal deposit collection!
In a nutshell, while the entirety of the UK web domain is crawled once a year, many curators, researchers, and other staff may be interested in collecting material more frequently. They also may want to take sets of websites and group them into collections, to facilitate discoverability and also to capture pivotal events.
We did a few things!
- Building a Collection: Anybody with a w3act account can propose a collection. For example, there is a Nelson Mandela collection: websites relating to his life and death can be added to the collection. They then appear in the main reading room search engine as such, and appear as an additional facet.
- Override Crawl Settings: You can manually add websites, or tweak their crawl settings. If there’s a site that you know should be in the collection scope and it’s not, you can fill out a form and insert it into scope.
- Provide Robust User-Generated Metadata: With these overrides, you can decide to add a website – it does an automated check to make sure it can be crawled – and then add subject, collection, and other metadata. You can decide to make it a higher priority, flag quality assurance issues, or provide a short description. You hit a big green button to archive the site now, which I liked.
If a crisis broke out, you could quickly decide to scrape a bunch of relevant sites and build a collection around it. Indeed, there are a few hundred “key sites” that are crawled on a very frequent basis, such as government webpages, society and culture ones, or pivotal cultural sites.
No question about it: the BL is doing amazing things. They’re collecting epic amounts of information, are working at organizing it in a sensible way, and have provided reasonable access for people with targeted queries.
But the print metaphor, unfortunately, simply doesn’t work. Viewing one page at a time, constantly agreeing to copyright terms, and paying one Canadian dollar every time I want to print two pages, means that a simple research question could take weeks.
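To put the printing cost in perspective, here is the arithmetic as a small Python sketch. The 26 pence per page comes from my visit; the exchange rate is just the rough two-to-one implied by 26p being about 52 Canadian cents; and the 1,000-page figure is a hypothetical stand-in for the number of photographs I might take in a traditional archive.

```python
PENCE_PER_PAGE = 26       # price quoted in the reading room
CAD_CENTS_PER_PENNY = 2   # assumption: the rough rate implied by 26p = 52 Canadian cents


def print_cost(pages):
    """Return the printing cost as (pence, Canadian cents), using whole numbers."""
    pence = pages * PENCE_PER_PAGE
    return pence, pence * CAD_CENTS_PER_PENNY


for pages in (2, 1000):  # 1000 is a hypothetical research trip
    pence, cad_cents = print_cost(pages)
    print(f"{pages} pages: £{pence / 100:.2f} or about C${cad_cents / 100:.2f}")
```

Under those assumptions, replacing a thousand photographs with printouts would run to a few hundred pounds, before counting the time spent at the printer.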
If a doctoral student working under me wanted to do their PhD on a topic involving this collection, I would warn them against it: it would be extremely difficult to pull off a time-limited project with this collection. Print sources are much quicker to use.
Which, if you know me, you know hurts my heart to say.
All hope is not lost, though: the back-end is robust, the collections are there. Maybe in a few decades this material can be unleashed and researchers can use it to its full potential?
My sincerest thanks to Helen Hockx-Yu for taking the time to show me around this archive. It was a fantastically rewarding experience.