Web Archives and Born-Digital Sources Workshop: Challenges, Future Steps, and the Field

On June 8th, I had the pleasure of attending “Born digital big data and approaches for history and the humanities,” a workshop hosted by the University of London’s School of Advanced Study. You can see the full program of the day here. The workshop belongs to an AHRC research network in which I take part.

With Peter Webster and Jason Webber, I participated in a roundtable discussion on web archives, moderated by Jane Winters. Jane asked us four questions:

  1. What do you think is unique about web archives, particularly in relation to other types of born digital data?
  2. What are the key challenges facing researchers when working with web archives?
  3. What should we be doing that we’re not currently doing, in order to ensure that web archives can be accessed now and in the future? What are the barriers?
  4. Talk about the most interesting project/piece of research you’ve been involved with.

I had a few responses:

What do you think is unique about web archives, particularly in relation to other types of born digital data?

I saw five things as standing out about web archives.

  • Largely-unstructured content: While web archives have a lot of structure in the form of their metadata, such as hyperlinks, crawl information, elements used, and sometimes the <meta> tags themselves, the juicy historical information inside is generally unstructured and varied: text, images, and the like.
  • Widely divergent content: Web archivists are currently in a cat-and-mouse game with new standards – from Flash (presenting tremendous issues and largely lost) to problems like user-interaction (tackled with things like Umbra). We are also now seeing information coming in from APIs, such as social media (archiving Twitter is different than web archiving, but I think of them as peas in a pod).
  • Sheer scale: Lots of born-digital data is out there, but web archives aren’t small. We’re talking petabyte scale.
  • Relatively undeveloped researcher ecosystem: There aren’t too many researcher platforms out there, aside from Archive-It Research Services, ArchiveSpark, and Warcbase. While there’s a commercial imperative for working with many other kinds of born-digital sources, e.g. news sources which businesses and others can fruitfully exploit, that’s not quite there for web archives. This means that the technology ecosystem around web archives is developed by academics, national libraries, and researchers… not Google.
  • Content creator types: I’ve rambled on about this before, but content is created by many different types of people, e.g. Twitter users, bloggers, etc. Lots of everyday people, jumbled in with government officials and businesses.

What are the key challenges facing researchers when working with web archives?

  • See above. But a few things stand out:
  • Training: Social scientists and humanists are going to need training in how to fruitfully work with this material at scale – it’s a constantly evolving ecosystem, which suggests to me that general-purpose computing skills are more important than specific platforms.
    • This is also going to require training in ethics.
  • Funding: Yeah, yeah, another humanist kvetching about the need for more funding. But I do think that this requires a different approach to grant-writing and funding for historians. To fruitfully work with web archives at scale, you probably at minimum need a beefy server or sufficient funds to spin up cloud-based storage and computing power (the computing power might not be too expensive, but given the size of the data and not wanting to slurp it down from an archive every time you spin up a machine, the storage can get expensive quickly). To work with a 500GB web archive, you’d probably really want at least 2-3TB of storage, plus 16GB of RAM (ideally 64). Scale up from there.
  • Organizational homes for historians who use web archives: Historians and humanists who do work with web archives don’t yet have a natural home. There are a few options.
    • The International Internet Preservation Consortium (IIPC) might be evolving into the home of researchers, especially their conference (Peter Webster and I are on the program committee for the 2017 meeting in Lisbon), but the organization itself has to balance the needs of institutional members with researchers – it does not (yet?) offer individual memberships, for example (and for good reasons).
    • Computer science conferences like WebSci and JCDL are great, but the field is a different kettle of fish: without CS collaborators, the barrier for a historian to get a short or long paper accepted is steep, and it can be hard explaining to humanist tenure & promotion committees what a CS conference entails. But we can’t ask these conferences to water down their standards for us humanists either.
    • Digital humanities (the conference) might be a good venue, but history is still somewhat niche there.
    • Mainstream history is the natural home, and the AHA is very receptive to digital stuff, but for technical discussions it isn’t always the best fit.
    • What does this mean? I think we’ll probably see IIPC continue to evolve and become the natural home for this – this is one of my big goals as an IIPC PC member – but we may need a standing digital history conference to hash some of these things out.
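
As a rough sketch of the sizing figures in the funding point above: the only numbers taken from this post are the 500GB example, the 2-3TB of storage, and the 16GB (ideally 64GB) of RAM. The function name, the decimal units (1TB = 1000GB), and above all the linear scaling of storage with archive size are my own assumptions for illustration, not a tested rule.

```python
# Rough sizing sketch. Only the 500GB -> 2-3TB storage and 16GB/64GB RAM
# figures come from the post; the linear scaling is an assumption.

BASELINE_ARCHIVE_GB = 500
BASELINE_STORAGE_TB = (2.0, 3.0)   # low and high working-storage estimates
BASELINE_RAM_GB = (16, 64)         # minimum and ideal RAM for the baseline


def estimate_storage_tb(archive_gb: float) -> tuple[float, float]:
    """Scale the 2-3TB working-storage range linearly with archive size.

    Hypothetical helper: assumes storage needs grow in direct proportion
    to the size of the archive, which may not hold for every workflow.
    """
    if archive_gb <= 0:
        raise ValueError("archive_gb must be positive")
    factor = archive_gb / BASELINE_ARCHIVE_GB
    low, high = BASELINE_STORAGE_TB
    return (low * factor, high * factor)


if __name__ == "__main__":
    for size_gb in (500, 1000, 5000):
        low, high = estimate_storage_tb(size_gb)
        print(f"{size_gb}GB archive: roughly {low:.0f}-{high:.0f}TB of storage")
    ram_min, ram_ideal = BASELINE_RAM_GB
    print(f"RAM for the 500GB baseline: {ram_min}GB minimum, {ram_ideal}GB ideal")
```

RAM is deliberately left unscaled here, since how memory needs grow depends heavily on the tools and analyses involved.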

What should we be doing that we’re not currently doing, in order to ensure that web archives can be accessed now and in the future? What are the barriers?

  • In-person training events: Part of this is just because we ran a Software Carpentry workshop a few weeks ago in Waterloo, but I’m a real convert. The Programming Historian is a great gateway, but bringing people together in hackathons, workshops, etc. can help people get over that first barrier. This is the concept behind Archives Unleashed.
  • Recognizing barriers to funding and beginning to study them: If historians need more support for this material, it’s incumbent on us to make the case to granting councils and others. This is so nationally specific, of course. I don’t think this is about asking for money, but rather about finding barriers to this work and trying to fix them.
  • Greater connections between scholars and libraries: Some of my best collaborations have been with librarians – I don’t know what I’d be doing if I didn’t have the ability to research, publish, and work with Nick Ruest at York University, for example (probably losing data left and right).

Talk about the most interesting project/piece of research you’ve been involved with.

  • The 2015 Canadian Federal Election on Twitter: If you wrote a letter to the editor about an election in the 1960s, there was a good chance you ended up cited in somebody’s dissertation, because we simply never had enough information about everyday people. But suddenly, the horizons are broadened and we have hundreds of thousands of unique users contributing their thoughts on Twitter. I’m really proud of the Code4Lib article that Nick and I wrote.
  • GeoCities: Seven million users, 186 million unique URLs between 1994 and 2009. I can’t imagine a bigger collection of social historical sources, at least until the late 1990s.

It was a great event. I won’t be attending our next Network event in October because of happy family news (my partner and I are expecting our first child in mid-October!) but I look forward to joining everybody again for a day-long workshop and wrap-up in February 2017.
