AHA Talk: The Promise of WebARChive Files

This paper was given at the American Historical Association’s annual meeting in New York City on January 5th, 2015. It was part of the Text Analysis, Visualization, and Historical Interpretation panel. My thanks to my co-presenters and especially Micki Kaufman, who organized the panel.

The text that follows may not be exactly what I said, but is based on my speaking notes with a bit of memory filling in here and there.

AHA Talk.001

AHA Talk.002

Hello everybody, I’d like to begin with a somewhat provocative opening:

I believe that historians are unprepared to engage with the quantity of digital sources that will fundamentally transform their trade. Web archives are going to change the work we do for a few main reasons:

AHA Talk.003
First, more data than ever before is being preserved. This is both good and bad, because we need to be able to program in order to access this material – and we’re not there yet.

Secondly, it’s going to be saved and delivered to us in very different ways.

We need to understand it in terms of both opportunity and challenge, and begin to lay the foundation for historians to recognize its importance.

AHA Talk.004

Most historians work with traditional archives, which are limited on a number of levels: physical space, the work of accession and preservation, acquisition budgets, and so on. This process has meant that historians work with scant traces of the past – 99.9% of what happens is never recorded (something that, philosophically speaking, is not going to change). Events are transitory: they happen, and unless some record ends up in an archival box or a newspaper, they are generally lost.

AHA Talk.005

The problem with these, in a word, is scarcity, as the late, great American historian Roy Rosenzweig noted.

Yet I want to suggest to you today that the arrival of something new promises to change the work that we do. And it is this:

AHA Talk.006

The WARC file, or WebARChive file. It lets us bundle up thousands of websites into a single file, which we can access computationally or through the Wayback Machine.
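
If you want a feel for what is inside one of these files, here is a minimal sketch of reading a WARC programmatically. It uses the open-source warcio Python library – my choice for illustration, not a tool discussed in this talk – and the file name is hypothetical:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate over every record bundled into a single (gzipped) WARC file.
    with open('example.warc.gz', 'rb') as stream:  # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':  # archived server responses
                url = record.rec_headers.get_header('WARC-Target-URI')
                body = record.content_stream().read()  # the archived page
                print(url, len(body))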

AHA Talk.007

Here’s the AHA back in 1997. Well, actually the Historian’s Committee for Open Debate, which used historians.org until sometime in 2004.

AHA Talk.008

So what does this mean?

Imagine a future historian taking on a central question of social and cultural life in the second decade of the twenty-first century – say, how Canadians understood Idle No More or the Occupy movement through social media. What would her archive look like?

AHA Talk.009

I like to imagine it as boxes stretching out into the distance, tapering off, without any immediate reference points to bring it into perspective.

Many of these digital archives will come without human-generated finding aids, will have perhaps no dividers or headings within “files,” and certainly no archivist with a comprehensive grasp of the collection’s contents.

AHA Talk.010

Now this all means that historians need to get involved a bit earlier, of course. All those Occupy Wall Street sites – created in the heat of a movement – didn’t have data management plans. And so, two years later, only 41% of those sites are still active.

I like to throw this in because it shows that we need to actively maintain our digital sources, and historians need to begin thinking about retaining the data that we generate today.

If I wrote a book in 1991 and put it on a shelf in my basement, and came back in 2014, chances are I could read it. I’m damn sure that if I had written a digital object, it would be unusable. We need to be active.

AHA Talk.011

But I digress. Some of it will be retained, and it’s going to be retained on a massive scale thanks to the efforts of institutions like the Internet Archive and some legal deposit institutions in Europe. Remember scarcity? We’re in the era of historical abundance.

AHA Talk.012

Take Twitter, for example. During the #IdleNoMore protests, there were an astonishing 55,334 tweets on 11 January 2013 alone. Each tweet can be up to 140 characters. Through a complicated bit of math that I whipped together, I argue that’s over 1,800 pages if we take 300 words per page. That’s a MASSIVE book, and you’ve got one for every day of the big protest.
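
For the curious, here is the back-of-envelope version of that math. Only the tweet count and the 300-words-per-page figure come from the talk; the average tweet length is my assumption:

    tweets = 55334          # tweets on 11 January 2013
    words_per_tweet = 10    # assumed average; a 140-character tweet tops out near 25 words
    words_per_page = 300    # the page size used above
    pages = tweets * words_per_tweet / words_per_page
    print(round(pages))     # roughly 1,844 pages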

You can’t read all of that yourselves; you’re going to have to learn how to program. And that, my friends, is why I hold that historians need to be leading the charge up the digital humanities hill. But we’re not.

AHA Talk.014

These sources present both a boon and a challenge to historians. If the norm until the digital era was for human information to vanish, as James Gleick has put it, “[n]ow expectations have inverted. Everything may be recorded and preserved, at least potentially.” Useful historical information is being preserved at a rate that accelerates with every passing day.

I think Gleick is perhaps overstating the point here. Remember the process I outlined earlier: an event happens, some traces of the past are left behind, and historians take those little traces to write their histories.

On a philosophical level, this is still true.

But, more traces than ever before are being left behind.

AHA Talk.015

Take this: a USENET post from 1995, an eleven-year-old boy sparking a discussion about a board game then relatively popular amongst nerds. My first trace of an online presence. It was preserved in the DejaNews archive, which Google subsequently purchased and folded into its newsgroups archive. This is not the sort of source that is generally retained.

Scale that up, and I have a database of over two million messages on my home computer – the biggest collection of non-commercialized public speech that we have.

AHA Talk.016

Or this – as Archive Team archivist Jason Scott has so eloquently put it in his own presentations – a GeoCities memorial garden, a testament to those who died before their time, collected together in a personalized, meaningful way. In this case, I’ve chosen a pet memorial because I feel uneasy about showing some of this material to everybody – we’ll briefly touch on this at the end of the presentation.

Let’s not fool ourselves: this stuff matters. Things that would never have been preserved before now are, and there are unparalleled opportunities for social historians to capture the broad context of the period they are studying.
And, most importantly, I want to underscore that I don’t think you can do justice to the 1990s and beyond if you do not consider the World Wide Web. This holds true for all branches of our historical profession.

AHA Talk.017

Political historians cannot do justice to elections without understanding the tweets, the blogs, and the websites that surround not only elections but also the everyday processes of making policy, understanding public sentiment, and reaching out to the electorate on a new level. Military historians will have the voices of rank-and-file soldiers, playing out on discussion boards and other parts of the Web. For cultural and social historians, the source base is even more evident: the interplay of online and offline culture, and the voices of everyday people that would otherwise be lost.

Yes, the Web is not a perfect democracy: many lower-income people still do not have access to it, and there are age and racial cleavages as well. We cannot forget those. But we are still expanding the percentage of ‘traces of the past’ that are preserved, and we cannot forget that either.

AHA Talk.018

So there is a lot of potential here, and several upsides, as we grapple with this shift that is affecting our profession.

But there is a pitfall. And that is, quite frankly, as I noted at the beginning of this talk: historians are not ready, as a profession, for what these changes represent.

AHA Talk.019

The big pitfall is the lack of involvement that historians have had with big projects like the Google n-gram viewer and the Culturomics project – I won’t go into detail – and some of the reluctance that we have around:

  • project work;
  • an overly specific focus on research monographs;
  • an occasional tendency to look to IT or computer science for solutions, which creates black boxes that see research going in and results coming out without our knowing how;
  • and limited grant money.

So we have potential, but the pitfalls are that we as a profession are not completely ready for what it represents.

AHA Talk.020

So what can we do?

One major source that I am using in my own work is the 80TB Wide Web Scrape. It’s an amazing resource: the entire results of the March – December 2011 scrape of the World Wide Web. I want to just quickly go through some of the work that might confront us.

This is an example of the kinds of archives that we need to use.

AHA Talk.021

Assembling this dataset was not a straightforward undertaking. The WARC files are provided as a set of roughly 85,570 files, each one gigabyte in size. Downloading the whole thing is beyond the means of the humanities department that I am a part of, even with Canadian federal government funding. Luckily, there is an index to the websites archived in the scrape, forming a rudimentary finding aid. These are CDX files, arranged as a series of lines, each similar to this:

ca,yorku,justlabour)/ 20110714073726 http://www.justlabour.yorku.ca/ text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://www.justlabour.yorku.ca/index.php?page=toc&volume=16 - 462 880654831 WIDE-20110714062831-crawl416/WIDE-20110714070859-02373.warc.gz

For images, we grab them by filtering the index for a list of recurring file extensions.
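
Here is a sketch of how one might read those index lines and pull out the image records, assuming the standard eleven-field CDX layout shown above; the field names and extension list are illustrative:

    # Field names follow the common eleven-field CDX convention.
    FIELDS = ['urlkey', 'timestamp', 'url', 'mimetype', 'status',
              'digest', 'redirect', 'metatags', 'length', 'offset', 'filename']
    IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif')

    def parse_cdx_line(line):
        """Zip one whitespace-delimited CDX line into named fields."""
        return dict(zip(FIELDS, line.split()))

    with open('index.cdx') as cdx:  # hypothetical index file
        for line in cdx:
            record = parse_cdx_line(line)
            if record['url'].lower().endswith(IMAGE_EXTENSIONS):
                print(record['url'], record['filename'])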

AHA Talk.022

My dataset ended up looking like this.

AHA Talk.023

Another source that I use is GeoCities, which by the early 2000s had a whopping thirty-eight million websites. For many people, as archivist Jason Scott has eloquently argued, it represented their FIRST ACCESS TO A PUBLISHING PLATFORM – their first access to an audience.

AHA Talk.024

I don’t have time to talk about methods in depth, but in short they involve source transformations. We take our CDX files – our finding aid equivalents – and then select the WARC files we want. Each file is then sent through WARC-Tools and the Lynx web browser. This involves a source transformation – we lose images, background fonts, and so on – but as long as we’re transparent about it, this is the sort of unavoidable decision many historians need to make.
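
To make the transformation concrete, here is a rough sketch of that pipeline: pull the archived HTML out of a WARC file, then render it to plain text with Lynx. I am substituting the warcio library for WARC-Tools here, and assuming lynx is installed on the system:

    import subprocess
    from warcio.archiveiterator import ArchiveIterator

    def warc_to_text(warc_path):
        """Yield (url, plain_text) pairs for each archived page in a WARC file."""
        with open(warc_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'response':
                    continue
                html = record.content_stream().read()
                # lynx -stdin -dump reads HTML on stdin and emits plain text;
                # images, background fonts, and layout are lost along the way.
                result = subprocess.run(['lynx', '-stdin', '-dump', '-nolist'],
                                        input=html, capture_output=True)
                url = record.rec_headers.get_header('WARC-Target-URI')
                yield url, result.stdout.decode('utf-8', errors='replace')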

AHA Talk.025

Problem One, then, is that there are no finding aids. My approach is to use clustering. You encounter clustering all the time: Google News suggests similar articles; Amazon and other consumer websites suggest related items. If you use DevonThink, you can do this on collections of your own documents.
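
As a toy illustration of the idea – not the Lingo algorithm that Carrot2 (introduced below) actually uses – here is how clustering groups documents by shared vocabulary, sketched with scikit-learn:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Four toy 'documents'; real input would be the plain-text pages.
    docs = ['children health vaccines', 'Tylenol children dosage',
            'early learning classrooms', 'kindergarten early learning']
    vectors = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
    print(labels)  # documents with similar vocabulary share a cluster id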

AHA Talk.026

To build the database, I use Solr. Solr is, in a nutshell, a search engine optimized for working with millions of documents, returning results in milliseconds on a personal computer.

I then make it speak to a program called Carrot2, which clusters search results from Google, Bing… and yes, custom databases set up in Solr!
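
A minimal sketch of the Solr side, using the pysolr client; the core name and document fields are hypothetical, and Carrot2 would then cluster results drawn from this same index:

    import pysolr

    # Point at a (hypothetical) Solr core holding the transformed pages.
    solr = pysolr.Solr('http://localhost:8983/solr/webarchive')

    # Index a batch of documents: one per archived page.
    solr.add([{'id': 'http://www.justlabour.yorku.ca/',
               'text': 'plain text extracted from the page ...'}],
             commit=True)

    # Millisecond keyword searches across millions of documents.
    for result in solr.search('text:children', rows=10):
        print(result['id'])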

AHA Talk.027

In my other work, I am a historian of youth cultures; how could this methodology help somebody with my research interests? Here is a query for ‘children’.

At a glance, then, the visualization presents itself as a finding aid: we learn what these WARC files tell us about ‘children,’ and whether this is worth investigating further. In this case study, we see files relating to children’s health (ranging from Health Canada to Tylenol), research into children at various universities, health and wellness services, and related topics such as youth services, family services, and even mental health. Thus we have both an ad-hoc finding-aid equivalent and a way to move between the distant and close reading levels.

AHA Talk.028

Clusters often contain more than one file, and, much like a finding aid and archival box, the relationships between clusters can help shed light on the overall structure of a document collection. This positions the approach about halfway between a finding aid and exploratory data visualization. Consider the following image, generated via the Aduna visualization built into Carrot2:

In the above visualization, the labels represent clusters. If a document spans two or more clusters, it is shown as a dot connected to each of their labels. For example, “Christian Education” appears in the middle left of the chart. There is one document to the left of it (partially covered by the label) that belongs only to that cluster. Yet there is another to the right of it that is also connected to “Early Learning,” representing a website that falls into both categories.

From this, we can learn quite a bit about the files in the Wide Web Scrape, as well as which might be most fruitful for exploration. In this chart, at the bottom, we see the websites relating to children’s health, which connect to breastfeeding, which connects to timeframes, which in turn connect to employment (a cluster that often contains quite a bit of date and time information). We also see connections to early childhood workers, which in turn connect to early learning more generally. The structure of the web archive relating to children reveals itself. At a quick look, instead of wading through the sheer mass of text, we begin to see what information we can learn from this web archive.

AHA Talk.029

And once we find a relevant document – such as this one on Christian Education – we can click on it and be brought back to the original in the Internet Archive.

The problem is that, with this approach, we need to know what we are looking for.

Named-entity recognition, or NER, is one promising and relatively simple technique that historians can use to explore web archives. In short, an NER program reads text and marks it up to identify people, organizations, locations, and other categories. For web archives, we can run it over an entire corpus to gain a macroscopic view of the various topics being discussed. I will begin with Canadian (.ca) examples, before moving to a related example drawn from .edu sites. Consider the following visualization of countries – other than Canada – discussed in the Wide Web Scrape:

[I then walked the audience through a series of slides on NER. This was accomplished by running Stanford NER over the text, extracting the place names, taking the results into the Google Maps API, and seeing what we found. The big goal will be to do this on a longitudinal dataset, so you’d see changes over time.]
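
As a sketch of what that NER step might look like in code – an illustration under my own assumptions, not my exact pipeline – here is NLTK’s interface to Stanford NER counting place names in a single plain-text page (the model and jar paths are machine-specific):

    from collections import Counter
    from nltk.tag.stanford import StanfordNERTagger

    tagger = StanfordNERTagger(
        'english.all.3class.distsim.crf.ser.gz',  # Stanford's 3-class model
        'stanford-ner.jar')                       # path to the NER jar

    text = open('page.txt').read()   # one plain-text page from the pipeline
    tags = tagger.tag(text.split())  # [(token, label), ...]
    locations = Counter(tok for tok, label in tags if label == 'LOCATION')
    print(locations.most_common(10))  # candidates for the mapping step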

AHA Talk.030

AHA Talk.031

AHA Talk.032

AHA Talk.033

AHA Talk.034

AHA Talk.035

AHA Talk.036

AHA Talk.037

Or we can sit back and look at the humanity that generated this. There are ethical implications we need to consider:

  • these people did not know that their material was being archived;
  • opting out of web archives requires access to a site’s robots.txt file – most people do not even know that it exists (see the example after this list);
  • we do not have the traditional framework of donor agreements, etc. to fall back on.
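
For context, the opt-out mechanism in that second point is just a plain-text file at the root of a website. A hypothetical robots.txt that blocks the Internet Archive’s crawler – ia_archiver is the user agent it has historically honored – looks like this:

    User-agent: ia_archiver
    Disallow: /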

Now, of course, none of this means that we shouldn’t do this sort of work. But we need to begin thinking about it, or else other people will do that thinking for us. My own approach boils down to principles developed by people who work with the Association of Internet Researchers: considering expectations of privacy, but also the scope of my inquiry. I distant read thousands of websites at once. Individually reading a single website written by a twelve-year-old child, for example, requires more attention and care.

AHA Talk.038

I want to finish with four main conclusions:

  • This is coming!
  • We’re not ready.
  • We can do some great things.
  • But if we don’t get ready, other disciplines will do those great things. Let’s make sure history, as a profession, is ready as we enter the era of web archives.

Thanks very much!
