“Mining the Internet Graveyard: Exploring Canada’s Digital Collections Projects” – Presentation at the Canadian Historical Association

This is my presentation from the 2012 Canadian Historical Association Annual Meeting. As with my previous presentation, there are parts of it presented using Mathematica that I cannot represent here. I’ve provided static images, however, and am happy to provide examples through Skype or elsewhere.

My spoken text deviates slightly from what is below, as I occasionally ad-lib a bit, but you should get a sense of my argument here.



Hello everybody, in today’s talk I want to tackle three things. Firstly, I want to outline a serious problem that I see confronting today’s and tomorrow’s social and cultural historians. Secondly, I want to propose a way forward with computational history. Thirdly, I want to work through a case study – that of “Canada’s Digital Collections Projects” – to help make some of my abstract arguments a bit more concrete.

What is the problem?

I want to open with a quotation from James Gleick and his 2011 book “The Information: A History, a Theory, a Flood.” He writes:

“The information produced and consumed by humankind used to vanish – that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Marks on stone, parchment, and paper were the special case. …. Now expectations have inverted. Everything may be recorded and preserved, at least potentially.”

So when will this flood become “history”? I think part of the problem lies in determining when ‘history’ becomes History, with a capital H. In my own field, the sixties, if we take the pivotal year of 1968, the first scholarly monographs appear under twenty years later, in the early 1980s, and by 1996 – not even thirty years on – there is certainly a growing Canadian historiography.

In 1991, Sir Tim Berners-Lee published the first website. That was just over twenty years ago, and within another ten, historians will certainly be entering the ‘web’ era of history.

Furthermore, LAC has stopped acquiring materials, and budget cuts have imperiled the acquisitions of local and provincial archives as well. When it comes time to write the history of the 1990s or the 2000s, we won’t have traditional archives to rely on. If we’re not ready to deal with this big repository of history, there’s a chance that somebody else will do it for us. Let’s make sure historians are at the table.

Are we ready??

Some numbers can help bring the problem into relief.

If we take the Library of Congress, the largest repository of information in the world, and scan every book at roughly 8MB per book, we end up with a dataset of about 200TB. In 1949, the father of information theory, Claude Shannon, saw the Library of Congress as the very largest stockpile of information he could conceive of. Today, I could walk into Best Buy, drop $1,200 – even after the rise in prices following the flooding in Thailand – and store it all at home.

Don’t do this.

The Library of Congress shrinks in comparison to the Wayback Machine, which is part of the Internet Archive, and collects webpage information. It is currently 3PB and is growing by 12TB/month.

To put this into relief, a PB is 1,000 TB. The Library of Congress is approximately 0.2 PB, or 6% of the Wayback Machine. And approximately 6% of the Library of Congress is added EVERY MONTH to the Wayback Machine. This is going to be an astounding historical source to think about.
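For the numerically inclined, here is a quick back-of-the-envelope check of those proportions – a Python sketch rather than the Mathematica used in the live demo, using the rough figures quoted above:

```python
# Back-of-the-envelope check of the scale comparison above.
# The figures are the approximate ones quoted in the talk.
LOC_TB = 200                      # Library of Congress, scanned at ~8 MB per book
WAYBACK_PB = 3                    # Wayback Machine holdings
WAYBACK_GROWTH_TB_PER_MONTH = 12  # reported monthly growth

wayback_tb = WAYBACK_PB * 1000    # 1 PB = 1,000 TB

print(f"LoC as a share of the Wayback Machine: {LOC_TB / wayback_tb:.1%}")
print(f"Monthly Wayback growth as a share of the LoC: "
      f"{WAYBACK_GROWTH_TB_PER_MONTH / LOC_TB:.1%}")
# -> roughly 6.7% and 6.0%, respectively
```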

And it isn’t just born-digital sources, either. Google, in its “moon shot,” has set the audacious goal of digitizing all accessible books in the world by 2020.

Canada is a bit of a laggard, but Library and Archives Canada now has a 7TB web archive – largely unknown – of which 4TB consists of archived government web pages.

Here we can see it in bar chart form – if you squint, you might just make out LAC’s collections:

Now, what can we do with all of that data?

Now, the first project that many people think of – and something that piqued my interest – was Culturomics, or the Google Ngram project, run out of the Harvard Cultural Observatory, which made a splash with a high-profile article in Science. They essentially took five percent of the books ever published in human history, analyzed them for word and phrase frequency, and made quantitative claims about understanding history. Thirteen authors were on the project: from English literature, psychology, computer science, biology, and mathematics.

Where were the historians?

Responding to a criticism on that account by Anthony Grafton, the president of the American Historical Association, the two principal authors Jean-Baptiste Michel and Erez Lieberman Aiden were clear and succinct:

“The historians who came to the meeting were intelligent, kind, and encouraging. But they didn’t seem to have a good sense of how to wield quantitative data to answer questions, didn’t have relevant computational skills, and didn’t seem to have the time to dedicate to a big multi-author collaboration.
It’s not their fault: these things don’t appear to be taught or encouraged in history departments right now.”

Now, I don’t think Culturomics is the be-all and end-all by any means – it offers no real options for contextualization, it makes some overreaching claims about its capacity to rethink the entire discipline, and so on. But historians have to be at the table if we’re going to engage on a fair, level playing field.

Now, I want to provide a brief example of one born-digital case of too much information, and some of the ways that I have been trying to tackle it quickly. I hope that, for the concrete thinkers out there, this can help anchor some of the discussion about ‘the problem’ and ‘the solution’.

A collection that I have been learning and practicing with is the ‘Internet Graveyard’: a selection of over 500 websites made by young Canadians between 1996 and 2004, before funding was cut and the program was archived at Library and Archives Canada – one of Canada’s first born-digital archival collections.

How much information is there, sitting half-dead at LAC? Well, it’s a lot: 7,805,135 words, 49 million characters. That might not sound like too much, but it’s spread across 78,585 HTML files. It’s 7.26GB in total – 360MB of that as plain text – and if you remove all the HTML tags it’s a mere 50MB of unstructured data.
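For the curious, here is a minimal sketch of how totals like these can be tallied from a local copy of the collection – in Python here rather than the Mathematica of the live demo, with a hypothetical directory name standing in for wherever the files actually sit:

```python
# A minimal sketch of how totals like these can be tallied from a local
# copy of the collection. The directory name is hypothetical, only
# .html/.htm files are counted, and the tag stripping is deliberately crude.
import os
import re

ROOT = "digital_collections"  # hypothetical path to a local mirror

files = words = chars = 0
for dirpath, _, filenames in os.walk(ROOT):
    for name in filenames:
        if not name.lower().endswith((".html", ".htm")):
            continue
        files += 1
        with open(os.path.join(dirpath, name), errors="ignore") as f:
            html = f.read()
        text = re.sub(r"<[^>]+>", " ", html)  # drop tags, keep the text
        words += len(text.split())
        chars += len(text)

print(f"{files:,} HTML files, {words:,} words, {chars:,} characters")
```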

It’s worth noting that these are files that make regular text editors hang.

I imagine this as walking into an archive – an archive akin to the closing scene of Indiana Jones: boxes stretching as far as the eye can see, no finding aids, and often no way to tell what’s in a box beyond an IP address or a cryptic title. The tools I’m currently playing with let you look inside the box: to quickly find relevant information and to explore the structure of a website.

Now, once you have it all on your home computer, you can start crunching away. One thing I want to highlight in the short time remaining is how we can move this from a dead collection to a set of dynamic finding aids. By crunching word and phrase frequencies, we can quickly get a sense of what it is all about.
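As a rough illustration of that kind of crunching, here is a minimal Python sketch that tallies words and two-word phrases – the live demo did this dynamically in Mathematica, and the stopword list and sample sentence below are placeholders rather than the collection itself:

```python
# A minimal sketch of the word- and phrase-frequency "finding aid" idea:
# tally words and two-word phrases so a researcher can see at a glance
# what a site is about. The stopword list and sample text are placeholders.
from collections import Counter
import re

STOPWORDS = frozenset({"the", "of", "and", "to", "a", "in", "is", "for"})

def frequencies(text):
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))  # phrases counted after stopword removal
    return unigrams, bigrams

sample = "Canada's Digital Collections preserved websites made by young Canadians."
unigrams, bigrams = frequencies(sample)
print(unigrams.most_common(5))
print([(" ".join(pair), n) for pair, n in bigrams.most_common(5)])
```

Run over the stripped text of a whole website, a listing like this starts to function as a quick-and-dirty finding aid for what would otherwise be an unreadable mass of files.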

Why shouldn’t we just use Google for all our searches? Google is great, and very helpful. It doesn’t always work on massive born-digital archives, however, and, more importantly, it’s a proprietary black box. Others have noted that Google already shapes our professional research inquiries to a disproportionate degree, and it’s a tool that many of us use without even thinking about the possible error creeping in.

So let’s start our own searching, exploring, and crunching. To do so, we’ll need to find ways to visualize all this textual information – we can’t sit down and read every line.

So the gateway drug to this, for many, is a Wordle. I give this example not because it’s the best visualization technique – it offers no way to drill down into context, or to get a sense of much beyond how often a word appears – but because for many it’s the most basic visualization they know. And, in a very broad sense, it can turn a random file – in this case “pastandpresent” – into something somewhat distinguishable.

AT THIS POINT, I SWITCH TO MATHEMATICA AND RUN THROUGH THE PROGRAM DYNAMICALLY WITH AUDIENCE INPUT.

This case study is intended to provide one model for distant reading and the generation of finding aids, in the interest of showing some of the potential of working with born-digital sources. It is intended to jumpstart a conversation about the resources and skills necessary for the next generation. A seemingly insurmountable amount of information, spread across more than 78,000 files, can be quickly navigated and isolated for the researcher’s convenience.

To conclude: there is a problem. The web is just over twenty years old, and before historians realize it, we will have to develop the critical infrastructure to deal with it.

So, my closing message: if history is to continue as the leading discipline in understanding the social and cultural past, decisive movement is needed.

When the sea comes in, we must be ready to swim. Thank you very much!
