Herrenhausen Lightning Talk on Historians and Web Archives

I had three minutes to give a talk to a diverse audience; my attempt is below. The audience could then follow up with us at our poster, which I’ll also put online. It was an incredible conference: wonderful to meet people from so many countries, continents, and universities, and the Volkswagen Foundation was a most generous host.

I want to open with a provocative statement: Historians (at least in North America) are generally unprepared to engage with the quantity of digital sources that will fundamentally transform their trade. They have been underrepresented in large digital endeavours, such as the Harvard Cultural Observatory’s Culturomics project, ceding disciplinary ground to computer scientists, evolutionary biologists, and English literature scholars. Historians, however, bring an important perspective to the table: professional training in historiography, expertise in evaluating primary sources, experience in balancing various scales of time, and an awareness of decades of professional development in social, cultural, and political histories. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital is necessary. Every day, most people generate born-digital information that, if held in a traditional archive, would form a sea of boxes, folders, and unstructured data. We need to be ready.

An important source for historians is the Internet Archive. This organization began a web archiving project in 1996 on an indiscriminate basis: it crawls the World Wide Web and takes snapshots of what it finds, preserving them in a standardized file format known as WARC. Unfortunately, these files are currently only accessible to those with specialized technical knowledge and fluency with a command line interface. WARC files may be the archival boxes of the future, but they are currently locked away from end users and relatively unknown. My research aims to rectify that problem through a case study of Canadian WARC files held at the Archive.

Now an example can help explain the scope of the problem.

If we were to go into the Library of Congress and digitize each book (using an average of eight megabytes per book), we would have a dataset of 200 terabytes. In contrast, the Internet Archive now has ten petabytes of information saved. Consider the graph at right: it is not a mistake that the Library of Congress is barely visible.
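For anyone who wants to check the comparison, here is a quick back-of-the-envelope sketch using only the figures quoted above. It assumes decimal (SI) units, which the talk does not specify, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the figures quoted in the talk.
# Assumption: decimal (SI) units; the talk does not say which convention it uses.
MB = 10**6
TB = 10**12
PB = 10**15

book_size = 8 * MB               # average size of one digitized book
library_of_congress = 200 * TB   # quoted size of the digitized collection
internet_archive = 10 * PB       # quoted size of the Internet Archive

# Number of books implied by the two Library of Congress figures.
implied_books = library_of_congress // book_size

# How many digitized Libraries of Congress fit into the Internet Archive.
ratio = internet_archive // library_of_congress

print(f"Implied number of books: {implied_books:,}")
print(f"Internet Archive / Library of Congress: {ratio}x")
```

On these numbers the Internet Archive is fifty times the size of the digitized Library of Congress, which is why the latter is so hard to see on the graph.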

My research is still new, but is focused on several key questions:

  • What do historians need to know to access web archives? Can we make it easier?
  • What have other scholars studied?
  • What tools can we use?

Initially, I have been using Mathematica, Solr, and the Stanford Named Entity Recognizer to approach the first vexing problem: the lack of finding aids. For each file, I generate previews: word clouds; lists of the leading people, organizations, and locations; and extracted topics. I also use a search engine that I can understand, and cluster the results with the Carrot2 clustering workbench. We can then move quickly from the macro-level of a 5% tranche of the Canadian World Wide Web down to the level of the individual archived web document, within the context of much broader information.
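To give a flavour of what such a preview involves, here is a minimal sketch of the word-frequency step that sits behind a word cloud. It is written in plain Python rather than the Mathematica workflow described above, and it is not that workflow: the stopword list, the `preview_terms` function, and the sample text are all hypothetical, and it assumes the text has already been extracted from the archived page.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was"}


def preview_terms(text, top_n=10):
    """Return the top_n most frequent non-stopword terms in text."""
    # Lowercase the text and pull out runs of letters (and apostrophes).
    words = re.findall(r"[a-z']+", text.lower())
    # Drop stopwords and very short tokens before counting.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)


# Illustrative sample standing in for text extracted from one archived page.
sample = (
    "The archive preserves Canadian websites. Canadian historians can "
    "search the archive, and the archive grows daily."
)

for term, count in preview_terms(sample, top_n=2):
    print(term, count)
```

The same counts could feed a word cloud or a simple ranked list; named-entity and topic extraction would need dedicated tools such as those mentioned above.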

My hope is that this will not only help historians find the information they might need but also, and more crucially, start a conversation amongst humanists. The goal is to bundle it all together as a suite of tools for historians known as history crawler.

I look forward to chatting with you at my poster, where I also have draft papers available.

Thank you very much.

Thanks to @Katharina_NL for taking the picture of me in action!
