Last post (three day conference deserves three posts, right?) for my trip to Hannover, Germany for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation, and wanted to link to it here.
I was fortunate to receive a travel grant to present my research in a short, three-minute slot plus poster at the Herrenhäuser Konferenz: Big Data in a Transdisciplinary Perspective in Hanover, Germany. Here’s what I’ll be saying (pretty strictly) in my slot this afternoon. Some of it is designed to respond to the time format (if you scroll down you will see that there is an actual bell).
Big Data is coming to history. The advent of web archived material from 1996 onwards presents a challenge. In my work, I explore what tools, methods, and approaches historians need to adopt to study web archives.
GeoCities lets us test this. It will be one of the largest records of the lives of non-elite people ever. The Old Bailey Online can rightfully describe their 197,000 trials as the “largest body of texts detailing the lives of non-elite people ever published” between 1674 and 1913. But GeoCities, drawing on the material we have between 1996 and 2009, has over thirty-eight million pages.
When playing with WAT files, we’ve run into the issue of getting Gephi to play accurately with the shifting node IDs generated by the Web Archive Analysis Workshop. It’s certainly possible – this post demonstrated the outcome when we could get it working – but it’s extremely persnickety. That’s because node IDs change for each time period you’re running the analysis: i.e. liberal.ca could be node ID 187 in 2006, but then remapped to node ID 117 in 2009. It’s best if we just turn it all into textual data for graphing purposes.
Let me explain. If you’re trying to generate a graph of the host-ids-to-host-ids, you get two files. They come out as hadoop output part-m-00000, etc. files, but I’ll rename them to match the commands used to generate them. Let’s say you have:
[note: this functionality was already in the GitHub repository, but not part of the workshop. There is lots of great stuff in there. As Vinay notes below, there’s a script that does this – found here.]
With the help of RA extraordinaire Jeremy Wiebe, I’ve been playing with Jimmy Lin’s warcbase tool. It’s:
… an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.
Jeremy’s been working on documenting it for OS X, which I’m hoping to share soon, but it’s a neat suite of tools to play with web archived files. While right now everything works with the older, now-depreciated ARC file format (they note the irony given the name), Jeremy has been able to add support for WARC ingestion as well.