Guided Tour of the Canadian Political Party Web Archive

What on earth is this? Spaghetti and meatballs? Turns out it’s full of useful information when we pry it apart.

Those who follow me on Twitter know that I’ve been playing with the WAT file format, and in particular have been undertaking a crash course in Gephi. It’s been really rewarding! It’s already led me to dig into the Wayback Machine and find out that the No Shari’a law campaign transitioned to an anti-Iranian-embassy campaign around the same time that it became less relevant within the link structure of the Canadian political sphere.

But in short, this stuff matters because it takes us from opaque files that are hard to deal with – WARC and WAT files – to something that we can actually work with and begin to ask research questions about.

If you’ve got a fantastic campaign, you can try playing with the PDF export of this. I could make it prettier, but I think there’s a limited audience for it.

My sincerest thanks to Micki Kaufman – her help (and willingness to walk me through some of the data wrangling on a screenshare) made this possible.

Some preliminary thoughts below on the WAT files and what we can learn from this sort of analysis. I’m not a YouTube personality, and Gephi can occasionally be persnickety – especially when one foolishly tries to work remotely on a laptop in the morning – but I think it showcases some of the possibilities. More soon.

Using Gephi to Explore Web Archive Structures

The evolution of inbound links, Canadian political movements, 2005-2014.

In my last post, I discussed how I could take WAT files and extract Gephi graphs from them. In this post, I want to show how I’ve moved past that and am now working with dynamic Gephi graphs. Some of the fruits of this can be seen in the animated GIF at left. It’s been two days of learning, which is among my favourite things to do! Again, this is super preliminary: mostly just liveblogging some of the questions that are popping into my head from these files.

To generate these, I did the following (mostly following the Import Dynamic Data tutorial, as well as getting helpful hints from Micki Kaufman):

  • Drew on a Gephi file generated for each year of these collections (thanks to my RA, Jeremy Wiebe);
  • Implemented a number of tweaks: ran a modularity detection algorithm and coloured clusters accordingly, played with the ‘Topology -> In Degree Range’ filter to generate versions limited to 2 and 4 in-bound links, and made sure to extract these to a new workbench;
  • Each workbench was then exported as a GEXF file;
  • I then started a new project, and opened each GEXF file in turn, making sure to select ‘time series’ and filling out the box asking for the date (I used the year value). A scripted equivalent is sketched just below.
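
In case a scripted route is handy, here is a minimal sketch of the same merge done outside Gephi’s interface with networkx. The per-year filenames are hypothetical, and it assumes networkx’s GEXF writer treats start/end attributes as dynamic spells, so double-check the output against the Import Dynamic Data tutorial before relying on it.

# A minimal sketch: merge per-year GEXF exports into one dynamic GEXF file.
# Filenames (cpp-2005.gexf, etc.) are hypothetical; Gephi's own 'time series'
# import does the same job through the GUI.
import networkx as nx

merged = nx.DiGraph(mode="dynamic")  # hint for the GEXF writer

for year in range(2005, 2015):
    yearly = nx.read_gexf(f"cpp-{year}.gexf")
    for node, data in yearly.nodes(data=True):
        if node not in merged:
            merged.add_node(node, **data)
            merged.nodes[node]["start"] = year  # first year the node appears
        merged.nodes[node]["end"] = year + 1    # last year seen so far
    for src, dst, data in yearly.edges(data=True):
        if not merged.has_edge(src, dst):
            merged.add_edge(src, dst, **data)
            merged.edges[src, dst]["start"] = year
        merged.edges[src, dst]["end"] = year + 1

# Note: this treats presence as one continuous spell, a simplification if a
# site drops out and then reappears. networkx writes start/end as dynamic
# attributes, so Gephi should pick up the timeline when the file is opened.
nx.write_gexf(merged, "cpp-2005-2014-dynamic.gexf")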

The results have been illuminating, although analysis is obviously still to come. I wish there were a way to export the ‘chart’ results that one can generate in the ‘ranking’ section of the workbench, but apparently this doesn’t exist. Continue reading
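
Until such an exporter turns up, one rough workaround is to compute the ranking data outside Gephi and chart it elsewhere. Here’s a hedged sketch with networkx, assuming the per-year GEXF files are directed graphs and named as below:

# A rough workaround sketch: compute the in-degree 'ranking' data outside Gephi
# and write it to CSV for charting elsewhere. Filenames are hypothetical, and
# this assumes the yearly GEXF files are directed graphs.
import csv
import networkx as nx

with open("in-degree-by-year.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["year", "node", "in_degree"])
    for year in range(2005, 2015):
        graph = nx.read_gexf(f"cpp-{year}.gexf")
        for node, degree in graph.in_degree():
            writer.writerow([year, node, degree])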

Accessing Historical Data En Masse

Click to download the slide deck.

Today I’m giving a workshop at the Massachusetts Institute of Technology on “Accessing Historical Data en Masse.” Slides, links, and more details are available on this standalone page. It’s part of a broader seminar on “Research, Teaching, and Digital Humanities” held as part of their World History Seminar.

Hopefully others find it useful. While the slides aren’t a perfect substitute for having me go through the examples, the session is – I hope – being videotaped. Perhaps we can make that available?

WAT Files to Gephi Graphs

I’ve been playing with a large collection of WAT files. WAT files, specified here, are collections of web archive metadata records. They’re a lot easier to move around and deal with than the full WARC files: they’re at a scale where you can put a few on a thumb drive, or years upon years of collections on a standard portable hard drive, rather than needing the storage arrays required for oodles of WARCs.

There was a bit of a learning curve for me, however, so I wanted to share the steps that I took to take a WAT file and generate a link visualization.
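
Before getting to those steps, it can help to peek at what a WAT record actually contains. Here’s a minimal sketch using the warcio Python library rather than the Pig-based workflow I describe in this post; the filename and the JSON field paths are assumptions based on the WAT specification, so check them against your own files.

# A minimal sketch for peeking inside a (gzipped) WAT file with warcio.
# The filename and JSON field paths are assumptions based on the WAT spec.
import json
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wat.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":  # WAT records are WARC metadata records
            continue
        payload = json.loads(record.content_stream().read())
        envelope = payload.get("Envelope", {})
        source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
        links = (envelope.get("Payload-Metadata", {})
                         .get("HTTP-Response-Metadata", {})
                         .get("HTML-Metadata", {})
                         .get("Links", []))
        for link in links:
            if "url" in link:  # anchors, images, scripts, and so on
                print(source, "->", link["url"])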

I’m not really going to dwell on results, because I literally generated these this morning. I plan to get longitudinal data between 2005 and 2014. I’m not convinced the networks themselves are going to be key, but the reports I can generate in Mathematica and Gephi should be useful.

And in any case, maybe this’ll help you. Continue reading

“Text Analysis, Visualization, and Historical Interpretation” and the Digital Drop-In Room

On Monday, January 5th in the 11AM – 1PM slot at the American Historical Association’s annual meeting, I’m presenting on a panel entitled “Text Analysis, Visualization, and Historical Interpretation.” Rather than plug my individual paper – the abstract is here – I want to plug the whole panel. It’s an awesome lineup, full of people who I’m looking forward to meeting in real life for the first time. If you’re at the conference and looking for something to do on Monday mid-day, come on by Murray Hill Suite A (New York Hilton, Second Floor)! Details quoted below.

I’ll also be participating in the “Digital Drop-in Room” on Sunday, January 4th between 2:30 and 4:30 PM, in the Liberty Suite 3 (Sheraton New York, Third Floor). There’s an awesome lineup of historians who have signed up to talk, chat, and help folks with digital projects. If you’re around on Sunday, come say hi.

Anyways, I’m really looking forward to connecting with many folks who I only know by e-mail, from Twitter, by reputation, or from their work. Continue reading

Playing with the Web Archive Analysis Workshop: A Few Tips for Fellow Tinkerers

Yesterday, I spent part of the day tinkering with the Web Archive Analysis Workshop, put together by the Internet Archive’s Vinay Goel. It’s a great workshop that really lets us play and extract tons of useful data from WARC files. In this post, I won’t rehash the commands, but want to show a few things that I needed to do to get it up and running, and then show some of the potential.

This is mostly just a research notebook for anybody else who stumbles down the path after me.

Getting it Started

The instructions for installation are pretty straightforward. I think I just ran brew install hadoop to get it all working. On my version of OS X, I received an error message when running several of the scripts:

ERROR 2998: Unhandled internal error. org/python/core/PyObject

I was missing a standalone Jython jar in my pig installation. To fix this, I downloaded Jython from here, generated a standalone jar and copied it to my pig install like so:

# Build a standalone Jython jar silently (-s) in /tmp/jython-install, using the current JDK
java -jar jython-installer-2.5.3.jar -s -d /tmp/jython-install -t standalone -j $JAVA_HOME
# Copy it into the lib directory of the Homebrew Pig install
cp /tmp/jython-install/jython.jar /usr/local/Cellar/pig/0.13.0/lib

That got it all working. For convenience, so I could get back up and running quickly after rebooting, I created a script with these commands, which I’d run in the directory with my content (in my case, /users/ianmilligan1/internet-research/).

# Environment setup for the workshop scripts (run from the directory holding the checkout and sample data)
# Where the archive-analysis workshop scripts live
export PROJECT_DIR=`pwd`/archive-analysis/
# Where the sample dataset (and its derived-data output) lives
export DATA_DIR=`pwd`/sample-dataset/
cd $PROJECT_DIR
# Tell the scripts where Homebrew installed Pig
export PIG_HOME=/usr/local/Cellar/pig/0.13.0

Running the Code and Analyzing Results with Bash and Excel: A Few Tips

It should all run quite well. The script is pretty straightforward. Output, as with other Pig programs I’ve run in the past, is generated into subfolders of the ~/internet-research/sample-dataset/derived-data directory as partial files: i.e. part-00000 or part-00000.gz. Continue reading
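
Since a spreadsheet doesn’t love dozens of part files, here’s a small sketch that stitches one derived-data subfolder back into a single file, handling both plain and gzipped parts; the subfolder name is a placeholder.

# A small sketch for pulling Pig's partial output back into one file.
# The derived-data subfolder name is a placeholder; point it at whichever
# output you want to combine.
import glob
import gzip
import os

folder = os.path.expanduser("~/internet-research/sample-dataset/derived-data/some-output")
with open("combined.tsv", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob(os.path.join(folder, "part-*"))):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8") as part:
            out.writelines(part)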

Exploding ARC Files with Warcbase

224,977 URLs with 254,702 hyperlinks between them (from a random sample of 2008 CommonCrawl data).

With the help of RA extraordinaire Jeremy Wiebe, I’ve been playing with Jimmy Lin’s warcbase tool. It’s:

… an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Jeremy’s been working on documenting it for OS X, which I’m hoping to share soon, but it’s a neat suite of tools for playing with web archived files. While right now everything works with the older, now-deprecated ARC file format (they note the irony given the name), Jeremy has been able to add support for WARC ingestion as well.

What can we do with it? Continue reading