From Dataverse to Gephi: Network Analysis on our Data, A Step-by-Step Walkthrough

[Screenshot: the link graph rendered in Gephi]

Do you want to make this link graph yourself from our data? Read on.

As part of our commitment to open data – in keeping with the spirit and principles of our funding agency, as well as our core scholarly principles – our research team is beginning to release derivative data. The first dataset that we are releasing is the Canadian Political Parties and Interest Groups link graph data, available in our Scholars Portal Dataverse space.

The first file is all-links-cpp-link.graphml, which consists of all of the links between websites found in our collection. It was generated using warcbase’s function that extracts links and writes them to a graph network file, documented here. The exact script used can be found here.
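If you want a quick peek at what is inside the file before firing up Gephi, a few lines of Python will do it. This is my own aside rather than part of the workflow described here, and it assumes you have networkx installed (`pip install networkx`):

```python
# Quick sanity check of the released link graph using networkx
# (an assumption of mine; not part of the warcbase workflow above).
import networkx as nx

# Filename as released on the Dataverse page.
graph = nx.read_graphml("all-links-cpp-link.graphml")

print(f"{graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")

# List the ten best-connected sites before opening the file in Gephi.
top = sorted(graph.degree(), key=lambda pair: pair[1], reverse=True)[:10]
for node, degree in top:
    print(degree, node)
```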

However, releasing data is only useful if we show people how they can use it. So! Here you go.

Video Walkthrough

This video walkthrough is best viewed in conjunction with the step-by-step walkthrough below.

Step-by-Step Walkthrough

Once you’ve downloaded the file, open up Gephi.

On the opening screen, choose “Open a Graph File…” and select the all-links-cpp-link.graphml file that you downloaded from our Dataverse page. Continue reading

Herrenhausen Big Data Podcast: Coding History on GeoCities

Last post (a three-day conference deserves three posts, right?) from my trip to Hannover, Germany, for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation and wanted to link to it here.

You can listen to the podcast here. Hopefully I am cogent enough!

It grew out of my lightning talk and poster, also available on my blog.

My thanks again to the VolkswagenStiftung for the generous travel grant to make my attendance possible. It was a wonderful conference.

Herrenhausen Big Data Lightning Talk: Finding Community in the Ruins of GeoCities

I was fortunate to receive a travel grant to present my research in a short three-minute slot, plus a poster, at the Herrenhäuser Konferenz: Big Data in a Transdisciplinary Perspective in Hanover, Germany. Here’s what I’ll be saying (pretty strictly) in my slot this afternoon. Some of it is written to fit the time format (if you scroll down you will see that there is an actual bell).

If you want to see the poster, please click here.

[Slides: Milligan_LT_Big_Data]

Big Data is coming to history. The advent of web archived material from 1996 onwards presents a challenge. In my work, I explore what tools, methods, and approaches historians need to adopt to study web archives.

GeoCities lets us test this. It is one of the largest records of the lives of non-elite people ever assembled. The Old Bailey Online can rightfully describe its 197,000 trials, spanning 1674 to 1913, as the “largest body of texts detailing the lives of non-elite people ever published.” But GeoCities, drawing on the material we have between 1996 and 2009, has over thirty-eight million pages.

These are the records of everyday people who published on the Web, reaching audiences far bigger than previously imaginable. Continue reading

Adapting the Web Archive Analysis Workshop to Longitudinal Gephi: Unify.py

The Problem

This script helps us get longitudinal link analysis working in Gephi

When playing with WAT files, we’ve run into the issue of getting Gephi to work reliably with the shifting node IDs generated by the Web Archive Analysis Workshop. It’s certainly possible – this post demonstrated the outcome when we could get it working – but it’s extremely persnickety. That’s because node IDs change for each time period you’re analyzing: liberal.ca could be node ID 187 in 2006, for example, but then be remapped to node ID 117 in 2009. It’s best to just turn it all into textual data for graphing purposes.

Let me explain. If you’re trying to generate a host-IDs-to-host-IDs graph, you get two files. They come out as Hadoop output files (part-m-00000, etc.), but I’ll rename them to match the commands used to generate them. Let’s say you have:

[note: this functionality was already in the GitHub repository, but not part of the workshop. There is lots of great stuff in there. As Vinay notes below, there’s a script that does this – found here.]
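Until you dig into that script, here is a minimal sketch of the general idea behind Unify.py: swap the per-period numeric node IDs for hostnames so that edges from different years can be matched by name. The file layout here is an assumption (an id-to-host mapping plus a source/target/count edge list, with hypothetical filenames); the real Hadoop output may differ.

```python
# A minimal sketch of the idea behind Unify.py (not the actual script):
# replace per-period numeric node IDs with hostnames so edges from
# different years can be compared by name rather than by ID.
# Assumed (hypothetical) formats: an "id<TAB>host" mapping file and a
# "source_id<TAB>target_id<TAB>count" edge list.
import csv

def load_id_to_host(path):
    """Build a {node_id: hostname} lookup from the id-to-host mapping file."""
    with open(path, newline="") as handle:
        return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t")}

def unify(id_to_host_path, links_path, out_path):
    """Rewrite an id-to-id edge list as a host-to-host edge list."""
    hosts = load_id_to_host(id_to_host_path)
    with open(links_path, newline="") as links, \
         open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for source_id, target_id, count in csv.reader(links, delimiter="\t"):
            if source_id in hosts and target_id in hosts:
                writer.writerow([hosts[source_id], hosts[target_id], count])

# Hypothetical filenames, e.g.:
# unify("host-ids-2006.tsv", "host-id-links-2006.tsv", "host-links-2006.tsv")
```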

Continue reading

Exploding ARC Files with Warcbase

224,977 URLs with 254,702 hyperlinks between them (from a random sample of 2008 CommonCrawl data).

With the help of RA extraordinaire Jeremy Wiebe, I’ve been playing with Jimmy Lin’s warcbase tool. It’s:

… an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Jeremy’s been working on documenting it for OS X, which I’m hoping to share soon. It’s a neat suite of tools for playing with web-archived files. While right now everything works with the older, now-deprecated ARC file format (they note the irony given the name), Jeremy has been able to add support for WARC ingestion as well.
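If you just want to see what an ingestion step is actually consuming, here is a short sketch that walks the records of an ARC or WARC file. It uses the warcio Python library, which is a substitution of mine and not part of warcbase or Jeremy’s documentation:

```python
# Walk the response records in an ARC or WARC file using warcio
# (`pip install warcio`; my own substitution, not part of warcbase).
from warcio.archiveiterator import ArchiveIterator

def list_responses(path):
    """Print the size and target URI of each HTTP response record."""
    with open(path, "rb") as stream:
        # arc2warc=True normalizes older ARC headers to WARC-style names.
        for record in ArchiveIterator(stream, arc2warc=True):
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(len(body), uri)

# Hypothetical filename; works for .arc(.gz) and .warc(.gz) alike.
# list_responses("example.warc.gz")
```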

What can we do with it? Continue reading