WAT Files to Gephi Graphs

I’ve been playing with a large collection of WAT files. WAT files, specified here, are collections of web archive metadata records. They’re a lot easier to move around and work with than full WARC files: a few fit on a thumb drive, and years upon years of collections fit on a standard portable hard drive, rather than the storage arrays you need to deal with oodles of WARCs.

There was a bit of a learning curve for me, however, so I wanted to share the steps that I took to take a WAT file and generate a link visualization.

I’m not really going to dwell on results, because I literally generated these this morning. I plan to get longitudinal data between 2005 and 2014. I’m not convinced the networks themselves are going to be key, but the reports I can generate in Mathematica and Gephi should be useful.

And in any case, maybe this’ll help you.

Step One: Using Web Archive Analysis Workshop to Extract Data from WAT Files

Much of this followed the examples set out in the Web Archive Analysis Workshop, specifically:

Extracting links

$PIG_HOME/bin/pig -x local -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-wats.gz pig/wats/extract-surt-canon-links-with-anchor-from-warc-wats.pig

Creating a Host Graph

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ pig/graph/convert-link-data-to-host-graph-with-counts.pig

And then creating link graph data:

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_NO_TS_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p O_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ pig/graph/convert-link-data-no-ts-to-sorted-id-data-no-ts.pig

Once we have this data, here’s what I did.
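Pig writes its output as a directory of part files rather than one file, so a small bridging step is handy before taking the data anywhere else. Here's a hedged Python sketch that concatenates the part files into a single TSV; the directory and output names are illustrative, not from the original workflow.

```python
import glob
import gzip
import os

def concat_part_files(part_dir, out_path):
    """Concatenate Pig part-* output files into one TSV.

    Handles both plain and gzip-compressed part files, in sorted
    part-file order.
    """
    with open(out_path, "w") as out:
        for part in sorted(glob.glob(os.path.join(part_dir, "part-*"))):
            opener = gzip.open if part.endswith(".gz") else open
            with opener(part, "rt") as f:
                for line in f:
                    out.write(line)

# Illustrative paths -- adjust to your $DATA_DIR layout:
# concat_part_files("derived-data/graph/host-id-sortedint.graph", "graph.tsv")
```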

Step Two: Taking Data into Mathematica to Generate Source/Target Information (you could adapt this step)

In Mathematica, I ran the following code:

data = Import["graph.tsv", "Data"];

results = Reap[
   Do[
    line = data[[x]];
    Sow[Thread[ToExpression[line[[1]]] -> ToExpression[line[[2]]]]],
    {x, Range[1, Length[data], 1]}]];

graphdata = results[[2]][[1]];

This code basically imports our host-id-sorted-int data, and then breaks it apart so that it’s a long list of HOST –> DESTINATION format. Once it’s in Mathematica, we can use its own graphing/networking suite, or we can throw it into Gephi.
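If you don’t have Mathematica, the same transformation is a few lines of Python. This is a sketch, assuming the graph data is a two-column TSV of integer ids (the `graph.tsv` filename is an assumption):

```python
import csv

def read_edges(tsv_path):
    """Read a two-column TSV of integer ids into a list of
    (source, target) pairs -- the Python analogue of the
    Mathematica source -> target rules above."""
    edges = []
    with open(tsv_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                edges.append((int(row[0]), int(row[1])))
    return edges

# edges = read_edges("graph.tsv")
```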

The results are akin to:
[Screenshot: the resulting edge list in Mathematica]

Step Three: Wrangle into Gephi

I prefer using Gephi, so I export it using:

Export["graph.net", Graph[Flatten[graphdata]], "Pajek"];

I open the file in Gephi. The problem is that the labels aren’t there by default (as seen above, we’re dealing with numerical representations). To get labels, I do the following:

  1. Export the node and edge spreadsheets using the export function in the Data Laboratory;
  2. Open the file ../derived-data/graph/id.map/part-m-00000, which lists what each number represents;
  3. Paste the list of URLs into the ‘Label’ column of the node spreadsheet;
  4. Re-import them into Gephi using the Data Laboratory, and begin to visualize. It’ll look like this for NODES:
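The copy-and-paste step can be scripted as well. A sketch, with two loud assumptions: that each line of the id.map part file is id<TAB>host (check yours — the column order may differ), and that the Gephi node export has Id and Label columns, as a typical export does.

```python
import csv

def load_id_map(map_path):
    """Load the id.map part file into a dict of id -> host label.
    Assumes each line is 'id<TAB>host'; swap the columns if your
    file has them the other way around."""
    labels = {}
    with open(map_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                labels[int(row[0])] = row[1]
    return labels

def label_nodes_csv(nodes_csv_in, nodes_csv_out, labels):
    """Fill the 'Label' column of a Gephi node spreadsheet export,
    matching each row's Id against the id map."""
    with open(nodes_csv_in, newline="") as fin, \
         open(nodes_csv_out, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["Label"] = labels.get(int(row["Id"]), row.get("Label", ""))
            writer.writerow(row)
```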

[Screenshot: the NODES spreadsheet in Gephi’s Data Laboratory]

And this for EDGES:

[Screenshot: the EDGES spreadsheet in Gephi’s Data Laboratory]

In Gephi, I do the following. I’m not an expert by any means, but I turn to our trusty pre-release copy of the Historian’s Macroscope for help. In the lower right, under ‘Layout,’ I select ‘ForceAtlas 2’ and let it run. It takes a while, so I sometimes raise the scaling up to 200 just to get things going.

[Screenshot: ForceAtlas 2 layout settings in Gephi]

We then need to look for the data we want to find. In this case, I want to find the websites that receive the most inbound links – the highest in-degree. I set up a ranking in the upper left like so:
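In-degree is also trivial to compute outside Gephi if you want to sanity-check the ranking. A minimal sketch over the (source, target) edge list from earlier:

```python
from collections import Counter

def top_indegree(edges, k=10):
    """Count inbound links per target node and return the k most
    linked-to nodes as (node, count) pairs."""
    indeg = Counter(dst for _src, dst in edges)
    return indeg.most_common(k)
```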

[Screenshot: the ranking panel in Gephi]

And then decide to filter using the filters in the right panel, like so:

[Screenshot: the filters panel in Gephi]

With a filter selected, I click on the visualization pane at the bottom of the window (there’s a little triangle that will bring it up) and, on the Labels tab, select “Hide node/edge labels if not filtered.”

Finally, in the ‘Preview’ tab, I select ‘Show labels,’ make them size 12 and proportional to node size, and begin to export. Here are some results – Canadian labour movement WATs in 2009 [download full PDF here]:

[Screenshot: labour movement link graph, December 2009]

Here we see that in December 2009 the central inbound links were to things like: CanadianLabour.ca, the Canadian Labour Congress, the Canadian Union of Public Employees, PolicyAlternatives.ca, as well as Facebook, YouTube, the Flash viewer, and Adobe Acrobat Reader.

By September 2014, the story is different [download full PDF here]:

[Screenshot: labour movement link graph, September 2014]

[Screenshot: detail of the September 2014 graph]

We instead see new hubs: YouTube, Facebook, Twitter, LinkedIn, etc., but also hubs like parliamentary webpages and David Suzuki (environmental concern?). But alas, hunger called and workshops needed to be prepped. I’ll look at this more next week.

