Adapting the Web Archive Analysis Workshop to Longitudinal Gephi: Unify.py

The Problem

This script helps us get longitudinal link analysis working in Gephi

When playing with WAT files, we’ve run into the issue of getting Gephi to play accurately with the shifting node IDs generated by the Web Archive Analysis Workshop. It’s certainly possible – this post demonstrated the outcome when we could get it working – but it’s extremely persnickety. That’s because node IDs change for each time period you’re running the analysis: for example, liberal.ca could be node ID 187 in 2006, but then be remapped to node ID 117 in 2009. It’s best if we just replace the numeric IDs with the host names themselves for graphing purposes.
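To make the ID problem concrete, here is a minimal sketch in Python. Only the liberal.ca IDs (187 in 2006, 117 in 2009) come from the example above; the other hosts (example.org, example.net), the edges, and the helper name to_host_edges are made up for illustration.

```python
# Hypothetical ID maps for two crawl years. Only the liberal.ca IDs are taken
# from the post; example.org and example.net are placeholders.
id_map_2006 = {"187": "liberal.ca", "117": "example.org"}
id_map_2009 = {"117": "liberal.ca", "187": "example.net"}

# Hypothetical edges, stored as (source_id, target_id) pairs.
edges_2006 = [("187", "117")]
edges_2009 = [("117", "187")]


def to_host_edges(id_edges, id_map):
    """Translate (source_id, target_id) pairs into (source_host, target_host)."""
    return [(id_map[src], id_map[dst]) for src, dst in id_edges]


# Merging on raw IDs makes it look like nodes 187 and 117 link to each other,
# even though each ID means a different host in each year.
merged_by_id = set(edges_2006) | set(edges_2009)

# Merging on host names keeps liberal.ca as one node across both years.
merged_by_host = set(to_host_edges(edges_2006, id_map_2006)) | set(
    to_host_edges(edges_2009, id_map_2009)
)

print(sorted(merged_by_id))
print(sorted(merged_by_host))
```

In the host-name version, liberal.ca is the source of both edges, which is what a timeline in Gephi needs to see.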

Let me explain. If you’re trying to generate a graph of host IDs to host IDs, you get two files. They come out as Hadoop output files (part-m-00000, etc.), but I’ll rename them to match the commands used to generate them. Let’s say you have:

[note: this functionality was already in the GitHub repository, but not part of the workshop. There is lots of great stuff in there. As Vinay notes below, there’s a script that does this – found here.]

File A: host-id-2006-canadian-political.tsv

It will look like:

24	{(24),(6047)}
57	{(3831)}
60	{(60),(281),(356),(1931),(2545),(3066),(3068),(3095),(3719),(3818),(5270),(5308),(5309),(5701),(5785),(5847),(6337),(6338),(6339),(6340),(6521),(6536),(6548)}
72	{(3831)}
109	{(109),(110),(2172),(2254),(2778)}
110	{(110),(2172),(2254),(2778)}
126	{(13),(78),(113),(126),(128),(185),(259),(310),(391),(568),(687),(738),(770),(780),(781),(793),(825),(830),(836),(845),(847),(893),(900),(923),(931),(975),(988),(990),(1004),(1073),(1224),(1317),(1319),(1326),(1386),(1409),(1539),(1548),(1551),(1556),(1557),(1558),(1693),(1696),(1735),(1869),(1876),(2180),(2239),(2260),(2261),(2286),(2299),(2348),(2447),(2493),(2515),(2519),(2639),(2701),(2773),(2775),(2805),(2806),(2837),(2874),(2881),(2902),(2929),(3085),(3091),(3121),(3132),(3559),(3561),(3751),(3754),(3802),(4018),(4171),(4192),(4194),(4195),(4198),(4226),(4264),(4270),(4297),(4366),(4371),(4435),(4543),(4544),(4547),(4548),(4553),(4554),(4565),(4567),(4607),(4647),(4751),(4769),(4981),(4992),(5007),(5014),(5017),(5040),(5069),(5119),(5155),(5163),(5207),(5303),(5314),(5344),(5392),(5418),(5553),(5560),(5561),(5562),(5563),(5648),(5649),(5720),(5752),(5772),(5941),(5952),(5999),(6030),(6071),(6141),(6170),(6191),(6214),(6280),(6284),(6310),(6374),(6379),(6440),(6461),(6466),(6475),(6526),(6606),(6608)}

File B: id-map-2006-canadian-political.tsv

110	acf.hhs.gov
111	acfas.ca
112	acfc.org
113	achannel.ca
114	achilles.net
115	acia.uaf.edu
116	acic-caci.org
117	acjnet.org
118	aclc.net
119	aclrc.com
120	aclu.org
121	acme-eau.com
122	acme-eau.org
123	acmo.org
124	acog.org
125	acornorganic.org
126	acp-cpa.ca

You need to combine the two, so we have a long list of edges, with one source host and one target host per line, like

acp-cpa.ca     achannel.ca
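As a rough illustration of that combination step, the sketch below expands one adjacency line into host-name pairs. The input row is a shortened version of the 126 row from File A (only two of its targets are kept), the map holds just two entries from File B, and the function name parse_adjacency is my own, not part of the workshop code.

```python
def parse_adjacency(line):
    """Split one 'source<TAB>{(a),(b),...}' line into (source_id, [target_ids])."""
    source, targets = line.rstrip("\n").split("\t")
    target_ids = [t.strip("()") for t in targets.strip("{}").split(",")]
    return source, target_ids


# Two entries copied from the File B excerpt above.
id_map = {"113": "achannel.ca", "126": "acp-cpa.ca"}

# Shortened version of the 126 row in File A (assumption: truncated for brevity).
source, target_ids = parse_adjacency("126\t{(113),(126)}")

# One (source_host, target_host) pair per target ID.
edges = [(id_map[source], id_map[t]) for t in target_ids]

for src_host, dst_host in edges:
    print(src_host, dst_host, sep="\t")
```

Note that the row includes a self-link (126 points to 126), so acp-cpa.ca shows up as both source and target on one of the lines.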

The Solution

The real solution is to have an amazingly talented doctoral candidate working with you, who can do this in his or her sleep.

Jeremy Wiebe, who’s been working with me over the last few months, wrote this program that will combine File A and File B to produce a Gephi-readable format. By using the unique host names rather than numbers, it makes it much easier to merge multiple datasets and get your timeline slider in Gephi working effortlessly.

It’s unify.py, and you can use it like so:

Usage: unify.py <id-map-file.tsv> <graph-file.tsv>
   OR
       unify.py <id-map-file.tsv> <graph-file.tsv> -o <output-file.tsv>

Or in this case,

unify.py id-map-2006-canadian-political.tsv host-id-2006-canadian-political.tsv -o unified-2006-canadian-political.tsv
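Once each time period has been unified this way, the files can be stacked into one edge list. The sketch below is one possible way to do that, not part of unify.py: the function name merge_unified, the CSV output, and the Year column are all my assumptions. Gephi's spreadsheet import does look for Source and Target columns in an edge table, but turning the Year column into a time interval for the timeline is a separate step that is not shown here.

```python
import csv


def merge_unified(files_by_year, out_path):
    """Concatenate unified edge lists, tagging each edge with its crawl year.

    files_by_year maps a year (e.g. 2006) to the path of a unified TSV file
    with one 'source_host<TAB>target_host' pair per line.
    """
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        # Source/Target are the edge-table headers Gephi's importer expects;
        # Year is an extra column added here for illustration.
        writer.writerow(["Source", "Target", "Year"])
        for year, path in sorted(files_by_year.items()):
            with open(path, newline="") as unified:
                for source, target in csv.reader(unified, delimiter="\t"):
                    writer.writerow([source, target, year])
```

You might call it as merge_unified({2006: "unified-2006-canadian-political.tsv"}, "merged-edges.csv"), adding one entry per year; the output file name here is hypothetical.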

At some point in the near future, I’m hoping to get a good, organized, and presentable GitHub project up with all our scripts and tools. But in the short term, the code is here:

#!/usr/bin/env python3

import csv
import sys

if len(sys.argv) <= 2:
	print("Usage: %s <id-map-file.tsv> <graph-file.tsv>" % sys.argv[0])
	print("   OR")
	print("       %s <id-map-file.tsv> <graph-file.tsv> -o <output-file.tsv>" % sys.argv[0])
	sys.exit(1)

with open(sys.argv[1], newline='') as keyfile, open(sys.argv[2], newline='') as graphfile:
	keyfile = csv.reader(keyfile, delimiter='\t')
	graphfile = csv.reader(graphfile, delimiter='\t')

	# Write to the file given after -o, or to standard output otherwise.
	if (len(sys.argv) == 5) and (sys.argv[3] == '-o'):
		outfile = open(sys.argv[4], 'w', newline='')
	else:
		outfile = sys.stdout
	tsvout = csv.writer(outfile, delimiter='\t')

	# Read values of id-map-file.tsv into dictionary 'id_map' (ID -> host name).
	# Assumes sane data.
	id_map = dict()
	for row in keyfile:
		id_map[row[0]] = row[1]

	# Rewrite contents of graphfile using dictionary values.
	for row in graphfile:
		# Strip the '{}()' wrapping, leaving a comma-separated list of IDs.
		adj_list = row[1].translate(str.maketrans('', '', '{}()'))
		adj_list = adj_list.split(',')
		# Emit one 'source host <TAB> target host' line per edge.
		for node in adj_list:
			tsvout.writerow([id_map[row[0]], id_map[node]])

	# Close the output file if we opened one ourselves.
	if outfile is not sys.stdout:
		outfile.close()

This script will literally save you hours of time bashing your head against Gephi. Good luck out there!
