Exploding ARC Files with Warcbase

224,977 URLs with 254,702 hyperlinks between them (from a random sample of 2008 CommonCrawl data).

With the help of RA extraordinaire Jeremy Wiebe, I’ve been playing with Jimmy Lin’s warcbase tool. It’s:

… an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Jeremy’s been working on documenting it for OS X, which I’m hoping to share soon, but it’s a neat suite of tools for playing with web-archived files. While right now everything works with the older, now-deprecated ARC file format (they note the irony given the name), Jeremy has been able to add support for WARC ingestion as well.

What can we do with it?

Right now, as we ingest material into an HBase database, we can query it with an API as well as access it using OpenWayback. Once you have the material in warcbase, you can do some neat things with it. This example uses some ARC data taken from CommonCrawl’s early scrape in 2008/2009.

Yesterday, I was playing with link and anchor text extraction. Here’s what that means: say you have a post on the website ActiveHistory.ca (http://activehistory.ca/page1) that links to Library and Archives Canada’s English splash page (http://www.bac-lac.gc.ca/eng/Pages/home.aspx). The link looks like this:

we’ve been disheartened by recent cuts to Library and Archives Canada

The actual HTML for this is:

<a href="http://www.bac-lac.gc.ca/eng/Pages/home.aspx">we've been disheartened by recent cuts to Library and Archives Canada</a>

What we can do with our data in warcbase is extract all of the information from this link, the:

source: http://activehistory.ca/page1
target: http://www.bac-lac.gc.ca/eng/Pages/home.aspx
anchor text: we’ve been disheartened by recent cuts to Library and Archives Canada
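Conceptually, the extraction looks something like this quick Python sketch using BeautifulSoup. This is purely illustrative (warcbase does the real work in Java, via the ExtractLinks UDF in the script below):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_links(source_url, html):
    """Yield (source, target, anchor text) for every link on a page."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        yield source_url, a["href"], a.get_text(strip=True)

page = ('<a href="http://www.bac-lac.gc.ca/eng/Pages/home.aspx">'
        "we've been disheartened by recent cuts to Library and Archives Canada</a>")
for source, target, anchor in extract_links("http://activehistory.ca/page1", page):
    print(source, target, anchor, sep="\t")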

In warcbase, the Pig script below extracts exactly this information:

-- register the warcbase jar so Pig can find the loader and UDFs
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();

-- read every record in the ARC directory as (url, date, mime, content)
raw = load '/Users/ianmilligan1/arc-dir' using ArcLoader as (url: chararray, date: chararray, mime: chararray, content: bytearray);

-- keep only the HTML pages
a = filter raw by mime == 'text/html';

-- emit one (source url, target url, anchor text) tuple per extracted link
b = foreach a generate url, FLATTEN(ExtractLinks((chararray) content));

-- write the results out as tab-separated text
store b into '/Users/ianmilligan1/pigout/';
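To run it, save the script and invoke Pig in local mode, e.g. pig -x local extract-links.pig (the script name here is just an example); local mode runs against the local filesystem rather than a Hadoop cluster.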

And we get data like this, in tab-separated value format:

[Screenshot: a sample of the tab-separated output]

That is:

target	source	anchor-text
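Using the ActiveHistory example from above, a single row would look something like this (illustrative, not actual output from this dataset):

http://www.bac-lac.gc.ca/eng/Pages/home.aspx	http://activehistory.ca/page1	we’ve been disheartened by recent cuts to Library and Archives Canada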

We can then load this data into Gephi (remember, if you’re on OS X Mavericks or Yosemite, some more steps might be needed).

The first step is to open the generated file, say part-m-00000, and make two quick changes (the short script after this list automates both):
* on the first line, add ‘source’, then a tab, then ‘target’, then a tab, and then ‘label’
* save the file as ‘data00000.tsv’ or something like that
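If you’d rather not do that by hand, here’s a minimal Python sketch; the paths match the Pig script above, and the output file name is just an example:

# Prepend the header row Gephi's importer expects to the Pig output.
# Paths are from the example above; adjust them to your own setup.
infile = "/Users/ianmilligan1/pigout/part-m-00000"
outfile = "/Users/ianmilligan1/pigout/data00000.tsv"

with open(infile) as src, open(outfile, "w") as dst:
    dst.write("source\ttarget\tlabel\n")  # the header line Gephi will read
    for line in src:
        dst.write(line)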

Open up Gephi, start a new project, click on ‘Import Spreadsheet.’

[Screenshot: Gephi’s Import Spreadsheet dialog]

Select the file in the ‘CSV’ slot, select ‘tab’ as the separator, import it as an ‘Edges table,’ and you’ll see the preview below. If you’re loading lots of links, you’ll need to increase the memory allocated to Gephi from the paltry half-gig it starts with: edit the /Applications/Gephi.app/Contents/Resources/gephi/etc/gephi.conf file and change -J-Xmx512m to something more appropriate, say -J-Xmx5g for 5 GB.

Click ‘Next >’ and ‘Finish.’ Just to get the labels in the label slot, in case you need them, click ‘Copy data to other column’ and copy the ‘ID’ value to ‘Label.’

When you click on the ‘Overview’ tab you’ll see something crazy like this:

[Screenshot: the imported network before layout, a dense cube of nodes]

This is an awesome Borg cube, but not too interesting. Luckily, we can run a layout (you can learn all of this in our forthcoming book, The Historian’s Macroscope). Check ‘ForceAtlas 2’ under Layout, and then hit ‘Run.’ Depending on the size of your corpus and the speed of your computer, this might be a bit jerky. Let it run for a bit and eventually the nodes will separate.

It begins to explode!

Soon you’re looking at individual websites, floating by your screen.

[Screenshot: individual websites separating out during the ForceAtlas 2 layout]

This data isn’t too useful right now, but a forthcoming longitudinal corpus of smaller, more focused websites that I’ll soon be working with will be. Imagine seeing how the structures of certain communities have changed over the last ten years, with labels that are actually legible.

But for the time being, we’ve got a neat workflow that takes an ARC (soon WARC) file, loads it into a database, quickly extracts the links (we’re talking less than a minute for 224,977 URLs), and lets us visualize them.

Finished product. If you wanted to zoom in, you could see all the labels. But really, not so useful due to the unfocused size of this dataset.
