… an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.
Jeremy’s been working on documenting it for OS X, which I’m hoping to share soon; it’s a neat suite of tools for playing with web-archived files. While right now everything works with the older, now-deprecated ARC file format (they note the irony given the name), Jeremy has been able to add support for WARC ingestion as well.
What can we do with it?
Right now, as we ingest material into an HBase database, we can query it with an API as well as access it using OpenWayback. Once you have the material in warcbase, you can do some neat things with it. This example uses some ARC data taken from CommonCrawl’s early scrape in 2008/2009.
Yesterday, I was playing with link and anchor text extraction. Here’s what that does: say you have a post on the website ActiveHistory.ca (http://activehistory.ca/page1) that links to Library and Archives Canada’s English splash page (http://www.bac-lac.gc.ca/eng/Pages/home.aspx). The link looks like this:
The actual HTML for this is:
<a href="http://www.bac-lac.gc.ca/eng/Pages/home.aspx">we've been disheartened by recent cuts to Library and Archives Canada</a>
What we can do with our data in warcbase is extract all of the information from this link: the source URL, the target URL, and the anchor text.
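In warcbase itself this extraction is done by a Java Pig UDF (ExtractLinks, used in the script below). Purely for illustration, here’s a rough stdlib-Python sketch of the same transformation, run on the example link above:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects (target URL, anchor text) pairs from <a href="..."> tags."""
    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, anchor text) pairs
        self._href = None  # href of the <a> we are currently inside
        self._text = []    # anchor-text fragments for that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

html = ('<a href="http://www.bac-lac.gc.ca/eng/Pages/home.aspx">'
        "we've been disheartened by recent cuts to Library and Archives Canada</a>")
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

This is not how warcbase does it internally; it just shows the shape of the data each link yields.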
The Pig script below does exactly that:
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();

raw = load '/Users/ianmilligan1/arc-dir' using ArcLoader as (url: chararray, date: chararray, mime: chararray, content: bytearray);
a = filter raw by mime == 'text/html';
b = foreach a generate url, FLATTEN(ExtractLinks((chararray) content));

store b into '/Users/ianmilligan1/pigout/';
And we get data like this, in tab-separated value format, one link per row:

source	target	anchor-text
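Once Pig has written its output, the part files are just plain TSV, so they’re easy to inspect with a few lines of Python. A minimal sketch, with a made-up sample row standing in for real part-file data:

```python
import csv
import io

# Hypothetical sample row; a real part file would have many lines like this.
sample = io.StringIO(
    "http://activehistory.ca/page1\t"
    "http://www.bac-lac.gc.ca/eng/Pages/home.aspx\t"
    "we've been disheartened by recent cuts to Library and Archives Canada\n"
)

# Each row is (source URL, target URL, anchor text).
edges = list(csv.reader(sample, delimiter="\t"))
for source, target, anchor in edges:
    print(source, "->", target)
```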
We can then load this data into Gephi (remember, if you’re on OS X Mavericks or Yosemite, some more steps might be needed).
The first step is to open the generated file, say part-m-00000, and make two quick changes:
* on the first line, add ‘source’, then a tab, then ‘target’, then a tab, then ‘label’
* save the file as ‘data00000.tsv’ or something like that
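Those two changes can also be scripted. A quick sketch (the paths are placeholders; point in_path at your actual Pig part file, and here a stand-in file is created first so the example is self-contained):

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "part-m-00000")    # hypothetical Pig output
out_path = os.path.join(workdir, "data00000.tsv")  # file Gephi will import

# Stand-in for a real Pig part file: one tab-separated link per line.
with open(in_path, "w", encoding="utf-8") as f:
    f.write("http://activehistory.ca/page1\t"
            "http://www.bac-lac.gc.ca/eng/Pages/home.aspx\t"
            "anchor text\n")

# Prepend the header line Gephi expects, copy the rest unchanged.
with open(in_path, encoding="utf-8") as src, \
     open(out_path, "w", encoding="utf-8") as dst:
    dst.write("source\ttarget\tlabel\n")
    for line in src:
        dst.write(line)
```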
Open up Gephi, start a new project, click on ‘Import Spreadsheet.’
Select the file in the ‘CSV’ slot, select ‘tab’ as the separator, and import it as an ‘Edges table’; you’ll see the preview below. If you’re loading in lots of links, you’ll need to increase the memory allocated to Gephi from the paltry half-gigabyte it begins with: edit the /applications/gephi.app/contents/resources/gephi/etc/gephi.conf file and change the
-J-Xmx512m to something more appropriate – say
-J-Xmx5g for 5GB.
Click ‘Next >’ and ‘Finish.’ Just to get the labels in the label slot, in case you need them, click ‘Copy data to other column’ and copy the ‘ID’ value to ‘Label.’
When you click on the ‘Overview’ tab you’ll see something crazy like this:
This is an awesome Borg cube, but not too interesting. Luckily, we can run a layout (thanks to our forthcoming book on the Historian’s Macroscope, you can learn all this too). Check ‘ForceAtlas 2’ under Layout, and then hit ‘Run.’ Depending on the size of your corpus and the speed of your computer, this might be a bit jerky. Let it run for a bit and eventually the nodes will separate.
Soon you’re looking at individual websites, floating by your screen.
This data isn’t too useful right now, but a forthcoming longitudinal corpus of much smaller, more focused websites that I’ll soon be working with will be really useful. Imagine seeing how the link structures of certain groups have changed over the last ten years, with labels you can actually read.
But for the time being, we’ve got a neat workflow that takes an ARC (soon WARC) file, loads it into a database, quickly extracts the links (we’re talking less than a minute for 224,977 URLs), and lets us visualize them.