If you’ve been following me on Twitter, you know that I’ve been playing around with warcbase, part of my overall exploration into web archives. Warcbase is an “open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge.” While I’m not terribly fluent in these platforms – getting better at them is one of the goals of my upcoming junior sabbatical – Jeremy Wiebe, who’s been working with me on this project, is. He wrote a tutorial for me, “Building and Running Warcbase under OS X.”
Using Wiebe’s tutorial, you can ingest data into HBase and do a variety of things with it. For this task, I focused on the last section labelled ‘Pig Integration.’ If you want to ingest WARC files rather than ARC files, you need to call “WarcLoader” instead of “ArcLoader” on line 3. You then set your directory with WARCs on line 6, and your output directory on line 12. See:
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.WarcLoader();
DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();

raw = load '/Users/ianmilligan1/warc-dir' using ArcLoader as
  (url: chararray, date: chararray, mime: chararray, content: bytearray);

a = filter raw by mime == 'text/html';
b = foreach a generate url, FLATTEN(ExtractLinks((chararray) content));

store b into '/Users/ianmilligan1/warcbase-full/warcbase/output2';
On my single-node desktop (admittedly a powerful one), I was able to process 100 GB of WARC files in fifty minutes. This resulted in 17,142,800 links, arranged in a tab-separated format with source URL, target URL, and the anchor text. 17,142,800 links in fifty minutes is pretty efficient! The output is split across files with names like part-m-00001, and so forth.
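As a quick sanity check on that output, here is a minimal Python sketch that walks the part-m-* files and counts the links. It is not part of warcbase; the function names iter_links and count_links are my own, and it assumes each line holds a source URL, a target URL, and optionally the anchor text, separated by tabs.

```python
import glob
import os


def iter_links(output_dir):
    """Yield (source, target, anchor) tuples from every part-m-* file in output_dir."""
    for path in sorted(glob.glob(os.path.join(output_dir, "part-m-*"))):
        # errors="replace" keeps the loop going if a page had a broken encoding.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                # Split on the first two tabs only, so a tab inside the
                # anchor text stays part of the anchor field.
                fields = line.rstrip("\n").split("\t", 2)
                if len(fields) < 2:
                    continue  # skip lines without both a source and a target
                anchor = fields[2] if len(fields) > 2 else ""
                yield fields[0], fields[1], anchor


def count_links(output_dir):
    """Count the link records across all part files."""
    return sum(1 for _ in iter_links(output_dir))
```

Calling count_links on the Pig output directory should give a figure comparable to the link total above, and iter_links streams one record at a time, so the whole output never has to fit in memory.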
I then wanted to isolate just the links involving the .ca top-level domain. I’m not a regex master, but this command worked sufficiently:
grep -E ".+.ca/.+" part-m-* > all-canadian-links.txt
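This simple regex matches any line that contains .ca/ followed by more text. Because the dot is unescaped it stands for any character, so a path such as /africa/ would slip through as well. I’ll probably tweak it to make sure not too much else is slipping through, but for 5PM on a Tuesday it worked. This resulted in a text file that was just links. However, because grep was reading several files at once, it prefixed each line with the name of the file it came from, so it looked like this: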
part-m-00000:http://www.artsalive.ca/collections/posters/search_details.php?id_poster=70&moreinfo&hideinfo&moreinfo&hideinfo&moreinfo&hideinfo&moreinfo&hideinfo&moreinfo&hideinfo&hideinfo&hideinfo&moreinfo&switchlang&lang=fr http://www.adobe.com/shockwave/download/download.cgi?P1_Prod_Version=ShockwaveFlash&promoid=BIOW Adobe
part-m-00000:http://www.artsalive.ca/collections/posters/search_details.php?id_poster=70&moreinfo&hideinfo&moreinfo&hideinfo&moreinfo&hideinfo&moreinfo&hideinfo&moreinfo&hideinfo&hideinfo&hideinfo&moreinfo&switchlang&lang=fr http://artsalive.ca/collections/costumes/search_results.php?lang=en&searchType=co
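A stricter alternative to pattern-matching the whole line is to parse each URL and look only at its hostname. The sketch below is one way to do that in Python, not what I actually ran. The names is_ca_host and keep_link are hypothetical, it assumes it is fed lines straight from the part-m-* files (without a file name prefix), and it assumes a link counts as Canadian if either its source or its target host ends in .ca.

```python
from urllib.parse import urlsplit


def is_ca_host(url):
    """Return True if the URL's hostname ends in the .ca top-level domain."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 literal
    return host.lower().endswith(".ca")


def keep_link(line):
    """Keep a tab-separated link line if its source or target host is under .ca."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return False
    return is_ca_host(fields[0]) or is_ca_host(fields[1])
```

Because only the hostname is tested, a path that merely contains the letters "ca/" no longer counts as a match. Requiring both ends of the link to be under .ca would just mean swapping the "or" for an "and".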
To snip out the file name, this worked:
cut -c 14- all-canadian-links.txt > all-canadian-links-cut.txt
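The cut command works because every prefix (part-m-00000: and so on) is exactly thirteen characters wide. If the file names ever varied in length, matching the prefix by pattern would be safer. Here is a minimal Python sketch of that idea; strip_part_prefix is a hypothetical helper of mine, and it assumes the prefix always has the form part-m-, some digits, and a colon.

```python
import re

# Matches a grep-style file name prefix such as "part-m-00000:" at the
# start of a line, whatever the number of digits.
PART_PREFIX = re.compile(r"^part-m-\d+:")


def strip_part_prefix(line):
    """Remove a leading part-m-NNNNN: prefix, leaving other lines untouched."""
    return PART_PREFIX.sub("", line, count=1)
```

Since the pattern is anchored to the start of the line, the colons inside the URLs themselves are left alone.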
To import into Gephi, I then opened it up using vim (it’s a 250 MB text file) and added source, target, and anchor as a header row at the top of the file, separated by tabs. In Vim, pressing Ctrl+V in insert mode lets you enter the next character literally, such as a tab.
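If opening a file that size in an editor is awkward, the header row can also be added by script. This is a minimal sketch, assuming the same three column names I typed by hand; add_header and its arguments are hypothetical names, and the output goes to a new file rather than overwriting the original.

```python
import shutil


def add_header(src_path, dst_path, columns=("source", "target", "anchor")):
    """Write a tab-separated header row, then copy src_path after it."""
    header = ("\t".join(columns) + "\n").encode("utf-8")
    # Binary mode copies the link data byte for byte, so odd encodings in
    # the anchor text cannot raise a decoding error.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(header)
        shutil.copyfileobj(src, dst)
```

For example, add_header("all-canadian-links-cut.txt", "all-canadian-links-gephi.txt") would produce a copy with the header on the first line; the second file name is only an illustration.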
In Gephi, I imported the file as an edges table, copied the ID field over to the label field, and voila, I was able to explore the links of all .ca domains found within 100GB of WARC files.
Pretty cool stuff, and all doable in under an hour of computational time – and almost no active time. The next step will be to tweak the labels so that we can use this effectively, but right now we’re able to see the major university and economic clusters found within these WARC files.