Processing Archival Collections en Masse with Warcbase

As part of our Web Archives for Longitudinal Knowledge project, we’ve been signing Memorandums of Agreement with Canadian Archive-It partners, ingesting their web archival collections into our Compute Canada system, and generating derivative datasets. An example of these collections is the University of Toronto’s Canadian Political Parties collection: a discrete collection on a focused topic, exploring matters of interest to researchers and everyday Canadians. As the size and number of collections creep upwards – we’ve got about 15 TB of data spread over 46 collections (primarily from Alberta, Toronto, and Victoria) – our workflows need to scale to deal with this material.

More importantly, when one’s productivity is impacted by unfolding world events (it’s been a long week!), scripting means that the work still gets done. 

For each archival collection, we do the following:

  • Generate a domain count so we can generate a host visualization (see the “Domain visualization” figure). You can find a list of collections here (we need to update it again soon).

  • Extract a dump of every URL in the collection, for further research work (when comparing collections this can be useful, especially when domain information is misleading).
  • Extract a link graph of every collection, so we can generate PageRank and explore network visualizations. You can see an example of this here.
  • Write this link graph to GDF format so that it can be opened by network analysis software such as Gephi.
  • Extract the plain text of a collection, removing boilerplate components, for further textual analysis and collection comparison.
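Once a domain-count derivative is on disk, plain shell is enough for a first look at the top hosts. A minimal sketch, assuming the counts have already been flattened into a simple `domain count` layout (the file name and figures below are invented for illustration):

```shell
# Toy domain-count file in an assumed "domain count" layout
cat > urls-sample.txt <<'EOF'
liberal.ca 1543
greenparty.ca 310
conservative.ca 1201
ndp.ca 877
EOF

# Sort by the count column, descending, to surface the top hosts
sort -k2,2 -nr urls-sample.txt | head -3
```

This kind of one-liner is often enough to sanity-check a crawl before opening anything heavier.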

While we originally did this manually, it was becoming difficult. Accordingly, I’ve scripted this work so that we can take a new collection, type one command, and generate all these derivatives. It’s a bit boutique, but I think it shows how warcbase can effectively scale to generate meaningful derivatives.

At the core of our workflow lies a Scala template, largely adapted from our warcbase documentation.

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}
RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlMonth, r.getUrl)).saveAsTextFile("/data/derivatives/fullurls/${COLLECTION}")
RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))).countItems().saveAsTextFile("/data/derivatives/urls/${COLLECTION}")
RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5).saveAsTextFile("/data/derivatives/links/${COLLECTION}")
val ${COLLECTION}gephi = RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGDF(${COLLECTION}gephi, "/data/derivatives/gephi/${COLLECTION}.gdf")
RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, ExtractBoilerpipeText(r.getContentString))).saveAsTextFile("/data/derivatives/text/${COLLECTION}")

It’s a bit messy, mostly because each step is compacted down to a single statement per line. But you’ll see that each line is a script adapted from the warcbase docs. The first three lines load libraries; line four generates the full URLs, five the domain counts, six the link graph, seven the Gephi-bound link graph, eight the GDF export, and nine the full text.
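One wrinkle worth noting: saveAsTextFile writes each record via the tuple’s toString, so a derivative like the domain counts comes out as lines such as ((200805,liberal.ca),1543). A quick sed pass flattens those into plain CSV (the sample lines below are invented for illustration):

```shell
# Sample lines in Spark's Tuple2.toString form, as saveAsTextFile emits them
cat > part-sample <<'EOF'
((200805,liberal.ca),1543)
((200805,ndp.ca),877)
EOF

# Strip the parentheses, leaving month,domain,count
sed 's/[()]//g' part-sample
```

Note this simple pass assumes no parentheses appear inside the values themselves; URLs with parentheses would need more careful parsing.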

It receives data from the ${COLLECTION} variable, which is passed to it from this bash script.

#!/usr/bin/env bash
# pass the collection name as the first argument, e.g. WAHR_ymmfire

# create the new scala script to run
sed -e "s/\${COLLECTION}/$1/g" template.scala > $1.scala

# execute in Spark Shell
/home/ubuntu/project/spark-1.6.1-bin-hadoop2.6/bin/spark-shell -i /home/ubuntu/production/$1.scala --driver-memory 55G --jars /home/ubuntu/project/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar | tee /home/ubuntu/production/$1.log

# combine part files, move to directory
cat /data/derivatives/fullurls/$1/part* > $1-fullurls.txt
mv $1-fullurls.txt /data/derivatives/fullurls

cat /data/derivatives/urls/$1/part* > $1-urls.txt
mv $1-urls.txt /data/derivatives/urls

cat /data/derivatives/links/$1/part* > $1-links.txt
mv $1-links.txt /data/derivatives/links

cat /data/derivatives/text/$1/part* > $1-text.txt
mv $1-text.txt /data/derivatives/text

Most of this is combining the part files, apart from the execution step. You can see that it basically spins up a spark-shell with our memory settings and the warcbase fat jar, and tees the output to a log file so we keep a record of each run.
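The templating step is easy to verify in isolation. This sketch substitutes a collection name into a one-line stand-in for template.scala (the real template is the Scala block above), using the same sed invocation as the runner script:

```shell
# One-line stand-in for template.scala, with the literal ${COLLECTION} placeholder
printf 'val path = "/data/${COLLECTION}/*.gz"\n' > template.scala

# The same substitution the runner performs on its first argument
sed -e "s/\${COLLECTION}/WAHR_ymmfire/g" template.scala > WAHR_ymmfire.scala
cat WAHR_ymmfire.scala
```

The single quotes around the printf string matter: they keep the shell from expanding ${COLLECTION} before it ever reaches the file.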

At the end of the day, it allows us to do this. We receive a new collection and save the WARCs in a sub-directory of our /data drive with a standardized name (i.e. INSTITUTION_collection, such as TORONTO_Canadian_Political_Parties). We then just have to run the script with TORONTO_Canadian_Political_Parties as its argument.

And our data is there. I hope this shows how we’ve been able to use warcbase to speed things up.
