Archives Unleashed, Part Two: Unlocking Library of Congress Collections with Warcbase

As part of the Archives Unleashed hackathon, the Library of Congress graciously provided access to several of their collections. Jimmy Lin and I worked with one of the teams, “The Supremes,” to see if we could generate useful scholarly derivatives from the underlying collections.

The team was called “The Supremes” for an apt reason: we worked with web archival data around the nominations of Justice Alito and Justice Roberts. Both nominations began in 2005, and the collections contained legal blogs, Senate discussions, and other content relevant to those nominations.

As it was a datathon with limited time and resources, we used data subsets:

  • Alito – 51 GB, 1.8 million records, 1.2 million pages
  • Roberts – 41 GB, 1.4 million records, 1.0 million pages

Given the age of these collections, they were in the earlier (now deprecated) ARC container format rather than WARC. Even so, we were able to generate results quickly.
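
Happily, the older format did not change our workflow: warcbase’s RecordLoader reads both containers through the same call. A minimal sketch, assuming the usual warcbase Spark shell setup (the paths here are illustrative, not the ones we used):

import org.warcbase.spark.matchbox.RecordLoader

// The same entry point reads both the older ARC and the newer WARC
// containers; warcbase dispatches on the file extension.
val arcRecords = RecordLoader.loadArchives("/path/to/collection/*.arc.gz", sc)
val warcRecords = RecordLoader.loadArchives("/path/to/collection/*.warc.gz", sc)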

After two hours of Jimmy painstakingly hunting down some malfunctioning ARC files (the juicy details on how we’re going to fix that can be found here), the analysis began.

Within five minutes, we had useful scholarly derivatives and were already raising research questions.

Link Graphs

import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader, WriteGDF}
import org.warcbase.spark.rdd.RecordRDD._

// Load the ARC files and extract every hyperlink, paired with the
// crawl date of the page it appeared on.
val links = RecordLoader.loadArchives("/collections/webarchives/scotus/alito/*.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  // Reduce each link to a (date, source domain, target domain) triple,
  // stripping any leading "www." from the domains.
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  // Count identical triples and keep only those appearing more than five times.
  .countItems()
  .filter(r => r._2 > 5)

WriteGDF(links, "alito.sitelinks.gdf")

This warcbase script, written in Scala, only requires changing a couple of lines: the loadArchives call has to point at the ARC files, and the WriteGDF call needs to point to where we want the output.

That’s it! Suddenly, we have results. You can see all the derivative datasets here. The GDF file format is especially important, as it opens natively in the Gephi network analysis suite. We have instructions on how to use Gephi fruitfully here.
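
For a sense of what that output contains: GDF is a plain-text format with a node section followed by an edge section, which Gephi reads directly. The sample below is purely illustrative (made-up domains and weights, and warcbase’s exact column declarations may differ):

nodedef>name VARCHAR
senate.gov
nytimes.com
washingtonpost.com
edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE
senate.gov,nytimes.com,42.0
nytimes.com,senate.gov,17.0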

Consider the following Alito visualizations:

The domain-to-domain links found within the Alito collection, 2005

[Screenshot: mousing over senate.gov in the Gephi visualization]

In this example, I mouse over senate.gov to see what it linked to and what linked to it. Note the prominence of the New York Times.

[Screenshot: mousing over the House domain in the Gephi visualization]

Then I mouse over the House domain. Note that in addition to the New York Times, the Washington Post comes into relief. A research question?

Targeted Textual Analysis

We then did some targeted textual analysis. The research team was interested in exploring the text coming out of Senate.gov around both nominations. The following script made that possible:

import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

// Load the ARC files, keep only pages whose URLs match senate.gov,
// strip the HTML, and save the plain text with crawl date, domain, and URL.
RecordLoader.loadArchives("/collections/webarchives/scotus/alito/*.arc.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("http://.*senate.gov/.*".r))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("alito-senate-text-gov")

Again, three calls do the important work: loadArchives and saveAsTextFile needed to be changed to say where the inputs and outputs were, and in keepUrlPatterns we customized the URL pattern we were looking for in the data.
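
Since the pattern is an ordinary Scala regular expression, retargeting it is a one-line change. For instance, to pull House pages instead (an illustrative variation, not something we ran at the datathon):

  // Hypothetical variation: keep House pages rather than Senate ones.
  .keepUrlPatterns(Set("http://.*house.gov/.*".r))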

For example, changing the same query to run against the Roberts data instead of Alito’s was relatively simple:

import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/collections/webarchives/scotus/roberts/*.arc.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("http://.*senate.gov/.*".r))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("roberts-senate-text-gov")

The results were manageable: 250 MB of plain text for each nomination.
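
From there, the plain text is easy to explore with ordinary Spark. A minimal sketch of my own (not something from the datathon; the keyword is an arbitrary illustration, and each saved record sits roughly on one line):

// Count how many extracted senate.gov records mention a keyword.
val senateText = sc.textFile("alito-senate-text-gov")
val hits = senateText.filter(line => line.toLowerCase.contains("filibuster")).count()
println(s"Records mentioning the keyword: $hits")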

In our final post, I will share results from each team and let them explain the cool stuff they’ve found in their own words!
