As part of the Archives Unleashed hackathon, the Library of Congress graciously provided access to several of their collections. Jimmy Lin and I worked with one of the teams, “The Supremes,” to see if we could generate useful scholarly derivatives from the underlying collections.
The team was called “The Supremes” for an apt reason: we worked with web archival data around the nominations of Justice Alito and Justice Roberts. Both nominations began in 2005, and the collections contained legal blogs, Senate discussions, and other content relevant to those nominations.
As it was a datathon with limited time and resources, we used data subsets:
- Alito – 51 GB, 1.8 million records, 1.2 million pages
- Roberts – 41 GB, 1.4 million records, 1.0 million pages
Given the age of these collections, they were not in WARC format but in the earlier (now deprecated) ARC format. Still, we were able to generate results quickly.
After two hours of Jimmy painstakingly hunting down some malfunctioning ARC files – ARC being the older web archival container format – (the juicy details on how we’re going to fix that can be found here), the analysis began.
Within five minutes, we had useful scholarly derivatives and were already raising research questions.
Link Graphs
.@ianmilligan1 created a link cloud from the Alito Supreme Court nomination web archive. #hackarchives pic.twitter.com/KUUYAYopul
— Andrew Weber (@atweber) June 14, 2016
```scala
import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader, WriteGDF}
import org.warcbase.spark.rdd.RecordRDD._

val links = RecordLoader.loadArchives("/collections/webarchives/scotus/alito/*.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGDF(links, "alito.sitelinks.gdf")
```
This warcbase script, written in Scala, only required us to change a few lines: the loadArchives call (line 4) had to point at the ARC files, and the WriteGDF call (line 12) needed to point where we wanted the output.
That’s it! Suddenly, we have results. You can see all the derivative datasets here. The GDF file format is especially important as it opens up natively in the Gephi network analysis suite. We have instructions on how to fruitfully use Gephi here.
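For readers unfamiliar with GDF: it is a simple plain-text graph format, with a block of node definitions followed by a block of edge definitions, which is why Gephi can read it directly. A minimal sketch of the shape of a sitelinks file (the domains and weight below are invented for illustration, not actual values from the collection):

```
nodedef> name VARCHAR
senate.gov
scotusblog.com
edgedef> node1 VARCHAR, node2 VARCHAR, weight DOUBLE
senate.gov,scotusblog.com,42.0
```

Each edge row here would mean that pages on the first domain linked to the second domain, with the weight recording how often.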
Consider the following Alito visualizations:
[Gephi network visualizations of the Alito link graph]
Targeted textual analysis
We then did some targeted textual analysis. The research team was interested in exploring the text coming out of Senate.gov around both nominations. The following script made that possible:
```scala
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/collections/webarchives/scotus/alito/*.arc.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("http://.*senate.gov/.*".r))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("alito-senate-text-gov")
```
Again, lines 4, 6, and 8 are the most important. Lines 4 and 8 needed to be changed to say where the inputs and outputs were, and on line 6 we customized the URL patterns we were looking for in the data.
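Before launching a full job, it can be worth sanity-checking a URL filter like the one on line 6. A quick sketch of testing the regex locally in a Scala REPL (the URLs below are illustrative, not from the collection):

```scala
// The same pattern string passed to keepUrlPatterns on line 6
val senatePattern = "http://.*senate.gov/.*"

// An invented URL of the kind the filter should keep, and one it should drop
println("http://judiciary.senate.gov/hearing.cfm".matches(senatePattern)) // true
println("http://www.scotusblog.com/post".matches(senatePattern))          // false
```

Note that String.matches requires the whole URL to match the pattern, which is why the pattern is wrapped with .* on both sides of senate.gov.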
For example, changing the same query to run against the Roberts data instead of Alito was relatively simple:
```scala
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/collections/webarchives/scotus/roberts/*.arc.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("http://.*senate.gov/.*".r))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("roberts-senate-text-gov")
```
The results were manageable: 250MB of plain text for each nomination.
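Because saveAsTextFile writes each record as the tuple’s string form, a line in the derivative looks roughly like (crawlDate,domain,url,text). A quick sketch of pulling the fields back apart in Scala – the sample line is invented for illustration, and the approach assumes the first three fields contain no commas:

```scala
// An invented sample line in the shape the derivative uses
val line = "(20050926,senate.gov,http://judiciary.senate.gov/hearing.cfm,Testimony of Judge Samuel Alito)"

// Strip the surrounding parentheses, then split into at most four fields
// so commas inside the page text are left intact
val fields = line.stripPrefix("(").stripSuffix(")").split(",", 4)

println(fields(0)) // crawl date: 20050926
println(fields(1)) // domain: senate.gov
```

From here the text field (fields(3)) can be fed into whatever textual analysis tool a researcher prefers.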
In our final post, I will share the final results from each team to let them explain the cool stuff they’ve found in their own words!