As part of the Archives Unleashed hackathon, the Library of Congress graciously provided access to several of their collections. Jimmy Lin and I worked with one of the teams, “The Supremes,” to see if we could generate useful scholarly derivatives from the underlying collections.
The team was called “The Supremes” for an apt reason: we worked with web archival data around the nominations for Justice Alito and Justice Roberts. Both nominations began in 2005, and the collections contain legal blogs, Senatorial discussions, and other content relevant to them.
As it was a datathon with limited time and resources, we used data subsets:
- Alito – 51 GB, 1.8 million records, 1.2 million pages
- Roberts – 41 GB, 1.4 million records, 1.0 million pages
Given the age of these collections, they were in the earlier (now deprecated) ARC format rather than WARC. Even so, we were able to generate results quickly.
After Jimmy spent two hours painstakingly hunting down some malfunctioning ARC files, ARC being the web archival container format (the juicy details on how we’re going to fix that can be found here), the analysis began.
Within five minutes, we had useful scholarly derivatives and were already raising research questions.