The Archives Unleashed Project: Warcbase is dead, long live the Toolkit

(x-posted from our Archives Unleashed Medium blog)

by Ian Milligan, Jimmy Lin, and Nick Ruest

We were delighted to announce a few months ago that our project team at the University of Waterloo and York University was awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.

Since that announcement, we’ve been busy on a few different fronts: modernizing and updating our Warcbase web archiving analytics platform, working on a discovery interface and its underlying infrastructure, and laying the administrative groundwork for the project itself. We’ll write a bit more next week about the front-end, but for now we wanted to announce the next version of our web archiving toolkit.

Warcbase is dead…long live the Archives Unleashed Toolkit

We’ve been busy working away on a 1.0 release of the Archives Unleashed Toolkit, or AUT. AUT grows out of the analytics functions of Warcbase, which has now officially been deprecated. We’ve left behind the Apache HBase and Wayback functionality, focusing instead on the Apache Spark-based open-source platform for analyzing web archives. With HBase left behind, Warcbase no longer made sense as a name. It really is a toolkit to open up web archives for scholars, hence the “Archives Unleashed Toolkit.”

If you want to take AUT for a spin, you can download the 0.9.0 release jar, set up Apache Spark locally, and work through the tutorial we have available here. If you haven’t set up Apache Spark before, we have a helpful “Getting Started” guide. The jar means that you don’t have to build it yourself!

What does the technical roadmap look like for AUT moving forward? The 0.9.0 release focused on codebase clean-up (Javadocs too!) and getting the project set up on Sonatype. The next release will move the project to Apache Spark 2.0, which will allow us to move to Spark SQL and DataFrames. Also on the roadmap is PySpark support.

If you’re interested in reading about the history of Warcbase and how it was used to explore collections, feel free to check out this article:

Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. ACM Journal on Computing and Cultural Heritage, 10(4), Article 22, 2017.

Stay Tuned…

We’ll be back next week to talk about Warclight, our Project Blacklight-based discovery interface for web archives. See you again soon!
