The Archives Unleashed Project: Warcbase is dead, long live the Toolkit

(x-posted from our Archives Unleashed Medium blog)

by Ian Milligan, Jimmy Lin, and Nick Ruest

We were delighted to announce a few months ago that our project team at the University of Waterloo and York University was awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.

New Article: “Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives”

Our new article!

We have a new article out! Jimmy Lin, Jeremy Wiebe, Alice Zhou, and I have a new piece in the ACM Journal on Computing and Cultural Heritage. This piece, a collaboration between two computer scientists (Jimmy and Alice) and two historians (Jeremy and myself), introduces both Warcbase and our Filter-Analyze-Aggregate-Visualize (FAAV) cycle for working with large-scale web archives.

You can check out the article here in the ACM Digital Library.

Topic Shifts Between Two US Presidential Administrations

By Ziquan Wang, Borui Lin, Ian Milligan, and Jimmy Lin

While Americans are busy enjoying their Fourth of July, we Canadians are digging into data… and indeed, we wanted to share some research recently presented at the Web Archives and Digital Libraries workshop.

Shortly after Donald Trump’s inauguration as President of the United States, eagle-eyed observers noted a crucial difference between his webpage and that of his predecessor, President Obama. Whereas Obama’s information page had listed the three branches of the US government (executive, judicial, and legislative), Trump’s page listed only two.

Examples like this made our research team at the University of Waterloo wonder: could we systematically begin to track the changes in discourses, priorities, topics, and beyond between two US presidential administrations and, moreover, could we do so on a budget? As I’ve argued elsewhere, web archives are of crucial importance for historians seeking to understand any period after 1996. Yet their scale requires us to turn to digital methods. We cannot go page by page through websites; rather, we need tools to extract the information that we need. Could we “distantly read” websites to notice shifts like those observers spotted in the early days of the Trump administration?

Luckily for us, students had just finished taking Jimmy Lin’s (awesome) Big Data Infrastructure course and wanted to exercise their skills. The amazing Ziquan Wang and Borui Lin joined us and set out to explore shifts between two American presidential administrations.

But first, we needed the data…

Grant news: Multidisciplinary project will help historians unlock billions of archived web pages

Some exciting news! Nick Ruest, Jimmy Lin, and I will be leading a three-year project on web archive analysis and community building. From the story:

The University of Waterloo and York University have been awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.

The grant, valued at $610,625, supports Archives Unleashed, a project that will develop web archive search and data analysis tools to enable scholars and librarians to access, share, and investigate recent history since the early days of the World Wide Web. It is additionally supported by generous in-kind and financial contributions from Start Smart Labs, Compute Canada, York University Libraries and the University of Waterloo’s Faculty of Arts.   

You can read more at the full story here.

Warcbase Install Guide for OS X and Linux

Warcbase is great, so we often get lots of questions about how to install it – never straightforward when using a piece of software with multiple dependencies. While we’re hoping to make some significant changes to the process of installing and using Warcbase in the near future, in the short to medium term I wanted to make a slightly simpler guide.

If you’re interested in using Warcbase to analyze your web archives, you might find this PDF helpful. I will have some tutorials available soon to run some scripts on your own data – and generate a standard set of derivatives.

Here are some walkthroughs for a workshop I’m running, in nice, easy-to-use, paste-able HTML: Warcbase Installation and Penn State Warcbase Workshop.

Download the PDF here (4MB download).

New Chapter: “Welcome to the Web: The Online Community of GeoCities during the early years of the World Wide Web”

First page of the article.

Well, I certainly won’t win any awards for “most concise chapter title,” but my latest publication, “Welcome to the Web: The Online Community of GeoCities during the early years of the World Wide Web,” is now available in the open-access publication The Web as History. This book, edited by Niels Brügger and Ralph Schroeder, has been published by UCL Press. They’re an innovative, fully open-access university press. You can download the entire book as a PDF, or purchase paperback or hardback copies if you so desire.

Anyways, please do feel free to read the chapter if it strikes your fancy. Here’s an excerpt from the introduction below the fold: Read more

Web Scraping & Collaborative Digital History at AHA 2017

Another year, another wonderful trip down to the American Historical Association’s annual meeting – this time, down in stormbound yet beautiful Denver, Colorado. I had the pleasure of leading an intermediate session at the Getting Started in Digital History workshop, giving a paper at the conference, and also participating in the Digital Drop-In.

Web Scraping Workshop

I led the web scraping session at the Getting Started in Digital History workshop. I’ve tried to make it as accessible as possible: slides, links, and more are all online. If you’re curious about how to grab data online, what scraping resources there are, and how to work with social media, please do check it out.

All workshop resources can be found here.

Collaborative Digital History Presentation

I gave a paper on collaborative digital history, looking at how our team has accomplished what it has. I began by talking about our research project’s objectives (web archives and historical research), and then offered general thoughts on teamwork and how we’ve achieved success with two of our projects. It was a roundtable discussion, so we had an incredible conversation afterwards.

You can find my slides here.