New Article: “An Open-Source Strategy for Documenting Events”

Nick Ruest and I have a piece that’s just come out in Code4Lib Journal. The article takes readers through (a) why Twitter matters for event archiving and future historical research; (b) how you can collect data yourself; and (c) how you can analyze the data. You can read the abstract below, and check out the article here!

As always, we hope you enjoy reading it. If you have any comments or questions, we are always happy to hear from you.

Abstract follows after the fold.

New Article: “The Great WARC Adventure: Using SIPS, AIPS, and DIPS to document SLAPPs”

Nick Ruest, Anna St-Onge, and I have a piece that’s just come out in the open-access journal Digital Studies / Le champ numérique. The deliberately acronym-heavy title introduces an article that takes us through the process of (a) creating a web archive; (b) preserving and providing access to the files; and (c) running some basic analysis on it from the perspective of a historian. While some of the text analysis in the latter part of the article predates more recent warcbase developments, I think it still provides a useful conceptual approach.

You can find the article here, and the abstract below. Hope you enjoy it.

Obama and Twitter: Actually, Mr. President, We think Social Media Will Matter in the Future

Obama tweeting in happier days.

I don’t normally take partisan positions here, especially in the rough and tumble world of American politics. But sometimes a line is crossed, and I cannot stay silent! 😉

Speaking at a journalism event in late March 2016, American President Obama had this to say, according to the Washington Examiner:

Ten, 20, 50 years from now, no one seeking to understand our age is going to be searching the tweets that got the most retweets, or the post that got the most likes … They’ll look for the kind of reporting, the smartest investigative journalism that told our story — lifted up the contradictions in our societies and asked the hard questions and forced people to see the truth even when it was uncomfortable.

I guess all this really shows is that President Obama doesn’t follow us on GitHub or on Twitter, or else he’d know that Nick Ruest and I have been tackling these very questions.

New Article: “Lost in the Infinite Archive”

My latest article, “Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives,” has just been published in the most recent issue of the International Journal of Humanities and Arts Computing. My sincerest thanks to Jennifer Guiliano and Mia Ridge, who edited the special issue on Complex Datasets in which it appeared. You can check out the entire table of contents here if you want to see the other fantastic contributions in the issue!

The beautiful, canonical version of record can be found here on Edinburgh University Press’ page. In accordance with my own personal values and the Tri-Agency Open Access Policy on Publications, you can find the author’s accepted manuscript version here in the University of Waterloo’s institutional repository.

What’s it about? The abstract is below, and if you’re interested, please do read it. I would love to hear your thoughts.

Archives Unleashed 2.0 Call for Participation

A photo of the Library of Congress (taken by yours truly last week)

We haven’t even had Archives Unleashed 1.0 – being held next week at the University of Toronto – so it’s beyond awesome that the second version has just been announced (the project team is Matt Weber, Jimmy Lin, and me). It’ll be held at the Library of Congress in Washington, DC, on 14–15 June.

Check out the Call for Participation here! Many thanks to the National Science Foundation, University of Waterloo, Rutgers University, and the Social Sciences and Humanities Research Council of Canada.

Exploring the GeoCities Web Archive with Warcbase & Spark: Links (or how we can use warcbase to find amazing sites to ask historical questions!)


Not just spaghetti and meatballs, but the starting point for research.

In my last post, we left off with scripts running to extract all URLs and a link diagram. They finished reasonably quickly – about three days on our rho server at York University, or about 30 minutes on our roaringly-fast cluster. Given that we will hopefully be running these only once or twice at first, even three days isn’t too bad.

This is a relatively technology-heavy post, but I think even newcomers to digital history might find it interesting as an entrée into several tools that we might use to extract meaningful historical information from big datasets.

Exploring the GeoCities Web Archive with Warcbase & Spark: Getting Started

Nick Ruest and I had some great news a few weeks ago: a collection of GeoCities WARCs was on its way on a few hard drives. I’ve previously done quite a bit of work on the GeoCities torrent, but as we’ve been doing parallel development on warcbase while working with the torrent, it’s been difficult to have one set of tools talk to our earlier dataset. Once we have all the files in WARC format, as we do now, we can use warcbase to generate derivative datasets.

Everybody should win, in theory, as it helps research into GeoCities, research into warcbase, and research into web archive use more generally.

Step One: Ingesting the Data

Once the hard drives arrived, it was fun to watch the data populate our server as Nick supervised the time-consuming job of moving over 4TB of data from two hard drives onto our server at York University.