Institutional vs. Twitter Seed Lists for Web Archives

What if this collection was created by #elxn42 Twitter users – how different would it be?

With Nick Ruest and William Turkel, I’ve been exploring the tweets of the 42nd Canadian federal election that used the #elxn42 hashtag. Having also been part of the team that launched during the federal election, I wondered: what would a web archive created using the tweets of users look like compared to a formal collection, curated by a subject librarian? And secondly, how much of it would be in the Wayback Machine more generally?

The Canadian Political Parties and Political Interest Groups collection consists of fifty domains, which you can see here. Our #elxn42 collection contains over a million and a half tweeted URLs, of which 263,708 were unique. How many of those 263,708 URLs could or will end up in the Canadian Political Parties collection?

To find out, we compared the two lists. I created one list of all of the CPP domains (a text file, one entry per line), and one list of all of the tweeted URLs. The useful command grep -wFf elxn42-tweets-urls-uniq.txt > intersections.txt helped compare the two, listing only links in our Twitter collection that contained a domain from the former.
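For readers who would rather do this comparison in Python, here is a minimal sketch of the same idea. It is not the command we ran: it compares each URL's hostname against the domain list, which is a bit stricter than grep's whole-word string match (a domain that only appears in a query string will not count). The function names and the sample domains and URLs are all hypothetical stand-ins for the two text files.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    # Tweeted links sometimes lack a scheme; add one so urlparse finds the host.
    if "://" not in url:
        url = "http://" + url
    return (urlparse(url).hostname or "").lower()


def urls_in_domains(urls, domains):
    """Keep the URLs whose host is a listed domain or one of its subdomains."""
    wanted = {d.strip().lower() for d in domains if d.strip()}
    matches = []
    for url in urls:
        url = url.strip()
        parts = host_of(url).split(".")
        # Walk up the host: www.example.ca -> example.ca -> ca
        if any(".".join(parts[i:]) in wanted for i in range(len(parts))):
            matches.append(url)
    return matches


# Hypothetical sample data, standing in for the two one-entry-per-line files.
sample_domains = ["example-party.ca", "example-group.org"]
sample_urls = [
    "http://www.example-party.ca/platform",
    "https://news.example.com/story?ref=example-party.ca",
    "example-group.org/about",
]

matched = urls_in_domains(sample_urls, sample_domains)
share = 100 * len(matched) / len(sample_urls)
print(len(matched), "of", len(sample_urls), "URLs matched:", f"{share:.1f}%")
```

With real data, you would read each file into a list of lines and pass the two lists to urls_in_domains in place of the samples.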

What did we find?

46,778, or 17.7%, of the URLs tweeted during #elxn42 could have been archived as part of the CPP collection. That means that 82.3% of URLs would not be part of the CPP collection.

What does this mean?

It raises some questions:

  • Should we be using Twitter more as a seed list for event-based archiving, similar to what Ed Summers proposed around, say, the events of Ferguson, MO or during natural disasters?
  • Should we be publicizing this more around, highlighting the gaps that you would find in a more formal web archive like that?
  • What would similar work find?

What about the Wayback Machine more generally?

One pressing research question is, of course, how many URLs end up in the Wayback Machine more generally. It is still early days – we’ll have to check back in later, because many URLs just wouldn’t have been crawled yet – but of those 263,708 URLs, 221,491 were not found using the Wayback Machine Availability API. That’s 83.99% of all URLs tweeted during #elxn42 that you can’t find in the Wayback Machine.
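If you want to try this kind of check yourself, here is a rough sketch in Python. It assumes the public Availability API endpoint at archive.org/wayback/available and the JSON shape it returns as I understand it (an "archived_snapshots" object that holds a "closest" entry when a capture exists, and is empty otherwise). The helper names and the sample payloads are hypothetical, and this is not the script we used.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine Availability API.
API = "https://archive.org/wayback/available"


def availability_url(url):
    """Build the API request URL for one tweeted URL."""
    return API + "?" + urlencode({"url": url})


def is_archived(payload):
    """True if the decoded API response reports an available capture."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def check(url, timeout=30):
    """Ask the API about one URL (needs network access; not called here)."""
    with urlopen(availability_url(url), timeout=timeout) as resp:
        return is_archived(json.load(resp))


def missing_share(not_found, total):
    """Percentage of URLs with no capture, rounded to two decimals."""
    return round(100 * not_found / total, 2)


# Hypothetical payloads in the assumed response shape, so this runs offline.
found = {"archived_snapshots": {"closest": {"available": True, "status": "200"}}}
absent = {"archived_snapshots": {}}
print(is_archived(found), is_archived(absent))

# The figures reported above: 221,491 of 263,708 URLs not found.
print(missing_share(221_491, 263_708))
```

For a list the size of ours, you would loop over the URLs with check, pause between requests to be polite to the service, and keep a count of the ones that come back False.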

Using Warcbase with a Spark Notebook: What it is, and how to set it up

Those who were at the Web Archives 2015 conference and stuck around for my closing keynote saw some glimpses of one project that I’m part of: making warcbase, a powerful platform for hosting and providing analytics on web archives that’s developed by Jimmy Lin of the University of Waterloo, accessible to humanities researchers.

Dynamically exploring a collection of ARC files, seeing what’s in them using Spark Notebook.

Long-time readers know that I’ve been a Mathematica programmer for the last five or six years. I’ve loved the notebook metaphor: a way to mix rich text, code, outputs, and visualizations together into a document. You can explain what you’re doing in rich marked-up text, run code, manipulate outputs, and basically have a data-rich document that helps inform what’s going on.

Those of you using the Jupyter platform (which grew out of the IPython project, which itself continues) have been able to experience this metaphor yourselves.


From the Jupyter website, showing a living document

The idea was to take this richness and let you use Warcbase with it: once you have it up and running, you can use a GUI to run our scripts, rapidly prototype your own scripts (using a smaller subset of data to see if things work), and get a sense of your overall collection contours.

Want to see it for yourself?

  • This walkthrough, “Installing and Running Spark on OS X,” written by undergraduate research assistant extraordinaire Alice Zhou, shows you how to get everything set up.
  • This walkthrough, “Spark on EC2 or Compute Canada,” written by me (and thus far less extraordinary), shows you how to set up warcbase on a vanilla Ubuntu machine you could spin up in Amazon or another service provider (such as Compute Canada here in Canada).

Have fun warcbasing!

Post Firewall: “Mining the Internet Graveyard: Rethinking the Historians’ Toolkit”

An article I published back in 2013 in the Journal of the Canadian Historical Association, entitled “Mining the Internet Graveyard: Rethinking the Historians’ Toolkit,” is now post firewall and accessible online. It grew out of some of my first postdoctoral research and represents my first thoughts and engagement with web archives. It ended up getting quite a bit of traction: it was awarded the Best Article in the Journal of the Canadian Historical Association award by the Canadian Historical Association, and seemed to generate a good bit of discussion judging by where it’s been cited and used in courses.

It’s a bit dated now: the example I use with Canada’s Digital Collections wouldn’t be the one I take now, but for somebody who was really just starting down the programming rabbit hole, I think it’s a good reflection of where I was then. I’m still pretty happy with the first half of the piece, and where I situate web archiving within the historical profession and the digital humanities more generally.

Anyways, the abstract:

“Mining the Internet Graveyard” argues that the advent of massive quantity of born-digital historical sources necessitates a rethinking of the historians’ toolkit. The contours of a third wave of computational history are outlined, a trend marked by ever-increasing amounts of digitized information (especially web based), falling digital storage costs, a move to the cloud, and a corresponding increase in computational power to process these sources. Following this, the article uses a case study of an early born-digital archive at Library and Archives Canada – Canada’s Digital Collections project (CDC) – to bring some of these problems into view. An array of off-the-shelf data analysis solutions, coupled with code written in Mathematica, helps us bring context and retrieve information from a digital collection on a previously inaccessible scale. The article concludes with an illustration of the various computational tools available, as well as a call for greater digital literacy in history curricula and professional development.

One final note: it’s too bad it took so long for this to go open access, but options were limited in Canadian historiography when I was publishing this. The new Tri-Agency Open Access policy means, however, that I now have a very real and legal obligation to make my work accessible: and editors need to play along, or they don’t get to publish federally-funded scholars.

Some Appearances in the Media: CBC The Current, CBC KW and CBC Spark (go public broadcasting!)

I didn’t take a picture this time when I was in studio, so this one from 2014 will have to do.

This morning, I had the pleasure of being on CBC’s flagship radio program, The Current with Anna Maria Tremonti, and was interviewed about “Preserving Digital History is Imperative to Save Cultural History.” You can listen to it yourself here.

I closed with a call for what I think should be the next moonshot for Canadian cultural preservation: let’s do non-print legal deposit web archiving like France, Britain, and Denmark. Given the incredible strides Library and Archives Canada has been making over the last two-plus years with subject-specific web archiving, I think the moment might almost be here.

In August and September, while I was away on a writing sabbatical, our platform of archived political parties and political interest groups received a fair bit of media attention. I was interviewed on CBC Spark about the platform, which was a lot of fun. We also received good write-ups in the CBC and I wrote a short column for them about “5 things I’ve learned from combing an archive of old and deleted political websites.”

Maybe this is as much an advertisement for the importance of public broadcasting to bridging the gap between academia and the public as it is about our adventures in media.

Archives Unleashed CFP Now Out!

Call for Participation: Archives Unleashed: Web Archive Hackathon
Robarts Library, University of Toronto
3-5 March 2016

Travel grants available for graduate students, postdoctoral fellows, and contingent faculty
Applications due 4 December 2015

The World Wide Web has a profound impact on how we research and understand the past. The sheer amount of cultural information that is generated and, crucially, preserved every day in electronic form, presents exciting new opportunities for researchers. Much of this information is captured within web archives.

Two New SSHRC Grants: Web Archive Analysis Insight and Hackathon Connect

I’m really happy to pass along some great news from the Social Sciences and Humanities Research Council of Canada: two grants that we’ve been working on here at the University of Waterloo and nearby universities have been funded. I think they bring together a fantastic team of researchers and I can’t wait to see what emerges! This should really help keep the Web Archives for Historical Research Group going for the next five or six years.

SSHRC Insight Grant on Historical Use of Web Archives

With Nick Ruest (York University) and William J. Turkel (Western University) as incredible co-applicants, and myself as PI, we’ve received a five-year SSHRC Insight Grant for “A Longitudinal Analysis of the Canadian World Wide Web as a Historical Resource.” Totalling $257,541, it will complement the Ontario Early Researcher Award held at UW and is aimed towards similar project outcomes. If you want to learn more about the project, this early Ontario ERA press release did a good job of distilling some of what we do (or, really, just check out my blog).

Funding from this grant has already gone towards supporting, as well as graduate students employed at our various universities. We’ve got some great things in the pipeline, beginning with leveraging Canadian Archive-It collections, and hopefully ramping up towards much larger perspectives on Canadian social history.

This grant began in June 2015, but is only being announced now.

SSHRC Connection Grant on Hackathon: Stay Tuned!

This grant brings together a great team: Matthew Weber (Rutgers University), Jimmy Lin (University of Waterloo), Nathalie Casemajor (University of Québec in Outaouais), Nicholas Worby (University of Toronto), and myself, to host a web archives hackathon at the University of Toronto, March 3rd – 5th 2016. The SSHRC Connection program, which helps mobilize academic research, has awarded us $23,715 to make the event possible. We also have generous in-kind and cash funding from the University of Toronto, University of Waterloo, Rutgers University, the University of Québec in Outaouais, Library and Archives Canada, the Internet Archive, and Compute Canada.

Plans are still being finalized, but a community call will be out shortly with details on the scope, the availability of travel grants, and other exciting logistical things.

If you do computational work with web archives, stay tuned for more info – but do save the date.

Press Release: Digital archive of political parties digs deep for Election 2015

(X-posted from the University of Waterloo’s Media Relations page)

If you ever suspected Canadian politicians flip-flopped on a specific issue, or wondered where they stand on another, a new online tool will help you easily find out for sure.

Professor Ian Milligan at the University of Waterloo is charting the content of millions of archived political web pages spanning the last decade, allowing the public to compare what Canadian political leaders and pundits said in the past with what they say now. The tool pulls from collections that the University of Toronto Library has been collecting for a decade. Professor Milligan and his research team at Waterloo, as well as project collaborators from York University and Western University, made the data searchable and accessible, drawing on code that staff at the British Library developed.

“We’ve got access to a collection of 50 archived websites from political parties and interest groups, allowing you to search them back to 2005,” said Milligan, a professor in the Department of History at Waterloo. “It means, for example, that anyone can find out what parties and groups said about climate change or free trade in the 2008 or 2011 election, or at any point between elections.”