All the fun of warcbase.. in the cloud!

Part of the problem with warcbase is that you need a decently powerful laptop or desktop to run meaningful analysis on small-to-medium-sized collections (although it does run on a Raspberry Pi). While our team has access to a cluster, that’s not the case for everybody. What if you had a few WARC files and wanted to run some analysis on them with warcbase, without going through our documented yet sometimes challenging installation process? What if you wanted to deploy it on a cloud provider such as AWS? Maybe you have collections in an Amazon S3 bucket?

Note that this is all part of our exploration of ways that could eventually bring warcbase to a wider audience. It’s still command line based, unfortunately, and requires knowledge of the AWS stack. But for technically-inclined people or developers with web archival collections, we’re moving closer to helping you work with your collections.

Accordingly: enter the Warcbase Workshop repository! (more…)

(x-posted from our research group’s page)

The Web Archives for Historical Research Group is happy to announce that we’ve received a $20,000 USD Microsoft Azure Research award. We’ll be using their HDInsight Spark service to run Warcbase jobs on our large collections, including selected Archive-It and Internet Archive data.

One of the bottlenecks in our work has been processing time. Having a dedicated Spark cluster will let us engage in both exploratory research and the sustained processing needed to make our research projects a reality.

Our sincerest thanks to Microsoft Research, as well as the University of Waterloo’s Arts Advancement Office and Principal Gifts team for facilitating the application.

As part of our Web Archives for Longitudinal Knowledge project, we’ve been signing Memorandums of Agreement with Canadian Archive-It partners, ingesting their web archival collections into our Compute Canada system, and generating derivative datasets. An example is the University of Toronto’s Canadian Political Parties collection: a discrete collection on a focused topic, exploring matters of interest to researchers and everyday Canadians. As the size and number of collections begin to creep upwards – we’ve got about 15 TB of data spread over 46 collections (primarily from Alberta, Toronto, and Victoria) – our workflows need to scale to deal with this material.

More importantly, when one’s productivity is impacted by unfolding world events (it’s been a long week!), scripting means that the work still gets done.  (more…)

I’m incredibly honoured to be one of the Program Co-Chairs of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), being held at the University of Toronto this June.

Our Call for Papers is now online, and I’d love to see your submissions. As a digital humanist myself, I’d especially encourage anybody who’s unfamiliar with the ACM/IEEE world of conferences but interested in submitting a paper to reach out to me. I’d love to hear your ideas, thoughts, and beyond.

Looking forward to seeing many of you in June! For the Call for Papers, please read on: (more…)

Niels Brügger and I have sent this out to a few listservs, so I decided to cross-post it here on my blog as well. Do let me know if you have any questions!

The web has now been with us for almost 25 years: new media is simply not that new anymore. It has developed to become an inherent part of our social, cultural, and political lives, and is accordingly leaving behind a detailed documentary record of society and events since the advent of widespread web archiving in 1996. These two key points lie at the heart of our in-preparation SAGE Handbook of Web History: that the history of the web itself needs to be studied, but also that its value as an incomparable historical record needs to be examined as well. Within the last decade, considerable interest in the history of the Web has emerged. However, there is no comprehensive review of the field. Accordingly, our SAGE Handbook of Web History will provide an overview and point to future research directions. (more…)

Our notebook. Click through to find it.

Our project team uses a number of languages: Scala with warcbase, lots of shell commands when manipulating and analyzing textual data (especially social media, as Nick and I wrote about here), and Mathematica when we want to leverage the power and relative simplicity of that language.

William J. Turkel and I have been working a bit on getting WARC files to play with Mathematica. For larger numbers of files, warcbase is still the solution. But for a small collection – say a few WARCs created with – this might be a lighter-weight approach. Indeed, I can see myself doing this if I went out around the web with WebRecorder, grabbed some sites (say public history sites or the like), and wanted to do some analysis on them.

Bill and I developed this together: he cooked up the record-to-association bit (which is really the core of this code), and I worked on processing entire WARCs and generating some basic analysis. It was also fun getting back into Mathematica, after living in Scala and Bash. (more…)

Locations mentioned in North America in the Canadian Political Party archive collected in November 2015.

As part of my McLuhan fellowship, I’ve been laying the groundwork for the work we’ll be doing over the Fall and Winter by generating sets of derivative datasets to run our analyses on. As noted, one of the main goals of the Fellowship is to develop new approaches to comparing the content found within web archives (to see, for example, whether a web archive created by curators differs, and in what respects, from a web archive created by users on a hashtag).

I’m not providing much analysis here because that’s what we’ll be doing over the year, so this is mostly just focused on the data wrangling.

One of the approaches that I’m hoping to use is to compare the named entities present within web archives. So, for example, what locations are discussed in a web archive crawled from a seed list of URLs tweeted by everyday users on a hashtag, versus a web archive crawled from a seed list manually generated by a librarian? (more…)