Processing Archival Collections en Masse with Warcbase

As part of our Web Archives for Longitudinal Knowledge project, we’ve been signing Memorandums of Agreement with Canadian Archive-It partners, ingesting their web archival collections into our Compute Canada system, and generating derivative datasets. An example of these collections is the University of Toronto’s Canadian Political Parties collection: a discrete collection on a focused topic, exploring matters of interest to researchers and everyday Canadians. As the size and number of collections begin to creep upwards – we’ve got about 15 TB of data spread over 46 collections (primarily from Alberta, Toronto, and Victoria) – our workflows need to scale to deal with this material.

More importantly, when one’s productivity is impacted by unfolding world events (it’s been a long week!), scripting means that the work still gets done.

ACM/IEEE Joint Conference on Digital Libraries: CFP Now Available

I’m incredibly honoured to be one of the Program Co-Chairs of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), being held at the University of Toronto this June.

Our Call for Papers is now online, and I’d love to see your submissions. As a digital humanist myself, I know the ACM/IEEE world of conferences can be unfamiliar: if that describes you and you are interested in submitting a paper, please do reach out to me. I’d love to hear your ideas and thoughts.

Looking forward to seeing many of you in June! For the Call for Papers, please read on.

CFP: SAGE Handbook of Web History

Niels Brügger and I have sent this out to a few listservs, so I decided to cross-post it here on my blog as well. Do let me know if you have any questions!

The web has now been with us for almost 25 years: new media is simply not that new anymore. It has developed to become an inherent part of our social, cultural, and political lives, and is accordingly leaving behind a detailed documentary record of society and events since the advent of widespread web archiving in 1996. These two key points lie at the heart of our in-preparation SAGE Handbook of Web History: that the history of the web itself needs to be studied, but also that its value as an incomparable historical record needs to be examined. Within the last decade, considerable interest in the history of the Web has emerged. However, there is no comprehensive review of the field. Accordingly, our SAGE Handbook of Web History will provide an overview and point to future research directions.

Reading WARC Records with Mathematica

Our notebook. Click through to find it.

Our project team uses a number of languages: Scala with warcbase, lots of shell commands when manipulating and analyzing textual data (especially social media, as Nick and I wrote about here), and Mathematica when we want to leverage the power and relative simplicity of that language.

William J. Turkel and I have been working a bit on getting WARC files to play with Mathematica. For larger numbers of files, warcbase is still the solution. But for a small collection – say a few WARCs created with WebRecorder – this might be a lighter-weight approach. Indeed, I can see myself doing this if I went out around the web with WebRecorder, grabbed some sites (say public history sites or the like), and wanted to do some analysis on them.

Bill and I developed this together: he cooked up the record to association bit (which is really the core of this code), and I worked on getting us to be able to process entire WARCs and generate some basic analysis. It was also fun getting back into Mathematica, after living in Scala and Bash.
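To give a flavour of what a "record to association" step involves, here is a minimal sketch in Python rather than Mathematica. It is not the code from our notebook: the function name `parse_warc_record` is hypothetical, and it assumes a single uncompressed WARC record held in memory with well-formed headers. A Python dictionary plays the role of a Mathematica association.

```python
def parse_warc_record(raw: bytes) -> dict:
    """Turn one uncompressed WARC record into a dictionary.

    Assumes (for illustration only) that `raw` holds exactly one record:
    a version line, header lines, a blank line, then the content block.
    """
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as URLs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length says how many bytes of content follow the headers.
    length = int(headers.get("Content-Length", 0))
    return {"version": version, "headers": headers, "content": rest[:length]}


# A tiny hand-written record, invented for this example.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
record = parse_warc_record(sample)
```

Real WARC files are usually gzipped and hold many records, so this only illustrates the shape of the per-record step.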

Plotting and Comparing Locations Mentioned in a Web Archive: Warcbase, OpenRefine, and Google Fusion Tables

Locations mentioned in North America in the Canadian Political Party archive collected in November 2015.

As part of my McLuhan fellowship, I’ve been laying the groundwork for the work we’ll be doing over the Fall and Winter by generating sets of derivative datasets to run our analyses on. As noted, one of the main goals of the Fellowship is to develop new approaches to comparing the content found within web archives (to see, for example, whether a web archive created by curators differs, and in what respects, from a web archive created by users on a hashtag).

I’m not providing much analysis here because that’s what we’ll be doing over the year, so this is mostly just focused on the data wrangling.

One of the approaches that I’m hoping to use is to compare the named entities present within web archives. So, for example, what locations are discussed in a web archive crawled from a seed list of URLs tweeted by everyday users on a hashtag versus the web archive crawled by a seed list manually generated by a librarian?
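As a rough illustration of that comparison, here is a minimal Python sketch. It assumes the named-entity extraction has already happened and that each archive has been reduced to a plain list of location names. The function name `compare_locations` and the sample place names are invented for this example; they are not real results.

```python
from collections import Counter


def compare_locations(archive_a, archive_b):
    """Compare the locations mentioned in two archives.

    Each argument is a list of location names, one entry per mention
    (an assumed input shape, for illustration only).
    """
    counts_a, counts_b = Counter(archive_a), Counter(archive_b)
    places_a, places_b = set(counts_a), set(counts_b)
    return {
        "shared": sorted(places_a & places_b),
        "only_a": sorted(places_a - places_b),
        "only_b": sorted(places_b - places_a),
        "counts_a": counts_a,
        "counts_b": counts_b,
    }


# Invented sample mentions, not drawn from any real collection.
hashtag_archive = ["Toronto", "Ottawa", "Toronto", "Calgary"]
curated_archive = ["Ottawa", "Victoria", "Ottawa"]
result = compare_locations(hashtag_archive, curated_archive)
```

The shared and unique lists give a first look at where the two collections overlap, and the per-archive counts can then be plotted or ranked.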

Finding Popular Images within a Web Archive: Exploring GeoCities

I’m back from the great annual Digital Humanities conference in Krakow (and a nice, two-week follow-up vacation), and have returned to the ever-growing warcbase platform. One of our research assistants, Youngbin Kim, has been working on some image extraction commands and I was looking forward to putting them to the test.

Finding popular images can be difficult. In the past, we have used filename-based frequency, or have used the actual images themselves (if hotlinked to each other), but that hasn’t been sufficient. Hotlinking was generally frowned upon in GeoCities, given bandwidth limitations (a hotlinking user was stealing bandwidth from other users). Filenames are also not always descriptive (e.g. 00015b.gif).

One way around this is to play with the unique hash of each image. In the past, I’ve used hashes when calculating the frequency of popular images in the GeoCities Archive Team torrent. The problem with my method was that it didn’t really scale: we want to make sure everything works within a cluster. And now that we have a set of WARCs from the Internet Archive, let’s try to see what we can do with them…
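The hashing idea can be sketched in a few lines of Python. This is a single-machine illustration only, not the warcbase command and not something that scales to a cluster. The function name `image_hash_counts`, the choice of MD5, and the (url, bytes) input shape are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def image_hash_counts(images):
    """Rank distinct images by how often their exact bytes appear.

    `images` is an iterable of (url, raw_bytes) pairs; this shape is an
    assumption for illustration, not warcbase's actual output format.
    """
    counts = Counter()
    example_url = {}
    for url, data in images:
        # Identical bytes give identical digests, whatever the filename.
        digest = hashlib.md5(data).hexdigest()
        counts[digest] += 1
        # Remember the first URL seen so a popular hash can be inspected later.
        example_url.setdefault(digest, url)
    return [(digest, n, example_url[digest]) for digest, n in counts.most_common()]


# Invented sample: two copies of one image under different names,
# plus a different image that reuses an uninformative filename.
sample = [
    ("http://example.com/a/00015b.gif", b"GIF89a-star"),
    ("http://example.com/b/star.gif", b"GIF89a-star"),
    ("http://example.com/c/00015b.gif", b"GIF89a-flag"),
]
ranked = image_hash_counts(sample)
```

Because the count is keyed on content rather than name, the two copies of the first image are grouped together even though their filenames differ, while the reused filename 00015b.gif does not cause a false match.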

Investigating Curatorial Models as the Marshall McLuhan Centenary Fellow in Digital Sustainability

Most importantly, I’ll be getting to work in my favourite building in North America!

Great news! Starting on July 1st, I’m the inaugural Marshall McLuhan Centenary Fellow in Digital Sustainability, held at the University of Toronto’s Digital Curation Institute, which is housed in their Faculty of Information. The DCI is led by Christoph Becker, who I’m really looking forward to working with more over the next 12 months (as well as his great graduate students).

What does this mean? Basically, over the next year I’ll be hosting the following public events in Toronto. These will primarily take place in the January – May timeframe, and I will be in Toronto roughly once a week during this period. It is also an excuse to be physically proximate to great collaborators: folks at the DCI, Toronto libraries (especially Nich Worby, who I’ve worked with quite a bit), and York (where my frequent collaborator Nick Ruest is based).

  • Workshops: I’ll run a web archiving analysis workshop in Toronto, probably focusing on the warcbase platform – perhaps riding the coattails of the great virtual machine and repository that Nick Ruest developed. I would also like to run a workshop on Twitter archiving and analysis.
  • Give an Invited Lecture: I’ll be giving a Coach House Institute lecture on the findings of the Fellowship research project, discussed below.
  • Organize a Marquee Event: I’d like to help the DCI with bringing in a high-profile invited speaker to discuss web archiving. Maybe I can score some free canapés.

Most importantly, I’ll be carrying out a research project on qualitative comparisons of web archival content, specifically the kinds of content curated using a social media approach versus a manually-curated professional one.