Web Archives for Historical Research Group Receives Support from Microsoft Research

(x-posted from our research group’s page)

The Web Archives for Historical Research Group is happy to announce that we’ve received a $20,000 USD Microsoft Azure Research award. We’ll be using their HDInsight Spark service to run Warcbase jobs on our large collections, including selected Archive-It and Internet Archive data.

One of the bottlenecks in our work has been processing time, and a dedicated Spark cluster will let us engage in both exploratory research and the sustained processing needed to make our research projects a reality.

Our sincerest thanks to Microsoft Research, as well as the University of Waterloo’s Arts Advancement Office and Principal Gifts team for facilitating the application.

Processing Archival Collections en Masse with Warcbase

As part of our Web Archives for Longitudinal Knowledge project, we’ve been signing Memoranda of Agreement with Canadian Archive-It partners, ingesting their web archival collections into our Compute Canada system, and generating derivative datasets. An example is the University of Toronto’s Canadian Political Parties collection: a discrete collection on a focused topic, exploring matters of interest to researchers and everyday Canadians. As the size and number of collections creep upwards – we’ve got about 15 TB of data spread over 46 collections (primarily from Alberta, Toronto, and Victoria) – our workflows need to scale to deal with this material.
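To give a flavour of what generating one of these derivative datasets looks like, here is a minimal warcbase sketch, run from spark-shell with the warcbase jar loaded. The paths are placeholders rather than our actual collection locations, and sc is the SparkContext that spark-shell provides.

```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Count the domains captured in a collection: a typical derivative
// dataset we generate for each partner collection.
RecordLoader.loadArchives("/path/to/collection/warcs/", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .saveAsTextFile("derivatives/domain-counts/")
```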

More importantly, when one’s productivity is impacted by unfolding world events (it’s been a long week!), scripting means that the work still gets done. Continue reading

ACM/IEEE Joint Conference on Digital Libraries: CFP Now Available

I’m incredibly honoured to be one of the Program Co-Chairs of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), being held at the University of Toronto this June.

Our Call for Papers is now online, and I’d love to see your submissions. As a digital humanist myself, I know the ACM/IEEE world of conferences can be unfamiliar: if you’re interested in submitting a paper but unsure of the fit, please do reach out to me. I’d love to hear your ideas, thoughts, and beyond.

Looking forward to seeing many of you in June! For the Call for Papers, please read on: Continue reading

CFP: SAGE Handbook of Web History

Niels Brügger and I have sent this out to a few listservs, so decided to cross-post this here on my blog as well. Do let me know if you have any questions!

The web has now been with us for almost 25 years: new media is simply not that new anymore. It has developed to become an inherent part of our social, cultural, and political lives, and is accordingly leaving behind a detailed documentary record of society and events since the advent of widespread web archiving in 1996. These two key points lie at the heart of our in-preparation SAGE Handbook of Web History: that the history of the web itself needs to be studied, but also that its value as an incomparable historical record needs to be explored as well. Within the last decade, considerable interest in the history of the web has emerged. However, there is no comprehensive review of the field. Accordingly, our SAGE Handbook of Web History will provide an overview and point to future research directions. Continue reading

Reading WARC Records with Mathematica

Our notebook. Click through to find it.

Our project team uses a number of languages: Scala with warcbase, lots of shell commands when manipulating and analyzing textual data (especially social media, as Nick and I wrote about here), and Mathematica when we want to leverage the power and relative simplicity of that language.

William J. Turkel and I have been working a bit on getting WARC files to play nicely with Mathematica. For larger numbers of files, warcbase is still the solution. But for a small collection – say a few WARCs created with webrecorder.io – this might be a lighter-weight approach. Indeed, I can see myself doing this if I went out around the web with WebRecorder, grabbed some sites (say public history sites or the like), and wanted to do some analysis on them.
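For comparison, the warcbase route to the same record-level view looks roughly like this in Scala. It is an illustrative sketch only (the path is a placeholder), but it shows the shape of the data: each archived page becomes a small key-value structure, much as the notebook turns each record into a Mathematica association.

```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Map each archived page to a small key-value structure, roughly
// what the Mathematica notebook does with associations.
val records = RecordLoader.loadArchives("/path/to/webrecorder-warcs/", sc)
  .keepValidPages()
  .map(r => Map(
    "date" -> r.getCrawlDate,
    "url"  -> r.getUrl,
    "mime" -> r.getMimeType,
    "text" -> RemoveHTML(r.getContentString)))

records.take(3).foreach(println)
```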

Bill and I developed this together: he cooked up the record-to-association bit (which is really the core of this code), and I worked on getting us to be able to process entire WARCs and generate some basic analysis. It was also fun getting back into Mathematica, after living in Scala and Bash. Continue reading

Plotting and Comparing Locations Mentioned in a Web Archive: Warcbase, OpenRefine, and Google Fusion Tables

Locations mentioned in North America in the Canadian Political Party archive collected in November 2015.

As part of my McLuhan fellowship, I’ve been laying the groundwork for the work we’ll be doing over the Fall and Winter by generating sets of derivative datasets to run our analyses on. As noted, one of the main goals of the Fellowship is to develop new approaches to comparing the content found within web archives (to see, for example, whether and in what respects a web archive created by curators differs from a web archive created by users on a hashtag).

I’m not providing much analysis here because that’s what we’ll be doing over the year, so this is mostly just focused on the data wrangling.

One of the approaches that I’m hoping to use is to compare the named entities present within web archives. So, for example, what locations are discussed in a web archive crawled from a seed list of URLs tweeted by everyday users on a hashtag, versus one crawled from a seed list manually generated by a librarian? Continue reading
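For the curious, the entity extraction behind this leans on warcbase’s Stanford NER integration, invoked roughly as below. The classifier and collection paths are placeholders, and the exact method signature may differ across warcbase versions, so treat this as a sketch rather than a recipe.

```scala
import org.warcbase.spark.matchbox.ExtractEntities

// Run Stanford NER (3-class: PERSON, ORGANIZATION, LOCATION) over a
// collection, writing extracted entities out for later wrangling in
// OpenRefine and mapping in Google Fusion Tables.
ExtractEntities.extractFromRecords(
  "/path/to/english.all.3class.distsim.crf.ser.gz", // Stanford NER classifier
  "/path/to/collection/warcs/",                     // input web archive
  "derivatives/ner/",                               // output directory
  sc)
```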

Finding Popular Images within a Web Archive: Exploring GeoCities

I’m back from the great annual Digital Humanities conference in Krakow (and a nice, two-week follow-up vacation), and have returned to the always growing warcbase platform. One of our research assistants, Youngbin Kim, has been working on some image extraction commands, and I was looking forward to putting them to the test.

Finding popular images can be difficult. In the past, we have used filename-based frequency, or the actual images themselves (if hotlinked to each other), but neither has been sufficient. Hotlinking was generally frowned upon in GeoCities, given bandwidth limitations (a hotlinking user was stealing bandwidth from other users). Filenames are also not always descriptive (e.g. 00015b.gif).

One way around this is to work with the unique hash of each image. In the past, I’ve used hashes when calculating the frequency of popular images in the GeoCities Archive Team torrent. The problem with my method was that it didn’t really scale: we want to make sure everything works within a cluster. And now that we have a set of WARCs from the Internet Archive, let’s try to see what we can do with them… Continue reading
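The core idea is toolkit-agnostic: hash the raw bytes of every image and count by hash, so that identical images group together regardless of filename. Below is a minimal Spark/Scala sketch of that idea using warcbase’s record loader. It is an illustration of the technique, not the actual extraction commands Youngbin has been building, and the paths (and the getContentBytes accessor) are assumptions to check against your warcbase version.

```scala
import java.security.MessageDigest

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// MD5 of an image's raw bytes: identical images share a hash,
// no matter what the file was called (00015b.gif and friends).
def md5(bytes: Array[Byte]): String =
  MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString

val popular = RecordLoader.loadArchives("/path/to/geocities/warcs/", sc)
  .keepMimeTypes(Set("image/gif", "image/jpeg", "image/png"))
  .map(r => (md5(r.getContentBytes), r.getUrl))     // (hash, one URL where it appears)
  .groupBy { case (hash, _) => hash }
  .map { case (hash, hits) => (hits.size, hash, hits.head._2) }
  .sortBy(_._1, ascending = false)                  // most frequent images first

popular.take(10).foreach(println)
```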