“Understanding Computational Web Archives Research Methods Using Research Objects”, IEEE Big Data Computational Archival Science Paper

At the same time as I was in Denmark presenting at the National Webs workshop, my co-authors and University of Toronto collaborators Emily Maemura and Christoph Becker were presenting a paper we wrote at the IEEE Big Data Computational Archival Science Workshop. This paper is the first fruit of the McLuhan Fellowship I’m doing at the University of Toronto, and it introduces a model for using Research Objects with web archival research.

The abstract is below, and you can find the full pre-print here.

Use of computational methods for exploration and analysis of web archives sources is emerging in new disciplines such as digital humanities. This raises urgent questions about how such research projects process web archival material using computational methods to construct their findings. This paper aims to enable web archives scholars to document their practices systematically to improve the transparency of their methods. We adopt the Research Object framework to characterize three case studies that use computational methods to analyze web archives within digital history research. We then discuss how the framework can support the characterization of research methods and serve as a basis for discussions of methods and issues such as reuse and provenance. The results suggest that the framework provides an effective conceptual perspective to describe and analyze the computational methods used in web archive research on a high level and make transparent the choices made in the process. The documentation of the research process contributes to a better understanding of the findings and their provenance, and the possible reuse of data, methods, and workflows.

“Studying the Web in the Shadow of Uncle Sam”: Canada at the National Webs Workshop

I had the opportunity to present “Studying the Web in the Shadow of Uncle Sam,” a paper that Tom Smyth (Library and Archives Canada) and I proposed to the National Webs workshop that Niels Brügger and Ditte Laursen hosted in Aarhus, Denmark. Our paper abstract is below, as are the slides that I presented.

Hopefully more cool things can be announced soon (and the paper should appear in an edited collection, hopefully in early 2018 – our drafts go in for March).

What is the Canadian Web? While Canada does have the .ca top-level domain, this does not capture Canada: while universities, governmental institutions, and some companies use the .ca TLD, many other corporations, small businesses, bloggers, and others generally gravitate towards .com, .org, or .net (a question we will briefly explore in our paper). In short, the .ca domain is a relatively niche player. Analyses using just the top-level domain would be skewed towards certain forms of content providers. This question presents a considerable challenge for national libraries and researchers working in a national perspective on an inherently global network.

We will approach this question in three ways within this paper. First, we present the state of the Canadian Web. Drawing on initial work by Library and Archives Canada and twenty-five Archive-It partners in Canada, we discuss what it means to study the Canadian Web. Second, we explore what work has been done to date: how Library and Archives Canada and Canadian partners have embraced the challenge and what they have been collecting. How do librarians and archivists select their seeds in this context, and does it approach a national web? This collection development strategy is an interim one, beginning to lay the foundation for greater capacity for domain crawling. “Thematic web collections” steward parts of the Canadian Web, with a recognition of the stopgap nature of things. Finally, we use the piece to show various paths forward towards a domain crawl of the “Canadian Web,” highlighting the Web Archives for Longitudinal Knowledge (WALK) project that is beginning to integrate disparate web archives across the country.

Click on the first slide and you can view it as a slideshow!

Quickly Deploying Warcbase with Amazon Web Services

All the fun of warcbase.. in the cloud!

Part of the problem with warcbase is that you need a decently powerful laptop or desktop to run meaningful analysis on small-to-medium-sized collections (although it does run on a Raspberry Pi). While our team has access to a cluster, that’s not the case for everybody. What if you had a few WARC files and wanted to run some analysis on them with warcbase, without fully going through our documented yet sometimes challenging installation process? What if you wanted to deploy it on a cloud provider such as AWS? Maybe you have collections in an Amazon S3 bucket?

Note that this is all part of our exploration of ways that could eventually bring warcbase to a wider audience. It’s still command line based, unfortunately, and requires knowledge of the AWS stack. But for technically-inclined people or developers with web archival collections, we’re moving closer to helping you work with your collections.

Accordingly: enter the Warcbase Workshop repository!

Web Archives for Historical Research Group Receives Support from Microsoft Research

(x-posted from our research group’s page)

The Web Archives for Historical Research Group is happy to announce that we’ve received a $20,000 USD Microsoft Azure Research award. We’ll be using their HDInsight Spark service to run Warcbase jobs on our large collections, including selected Archive-It and Internet Archive data.

One of the bottlenecks in our work has been processing time, and having a dedicated Spark cluster will let us engage in both exploratory research and the sustained processing needed to make our research projects a reality.

Our sincerest thanks to Microsoft Research, as well as the University of Waterloo’s Arts Advancement Office and Principal Gifts team for facilitating the application.

Processing Archival Collections en Masse with Warcbase

As part of our Web Archives for Longitudinal Knowledge project, we’ve been signing Memorandums of Agreement with Canadian Archive-It partners, ingesting their web archival collections into our Compute Canada system, and generating derivative datasets. An example of these collections is the University of Toronto’s Canadian Political Parties collection: a discrete collection on a focused topic, exploring matters of interest to researchers and everyday Canadians. As the size and number of collections begin to creep upwards – we’ve got about 15 TB of data spread over 46 collections (primarily from Alberta, Toronto, and Victoria) – our workflows need to scale to deal with this material.

More importantly, when one’s productivity is impacted by unfolding world events (it’s been a long week!), scripting means that the work still gets done.

ACM/IEEE Joint Conference on Digital Libraries: CFP Now Available

I’m incredibly honoured to be one of the Program Co-Chairs of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), being held at the University of Toronto this June.

Our Call for Papers is now online, and I’d love to see your submissions. As a digital humanist myself, I’d encourage anybody who’s unfamiliar with the ACM/IEEE world of conferences but interested in submitting a paper to reach out to me. I’d love to hear your ideas, thoughts, and beyond.

Looking forward to seeing many of you in June! For the Call for Papers, please read on:

CFP: SAGE Handbook of Web History

Niels Brügger and I have sent this out to a few listservs, so I decided to cross-post it here on my blog as well. Do let me know if you have any questions!

The web has now been with us for almost 25 years: new media is simply not that new anymore. It has developed to become an inherent part of our social, cultural, and political lives, and is accordingly leaving behind a detailed documentary record of society and events since the advent of widespread web archiving in 1996. These two key points lie at the heart of our in-preparation SAGE Handbook of Web History: that the history of the web itself needs to be studied, but also that its value as an incomparable historical record needs to be examined as well. Within the last decade, considerable interest in the history of the Web has emerged. However, there is no comprehensive review of the field. Accordingly, our SAGE Handbook of Web History will provide an overview and point to future research directions.