The University of Waterloo and York University have been awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.
The grant, valued at $610,625, supports Archives Unleashed, a project that will develop web archive search and data analysis tools to enable scholars and librarians to access, share, and investigate recent history since the early days of the World Wide Web. It is additionally supported by generous in-kind and financial contributions from Start Smart Labs, Compute Canada, York University Libraries and the University of Waterloo’s Faculty of Arts.
People like Warcbase, so we get lots of questions about how to install it – never a straightforward task for software with multiple dependencies. While we’re hoping to make some significant changes to the process of installing and using Warcbase in the near future, in the short to medium term I wanted to put together a slightly simpler guide.
If you’re interested in using Warcbase to analyze your web archives, you might find this PDF helpful. I will soon have tutorials available for running scripts on your own data and generating a standard set of derivatives.
Well, I certainly won’t win any awards for “most concise chapter title,” but my latest publication, “Welcome to the Web: The Online Community of GeoCities during the Early Years of the World Wide Web,” is now available in the open-access collection The Web as History. This book, edited by Niels Brügger and Ralph Schroeder, has been published by UCL Press, an innovative, fully open-access university press. You can download the entire book as a PDF, or purchase paperback or hardback copies if you prefer.
Anyway, please do feel free to read the chapter if it strikes your fancy. Here’s an excerpt from the introduction:
Another year, another wonderful trip down to the American Historical Association’s annual meeting – this time, down in stormbound yet beautiful Denver, Colorado. I had the pleasure of leading an intermediate session at the Getting Started in Digital History workshop, giving a paper at the conference, and also participating in the Digital Drop-In.
Web Scraping Workshop
I led the web scraping workshop at the Getting Started in Digital History workshop. I’ve tried to make it as accessible as one can online: slides, links, and beyond are all online. If you’re curious about how to grab data online, what scraping resources there are, and how to work with social media, please do check it out.
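To give a flavour of the kind of scraping the workshop covers, here is a minimal sketch in Python using only the standard library. The HTML snippet and its links are invented for illustration; in a real session the page would be fetched from the live web.

```python
from html.parser import HTMLParser

# A minimal link extractor built on the standard library's HTMLParser.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every anchor tag we encounter.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# In practice the HTML would come from urllib.request.urlopen(url).read();
# an inline snippet keeps the example self-contained.
sample_html = """
<html><body>
  <a href="https://example.org/page1">Page 1</a>
  <a href="https://example.org/page2">Page 2</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)
```

From here it is a short step to the richer tooling the workshop materials discuss, such as dedicated scraping libraries and social media APIs.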
Collaborative Digital History Presentation
I gave a paper on collaborative digital history, looking at how our team collaborates and what has made that collaboration successful. I began with our research project’s objectives (web archives and historical research), then offered general thoughts on teamwork and how we’ve achieved success with two of our projects. As a roundtable, the session led to an incredible conversation afterwards.
I had the opportunity to publish “The Problem of History in the Age of Abundance” in the December 16th issue of the Chronicle Review. It is now available online, albeit behind a paywall, though free links have been circulating on social media. You can read it here for a limited time. Writing for the Chronicle was a great experience: fantastic editors, careful copyediting, and thoughtful engagement with my work.
This, in a nutshell, is the core argument of a much larger manuscript that I’m working on, so any general thoughts or comments are appreciated.
At the same time as I was in Denmark presenting at the National Webs workshop, my co-authors and University of Toronto collaborators Emily Maemura and Christoph Becker were presenting a paper we wrote at the IEEE Big Data Computational Archival Science Workshop. This paper represents the first fruits of the McLuhan Fellowship I’m doing at the University of Toronto, and introduces a model for applying the Research Object framework to web archival research.
Abstract is below, and you can find the full pre-print here.
Use of computational methods for exploration and analysis of web archives sources is emerging in new disciplines such as digital humanities. This raises urgent questions about how such research projects process web archival material using computational methods to construct their findings. This paper aims to enable web archives scholars to document their practices systematically to improve the transparency of their methods. We adopt the Research Object framework to characterize three case studies that use computational methods to analyze web archives within digital history research. We then discuss how the framework can support the characterization of research methods and serve as a basis for discussions of methods and issues such as reuse and provenance. The results suggest that the framework provides an effective conceptual perspective to describe and analyze the computational methods used in web archive research on a high level and make transparent the choices made in the process. The documentation of the research process contributes to a better understanding of the findings and their provenance, and the possible reuse of data, methods, and workflows.
I had the opportunity to present “Studying the Web in the Shadow of Uncle Sam,” a paper that Tom Smyth (Library and Archives Canada) and I proposed to the National Webs workshop that Niels Brügger and Ditte Laursen hosted in Aarhus, Denmark. Our paper abstract is below, as are the slides that I presented.
Hopefully more cool things can be announced soon (and the paper should appear in an edited collection, hopefully in early 2018 – our drafts are due in March).
What is the Canadian Web? While Canada does have the .ca top-level domain, that domain does not capture the country: universities, governmental institutions, and some companies use the .ca TLD, but many other corporations, small businesses, bloggers, and others gravitate towards .com, .org, or .net (a question we briefly explore in our paper). In short, the .ca domain is a relatively niche player, and analyses based on the top-level domain alone would be skewed towards certain kinds of content providers. This presents a considerable challenge for national libraries and for researchers taking a national perspective on an inherently global network.
We will approach this question in three ways within this paper. First, we present the state of the Canadian Web. Drawing on initial work by Library and Archives Canada and twenty-five Archive-It partners in Canada, we discuss what it means to study the Canadian Web. Second, we explore what work has been done to date: how Library and Archives Canada and Canadian partners have embraced the challenge and what they have been collecting. How do librarians and archivists select their seeds in this context, and does the result approach a national web? This collection development strategy is an interim one, laying the foundation for greater domain-crawling capacity: “thematic web collections” steward parts of the Canadian Web, with a recognition of their stopgap nature. Finally, we chart various paths forward towards a domain crawl of the “Canadian Web,” highlighting the Web Archives for Longitudinal Knowledge (WALK) project, which is beginning to integrate disparate web archives across the country.
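The skew described above can be made concrete with a small sketch: given a list of seed URLs, count how they distribute across top-level domains. The seed list below is entirely invented for illustration; a real analysis would draw on Archive-It seed lists or crawl data.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical seed list, invented for illustration only.
seeds = [
    "http://www.canada.ca/en.html",
    "https://uwaterloo.ca/",
    "http://example-smallbusiness.com/",
    "https://canadian-blogger.org/posts",
    "http://another-company.com/about",
]

def top_level_domain(url):
    """Return the last label of the URL's hostname, e.g. 'ca' or 'com'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]

# Tally how the seeds distribute across TLDs.
tld_counts = Counter(top_level_domain(url) for url in seeds)
print(tld_counts.most_common())
```

Even this toy tally shows the point: restricting a study to .ca would miss the Canadian sites living under .com and .org.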