Niels Brügger and I have sent this out to a few listservs, so I decided to cross-post it here on my blog as well. Do let me know if you have any questions!
The web has now been with us for almost 25 years: new media is simply not that new anymore. It has become an inherent part of our social, cultural, and political lives, and is accordingly leaving behind a detailed documentary record of society and events since the advent of widespread web archiving in 1996. These two key points lie at the heart of our in-preparation SAGE Handbook of Web History: the history of the web itself needs to be studied, but its value as an incomparable historical record needs to be explored as well. Within the last decade, considerable interest in the history of the web has emerged; however, there is no comprehensive review of the field. Accordingly, our SAGE Handbook of Web History will provide an overview and point to future research directions. Continue reading
Our notebook. Click through to find it.
Our project team uses a number of languages: Scala with warcbase, lots of shell commands when manipulating and analyzing textual data (especially social media, as Nick and I wrote about here), and Mathematica when we want to leverage the power and relative simplicity of that language.
William J. Turkel and I have been working a bit on getting WARC files to play with Mathematica. For larger numbers of files, warcbase is still the solution. But for a small collection – say a few WARCs created with webrecorder.io – this might be a lighter-weight approach. Indeed, I can see myself doing this if I went out around the web with WebRecorder, grabbed some sites (say public history sites or the like), and wanted to do some analysis on them.
Bill and I developed this together: he cooked up the record to association bit (which is really the core of this code), and I worked on getting us to be able to process entire WARCs and generate some basic analysis. It was also fun getting back into Mathematica, after living in Scala and Bash. Continue reading
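The record-to-association idea at the core of that code – turning each WARC record into a mapping of headers plus a payload – can be sketched outside Mathematica too. Here is a minimal, hypothetical Python sketch that parses one record from an uncompressed WARC stream using only the standard library; the sample record is synthetic, and field names follow the WARC 1.0 specification rather than anything in our actual notebook.

```python
from io import BytesIO

def read_warc_record(stream):
    """Read one record from an uncompressed WARC stream.
    Returns (headers_dict, payload_bytes), or None at end of stream."""
    version = stream.readline().strip()
    if not version:
        return None
    headers = {}
    # Header lines run until a blank CRLF line
    for line in iter(stream.readline, b"\r\n"):
        key, _, value = line.decode("utf-8").partition(":")
        headers[key.strip()] = value.strip()
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # consume the trailing \r\n\r\n record separator
    return headers, payload

# Synthetic example record (real WARCs, e.g. from webrecorder.io, are
# usually gzipped; wrap the file with gzip.open first in that case).
body = b"hello web archive"
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: " + str(len(body)).encode() + b"\r\n"
          b"\r\n" + body + b"\r\n\r\n")
headers, payload = read_warc_record(BytesIO(record))
```

In practice a library such as warcio handles the edge cases; the sketch just shows why the headers-plus-payload association is a natural shape for each record.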
As part of my McLuhan fellowship, I’ve been laying the groundwork for the work we’ll be doing over the Fall and Winter by generating sets of derivative datasets to run our analyses on. As noted, one of the main goals of the Fellowship is to develop new approaches to comparing the content found within web archives (to see, for example, whether and in what respects a web archive created by curators differs from a web archive created by users on a hashtag).
I’m not providing much analysis here because that’s what we’ll be doing over the year, so this is mostly just focused on the data wrangling.
One of the approaches that I’m hoping to use is to compare the named entities present within web archives. So, for example, what locations are discussed in a web archive crawled from a seed list of URLs tweeted by everyday users on a hashtag versus the web archive crawled by a seed list manually generated by a librarian? Continue reading
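Once the named entities have been extracted from each archive (the NER step itself – with a tool like Stanford NER or spaCy – is assumed done here), the comparison reduces to set and frequency operations. This is a hedged Python sketch with made-up location counts, just to show the shape of the comparison:

```python
from collections import Counter

# Hypothetical location counts from each archive's extracted text;
# the real numbers would come from an NER pass over the WARC content.
hashtag_archive = Counter({"Toronto": 120, "Ottawa": 45, "Waterloo": 30})
curated_archive = Counter({"Toronto": 200, "Ottawa": 10, "Kingston": 25})

# Which locations appear in both, and which only on the hashtag?
shared = hashtag_archive.keys() & curated_archive.keys()
only_hashtag = hashtag_archive.keys() - curated_archive.keys()

# One simple overlap measure: Jaccard similarity of the vocabularies
union = hashtag_archive.keys() | curated_archive.keys()
jaccard = len(shared) / len(union)
```

From there one could weight by frequency, or compare rank orderings, rather than just vocabulary overlap.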
I’m back from the great annual Digital Humanities conference in Krakow (and a nice, two-week follow-up vacation), and have returned to the always growing warcbase platform. One of our research assistants, Youngbin Kim, has been working on some image extraction commands, and I was looking forward to putting them to the test.
Finding popular images can be difficult. In the past, we have used filename-based frequency, or have used the actual images themselves (if hotlinked to each other), but neither has been sufficient. Hotlinking was generally frowned upon in GeoCities, given bandwidth limitations (a hotlinking user was stealing bandwidth from other users). Filenames are also not always descriptive.
One way around this is to play with the unique hash of each image. In the past, I’ve used hashes when calculating the frequency of popular images in the GeoCities Archive Team torrent. The problem with my method was that it didn’t really scale: we want to make sure everything works within a cluster. And now that we have a set of WARCs from the Internet Archive, let’s see what we can do with them… Continue reading
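The hash-and-count idea itself is simple, and worth seeing in miniature. This is not warcbase’s actual image-extraction command (that runs in Scala on a cluster); it is a standard-library Python sketch with toy byte strings standing in for image payloads pulled out of WARC records:

```python
import hashlib
from collections import Counter

def image_hash(data: bytes) -> str:
    """Content hash of an image's raw bytes; identical images
    share the same digest even under different filenames."""
    return hashlib.md5(data).hexdigest()

# Toy stand-ins for image payloads extracted from WARC records
images = [b"\x89PNG cat", b"\x89PNG cat", b"GIF89a star", b"\x89PNG cat"]

# Count how often each distinct image appears across the archive
counts = Counter(image_hash(img) for img in images)
most_common_hash, freq = counts.most_common(1)[0]
```

The scaling problem is exactly that this single-process counting breaks down at archive scale, which is why the cluster version groups by hash in a distributed job instead.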
Most importantly, I’ll be getting to work in my favourite building in North America!
Great news! Starting on July 1st, I’m the inaugural Marshall McLuhan Centenary Fellow in Digital Sustainability, held at the University of Toronto’s Digital Curation Institute, which is housed in their Faculty of Information. The DCI is led by Christoph Becker, who I’m really looking forward to working with more over the next 12 months (as well as his great graduate students).
What does this mean? Basically, over the next year I’ll be hosting the following public events in Toronto. This will primarily take place in the January – May timeframe, and I will be in Toronto roughly once a week during this period. It is also an excuse to be physically proximate to great collaborators: folks at the DCI, Toronto libraries (especially Nich Worby, who I’ve worked with quite a bit), and York (where my frequent collaborator Nick Ruest is based).
- Workshops: I’ll run a web archiving analysis workshop in Toronto, probably focusing on the warcbase platform – perhaps riding the coattails of the great virtual machine and repository that Nick Ruest developed. I would also like to run a workshop on Twitter archiving and analysis.
- Give an Invited Lecture: I’ll be giving a Coach House Institute lecture on the findings of the Fellowship research project, discussed below.
- Organize a Marquee Event: I’d like to help the DCI with bringing in a high-profile invited speaker to discuss web archiving. Maybe I can score some free canapés.
Most importantly, I’ll be carrying out a research project on qualitative comparisons of web archival content, specifically the kinds of content curated using a social media approach versus a manually-curated professional one. Continue reading
As part of the Archives Unleashed hackathon, the Library of Congress graciously provided access to several of their collections. Jimmy Lin and I worked with one of the teams, “The Supremes,” to see if we could generate useful scholarly derivatives from the underlying collections.
The team was called “The Supremes” for an apt reason: we worked with web archival data around the nominations of Justice Alito and Justice Roberts. Both nominations began in 2005, and the collections contained legal blogs, Senatorial discussions, and other content relevant to those nominations.
As it was a datathon with limited time and resources, we used data subsets:
- Alito – 51 GB, 1.8 million records, 1.2 million pages
- Roberts – 41 GB, 1.4 million records, 1.0 million pages
Given the age of these collections, rather than being in WARC format, they were actually in the earlier (now deprecated) ARC format. But still, we were able to generate results quickly.
After two hours of Jimmy painstakingly hunting down some malfunctioning ARC files (the juicy details on how we’re going to fix that can be found here), the analysis began.
Within five minutes, we had useful scholarly derivatives and were already raising research questions. Continue reading
Me presenting our final datathon projects at the closing symposium.
Last week, I had the pleasure of co-hosting our “Archives Unleashed 2.0 Hackathon” at the Library of Congress, along with Matthew Weber (Rutgers), Jimmy Lin (Waterloo), Nathalie Casemajor (Université du Québec en Outaouais), and Nicholas Worby (Toronto). While a lot of our time was taken up by facilitating the smooth running of the event – providing virtual machines, ensuring people had great test datasets, making sure that people knew when fresh coffee arrived – we also had time to participate and hack within some of the teams.
Why did this datathon matter?
I was asked to give a short presentation about the datathon at the Saving the Web Symposium, organized by Dame Wendy Hall and the Kluge Center and held immediately following our hackathon. Continue reading