I’m back from the great annual Digital Humanities conference in Krakow (and a nice, two-week follow-up vacation), and have returned to the ever-growing warcbase platform. One of our research assistants, Youngbin Kim, has been working on some image extraction commands, and I was looking forward to putting them to the test.
Finding popular images can be difficult. In the past, we have used filename-based frequency, or have used the actual images themselves (if hotlinked to each other), but that hasn’t been sufficient. Hotlinking was generally frowned upon in GeoCities, given bandwidth limitations (a hotlinking user was stealing bandwidth from other users). Filenames are also not always descriptive (i.e. 00015b.gif).
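One way around uninformative filenames is to count images by their content rather than their names. This is just a minimal illustrative sketch, not warcbase’s actual image extraction command: it hashes the raw bytes of each extracted image, so identical images are counted together even when users saved them under different names.

```python
import hashlib
from collections import Counter

def image_fingerprint(data: bytes) -> str:
    """Hash the raw bytes so identical images match even under
    different filenames (e.g. 00015b.gif vs. background.gif)."""
    return hashlib.md5(data).hexdigest()

# Toy stand-in for images extracted from an archive: (filename, bytes).
# Real image payloads would come from the WARC/ARC records themselves.
extracted = [
    ("00015b.gif", b"GIF89a...sparkle"),
    ("background.gif", b"GIF89a...sparkle"),  # same image, different name
    ("construction.gif", b"GIF89a...digger"),
]

counts = Counter(image_fingerprint(data) for _, data in extracted)
most_common_hash, freq = counts.most_common(1)[0]
```

Here the sparkle GIF is counted twice despite its two different filenames, which is exactly the kind of popularity signal a filename-based frequency count would miss.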
What does this mean? Basically, over the next year I’ll be hosting the following public events in Toronto. This will primarily take place in the January – May timeframe, and I will be in Toronto roughly once a week during this period. It is also an excuse to be physically proximate to great collaborators: folks at the DCI, Toronto libraries (especially Nick Worby, who I’ve worked with quite a bit), and York (where my frequent collaborator Nick Ruest is based).
Give an Invited Lecture: I’ll be giving a Coach House Institute lecture on the findings of the Fellowship research project, discussed below;
Organize a Marquee Event: I’d like to help the DCI with bringing in a high-profile invited speaker to discuss web archiving. Maybe I can score some free canapés.
Most importantly, I’ll be carrying out a research project on qualitative comparisons of web archival content, specifically the kinds of content curated using a social media approach versus a manually-curated professional one.
As part of the Archives Unleashed hackathon, the Library of Congress graciously provided access to several of their collections. Jimmy Lin and I worked with one of the teams, “The Supremes,” to see if we could generate useful scholarly derivatives from the underlying collections.
The team was called “The Supremes” for an apt reason: we worked with web archival data around the nominations for Justice Alito and Justice Roberts. These were two nominations that began in 2005, and contained legal blogs, Senatorial discussions, and other content relevant to those nominations.
As it was a datathon with limited time and resources, we used data subsets:
Alito – 51 GB, 1.8 million records, 1.2 million pages
Roberts – 41 GB, 1.4 million records, 1.0 million pages
Given the age of these collections, rather than being in WARC format, they were actually in the earlier (now deprecated) ARC format. But still, we were able to generate results quickly.
Last week, I had the pleasure of co-hosting our “Archives Unleashed 2.0 Hackathon” at the Library of Congress, along with Matthew Weber (Rutgers), Jimmy Lin (Waterloo), Nathalie Casemajor (Université du Québec en Outaouais), and Nicholas Worby (Toronto). While a lot of our time was taken up by facilitating the smooth running of the event – providing virtual machines, ensuring people had great test datasets, making sure that people knew when fresh coffee arrived – we also had time to participate and hack within some of the teams.
Why did this datathon matter?
I was asked to give a short presentation about the datathon to the Saving the Web Symposium, organized by Dame Wendy Hall and the Kluge Center immediately following our hackathon.
Nick Ruest and I have a piece that’s just come out in Code4Lib Journal. The article takes readers through (a) why Twitter matters for event archiving and future historical research; (b) how you can collect data yourself; and (c) how you can analyze the data. You can read the abstract below, and check out the article here!
As always, hope you enjoy reading it, and if you have any comments, questions, or anything, we are always happy to hear from you.
Nick Ruest, Anna St-Onge, and I have a piece that’s just come out in the open-access journal Digital Studies / Le champ numérique. The deliberately acronym-heavy title introduces an article that takes us through the process of (a) creating a web archive; (b) preserving and providing access to the files; and (c) running some basic analysis on it from the perspective of a historian. While some of the text analysis done in the latter part of the article predates more recent warcbase developments, I think it still provides a useful conceptual approach.