Running Shine Locally on a Collection of ARC/WARC Files

TL;DR? You can find a walkthrough here. To find out what it does, read on.

As Twitter followers will know, I’ve been playing with the UK Web Archive’s Shine front-end over the last week or so. I think it’s a fantastic front-end to a collection: it gives you a bird’s-eye view of the collection while letting you dive into concordances or the pages themselves. I’m currently indexing the material in the University of Toronto’s Canadian Political Parties and Political Interest Groups Archive-It collection and exploring it.

Results are still indexing (I’ve got about 10% left) so there may be changes to the data below, but it allows us to do things like this:

[Four screenshots, taken on 2015-05-27]

Observers of the Canadian political scene may find some of the above examples intriguing. Again, once indexing is done tonight and I’m able to port it over to another machine, I’ll put some results up. Continue reading

International Internet Preservation Consortium Annual General Meeting 2015: Recap

(x-posted from Web Archives for Historians)

I had a fantastic time at the International Internet Preservation Consortium’s Annual General Meeting this year, held on the beautiful campus of Stanford University (with a day trip down to the Internet Archive in San Francisco). It’s hard to write these sorts of recaps: I had such an amazing time, and my head is so full of great ideas, that it’s difficult to do everything the justice it deserves. Many of the presentation slide decks are available on the schedule, and videos will be forthcoming.

My main takeaways: we’re continuing to see the development of sophisticated access tools to these repositories, coupled with increasingly exciting and sophisticated researcher use of them. There’s a recognition that context matters when understanding archived webpages, a phrase that came up a few times throughout the event. Crucially, there was a lot of energy in the room: there’s a real enthusiasm towards making these as accessible as possible and facilitating their use. I wasn’t exaggerating when I noted to one of the organizers that I wish every conference was like this: leaving me on my flight home with lots of fantastic ideas, hope for the future, and excitement about what can be done. As the recent “Conference Manifesto” in the New York Times noted, that’s not the experience at all conferences!

Read on for a short day-by-day breakdown, with apologies for presentations I couldn’t include or didn’t do full justice to: Continue reading

Using Longitudinal Link Structures to Look at Three Major Canadian Political Parties

Another day, another short post. We’ve been working with Jimmy Lin’s cluster at the University of Maryland – Jimmy’s been helping us run Pig scripts and get access to things. Some more results have been illuminating.

The first script that we ran pulled out all the links to social media platforms (YouTube, Twitter, and Facebook) and aggregated them by top-level domains. This means that if, say, one page on a site linked to a platform, and so did two other pages on the same site, the resulting chart would say that the site linked to that platform three times. Repeat that over 11,620,105 sites collected by the University of Toronto between 2005 and 2009 and you get some neat results. Here we’re looking at Canadian political parties and interest groups.
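The aggregation itself is easy to sketch on a small scale. What follows is not the Pig script we ran on the cluster, just a minimal Mathematica illustration of the same counting idea; the link pairs and the host helper are made up for the example.

```mathematica
(* Hypothetical {source URL, target URL} link pairs; not data from the collection *)
links = {
   {"party.example/news/1.html", "twitter.com/someaccount"},
   {"party.example/news/2.html", "twitter.com/someaccount"},
   {"party.example/about.html", "twitter.com/otheraccount"},
   {"group.example/index.html", "youtube.com/watch"}
   };

(* Illustrative helper: reduce a scheme-less URL to everything before the first "/" *)
host[url_String] := First[StringSplit[url, "/"]]

(* Count how often each source host links to each target host *)
Counts[{host[#[[1]]], host[#[[2]]]} & /@ links]
```

With this sample, the three pages on party.example collapse into a single {"party.example", "twitter.com"} entry with a count of three, which is the kind of number the charts are built from.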

It goes without saying that none of this could happen without collaboration. My desktop would probably still be crunching away on that first question if I were running it locally. Continue reading

A Webarchiving Short Story: The Liberal Party of Canada, 2006-2008


We’ve been playing with Jimmy Lin’s warcbase a bit more, and have been extracting all links from a series of WARCs. Here I decided to zoom in on the Liberal Party of Canada’s website, drag the websites in its modularity class close to it, and see how the links can tell us the story of the party.

In short: we see the election of a new leader (Stéphane Dion), the announcement of his new plan (the Green Shift), the pre-election fundraising (VictoryFund), the rise of an attack ad industry, and the end of it.

Here are the frames below: Continue reading

Using Mathematica to Generate Plain Text Files of Mirrored Web Site Archives

I’ve been spending a bit of time playing with the GeoCities Torrent again (in anticipation of being able to compare it to some of the WARC files), with an eye on a few articles. One script in particular, derived from an old StackOverflow question I asked three years ago, has been extremely helpful. In short, given a directory of websites like so:


Among all of the other files (GIFs, HTML, subdirectories, and subfolders nested within subfolders), I often just want to play with some of the plain text! Luckily, Mathematica has a powerful Import command that can convert HTML into generally good human-readable plain text. For some of my clustering, topic modelling, etc., where I don’t want links or stylistic information, this is really useful.

This isn’t for standard web archives, but more for the torrents, end-of-life dumps, and other things that I find myself working with when I’m playing in the wild.

It’s simple, and I use it a ton, so I figured it might help you too.
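To see what Import does with a single page before running anything over a whole tree, here is a minimal sketch. The file name and the HTML content are invented for the example, and the file is written to the temporary directory so nothing of yours is touched.

```mathematica
(* Write a small, made-up HTML page to the temporary directory *)
tmp = FileNameJoin[{$TemporaryDirectory, "example.html"}];
Export[tmp,
  "<html><body><h1>Hello</h1><p>Some <a href=\"x.html\">linked</a> text.</p></body></html>",
  "Text"];

(* Import the same file as plain text, leaving the markup behind *)
Import[tmp, "Plaintext"]
```

The "Plaintext" element is the same one the full script below relies on, so this is a quick way to check whether the output suits your material.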

The modules:

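(* Re-root each file path from the source tree onto the target directory *)
mapFileNames[source_, filenames_, target_] := 
 Module[{depth = FileNameDepth[source]}, 
  FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames]

(* Mirror every .html file under source as a .txt file under target *)
htmlTreeToPlainText[source_, target_] := 
 Module[{htmlFiles, textFiles, targetDirs}, 
  htmlFiles = FileNames["*.html", source, Infinity]; 
  textFiles = 
   StringReplace[mapFileNames[source, htmlFiles, target], 
    f__ ~~ ".html" ~~ EndOfString :> f ~~ ".txt"]; 
  targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles]; 
  (* Start from a clean slate: an existing target directory is deleted *)
  If[DirectoryQ[target], 
   DeleteDirectory[target, DeleteContents -> True]]; 
  Scan[CreateDirectory[#, CreateIntermediateDirectories -> True] &, 
   targetDirs]; 
  Scan[Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &, 
   Transpose[{htmlFiles, textFiles}]]]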

And then you run it by setting two variables:

origin = the directory that holds the existing directory structure

target = the directory where you want the replicated plain-text directory structure to reside. Note that anything already in the target directory is deleted first.

and run it:

htmlTreeToPlainText[origin, target];

Herrenhausen Big Data Podcast: Coding History on GeoCities

Last post (a three-day conference deserves three posts, right?) for my trip to Hannover, Germany for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation, and wanted to link to it here.

You can listen to the podcast here. Hopefully I am cogent enough!

It grew out of my lightning talk and poster, also available on my blog.

My thanks again to the VolkswagenStiftung for the generous travel grant to make my attendance possible. It was a wonderful conference.