Using Mathematica to Generate Plain Text Files of Mirrored Web Site Archives

I’ve been spending a bit of time playing with the GeoCities Torrent again (in anticipation of being able to compare it to some of the WARC files), with an eye toward a few articles. One script in particular, derived from an old Stack Overflow question I asked three years ago, has been extremely helpful. In short, given a directory of websites like so:


Among all of the other files (GIFs, HTML, subdirectories, and subfolders nested inside subfolders), I often just want to play with some of the plain text! Luckily, Mathematica has a powerful Import command that can convert HTML into generally good human-readable plain text. For some of my clustering, topic modelling, etc., where I don’t want links or stylistic information, this is really useful.
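As a rough sketch of what that Import step does on a single page, the snippet below writes a tiny HTML file to the temporary directory and reads it back as plain text. The sample markup, the file name sample-page.html, and the variable names are all made up for illustration; only Import with the "Plaintext" element comes from the script in this post.

```mathematica
(* Hypothetical sample page; the markup and file name are invented for this example. *)
html = "<html><head><title>Test</title></head><body><h1>My Homepage</h1><p>Welcome to my <a href=\"links.html\">links</a> page.</p></body></html>";

(* Write the markup to a temporary .html file so Import can detect the format. *)
tmp = FileNameJoin[{$TemporaryDirectory, "sample-page.html"}];
Export[tmp, html, "Text"];

(* The same call the script uses: keep the readable text, drop tags and link targets. *)
text = Import[tmp, "Plaintext"];

(* Clean up the temporary file and show the result. *)
DeleteFile[tmp];
text
```

The exact whitespace in the result can vary, but the tags and the href should be gone, which is what makes the output handy for clustering and topic modelling.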

This isn’t for standard web archives, but more for the torrents, end-of-life dumps, and other things that I find myself working with when I’m playing in the wild.

It’s simple, and I use it a ton, so I figured it might help you.

The modules:

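(* Map each file path under source to the matching path under target. *)
mapFileNames[source_, filenames_, target_] := 
 Module[{depth = FileNameDepth[source]}, 
  FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames]

(* Mirror every .html file under source as a .txt file under target. *)
(* Note: an existing target directory is deleted, contents and all. *)
htmlTreeToPlainText[source_, target_] := 
 Module[{htmlFiles, textFiles, targetDirs}, 
  htmlFiles = FileNames["*.html", source, Infinity]; 
  textFiles = StringReplace[
    mapFileNames[source, htmlFiles, target], 
    f__ ~~ ".html" ~~ EndOfString :> f ~~ ".txt"]; 
  targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles]; 
  If[FileExistsQ[target], 
   DeleteDirectory[target, DeleteContents -> True]]; 
  Scan[CreateDirectory[#, CreateIntermediateDirectories -> True] &, 
   targetDirs]; 
  Scan[Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &, 
   Transpose[{htmlFiles, textFiles}]]]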

And then you run it by setting two variables:

origin = the directory that holds the mirrored sites (the original directory structure)

target = the directory where you want the replicated plain-text directory structure to reside

and run it:

htmlTreeToPlainText[origin, target];

Herrenhausen Big Data Podcast: Coding History on GeoCities

Last post (a three-day conference deserves three posts, right?) from my trip to Hannover, Germany for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation, and wanted to link to it here.

You can listen to the podcast here. Hopefully I am cogent enough!

It grew out of my lightning talk and poster, also available on my blog.

My thanks again to the VolkswagenStiftung for the generous travel grant to make my attendance possible. It was a wonderful conference.

Herrenhausen Big Data Lightning Talk: Finding Community in the Ruins of GeoCities

I was fortunate to receive a travel grant to present my research in a short, three-minute slot plus poster at the Herrenhäuser Konferenz: Big Data in a Transdisciplinary Perspective in Hanover, Germany. Here’s what I’ll be saying (pretty strictly) in my slot this afternoon. Some of it is designed to respond to the time format (if you scroll down you will see that there is an actual bell). 

If you want to see the poster, please click here.


Big Data is coming to history. The advent of web archived material from 1996 onwards presents a challenge. In my work, I explore what tools, methods, and approaches historians need to adopt to study web archives.

GeoCities lets us test this. It will be one of the largest records of the lives of non-elite people ever. The Old Bailey Online can rightfully describe their 197,000 trials as the “largest body of texts detailing the lives of non-elite people ever published” between 1674 and 1913. But GeoCities, drawing on the material we have between 1996 and 2009, has over thirty-eight million pages.

These are the records of everyday people who published on the Web, reaching audiences far bigger than previously imaginable.

Rebel Youth Shortlisted for the Macdonald Prize

While this blog has mostly been focusing on some of my new work lately, my first book – which grew out of my doctoral dissertation and which was published last summer – has been getting some good early positive buzz.

Most importantly, I learned on Wednesday that my book Rebel Youth: 1960s Labour Unrest, Young Workers, and New Leftists in English Canada was shortlisted for the Sir John A Macdonald Prize, which is given to the best book of Canadian historical non-fiction. You can read about it (and the four other books that made the cut) here. I’m shocked and honoured to be in such esteemed company. The final results will be announced in early June at the Canadian Historical Association’s annual meeting. Hope to see some of you there.

Otherwise, the first few reviews have come in… and they’re positive so far (I know that this will change). BC Studies has a positive advance review up on their website. Blacklock’s Reporter was first out the gate in their review, writing: “Ian Milligan, professor of history at the University of Waterloo, corrects the record through meticulous research and interviews.” Briarpatch Magazine includes it in a review of three books from UBC Press, noting that the books serve as “important reminders that history is not easily bookended by the arbitrary markers of decades and is instead a living process that holds new discoveries for those willing to dig deeper and learn vital lessons for today’s struggles.” Finally, Choice Reviews Online has given it a “highly recommended” rating in their review, arguing that “[t]his clearly and accessibly written book is a wonderful contribution to the history of the turbulence of the 1960s.”

Forgive my horn tooting. We’ll be back to our regularly-scheduled programming soon – and I actually have some pretty cool news to share with everybody in a few weeks (he says cryptically).

Adapting the Web Archive Analysis Workshop to Longitudinal Gephi:

The Problem

This script helps us get longitudinal link analysis working in Gephi


When playing with WAT files, we’ve run into the issue of getting Gephi to play accurately with the shifting node IDs generated by the Web Archive Analysis Workshop. It’s certainly possible – this post demonstrated the outcome when we could get it working – but it’s extremely persnickety. That’s because node IDs change for each time period you’re running the analysis: a host could be node ID 187 in 2006, but then be remapped to node ID 117 in 2009. It’s best if we just turn it all into textual data for graphing purposes.
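To make the idea concrete, here is a minimal sketch of swapping numeric IDs for host names so that the same host lines up across years. The host names, the extra IDs, the in-memory tables, and the helper toNamedEdges are all hypothetical; this is not the workshop’s actual file format or its script, just the general technique, reusing the 187 and 117 IDs from the example above.

```mathematica
(* Hypothetical ID-to-host tables for two crawl years; the host names are made up. *)
ids2006 = <|187 -> "example.ca", 42 -> "another.ca"|>;
ids2009 = <|117 -> "example.ca", 8 -> "another.ca"|>;

(* Hypothetical edge lists expressed as {sourceID, targetID} pairs. *)
edges2006 = {{187, 42}};
edges2009 = {{117, 8}};

(* Replace every numeric ID in an edge list with its host name. *)
(* Mapping at level {2} looks up each ID inside each {source, target} pair. *)
toNamedEdges[idMap_Association, edges_List] := 
 Map[Lookup[idMap, #] &, edges, {2}];

namedEdges2006 = toNamedEdges[ids2006, edges2006];
namedEdges2009 = toNamedEdges[ids2009, edges2009];

(* Once the IDs are replaced by names, the two years describe the same edge. *)
namedEdges2006 === namedEdges2009
```

With textual labels, the edge lists from different years no longer depend on which number a host happened to receive in a given run.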

Let me explain. If you’re trying to generate a graph of the host-ids-to-host-ids, you get two files. They come out as Hadoop output files (part-m-00000, etc.), but I’ll rename them to match the commands used to generate them. Let’s say you have:

[note: this functionality was already in the GitHub repository, but not part of the workshop. There is lots of great stuff in there. As Vinay notes below, there’s a script that does this – found here.]


Using Warcbase to Generate a Link Graph of the Wide Web Scrape

1,719,167 links between roughly 622,000 .ca websites captured in 2011.


In this short post, I want to take you from a collection of WARC files to a webgraph, which you can see pictured at left.

If you’ve been following me on Twitter, you know that I’ve been playing around with warcbase, part of my overall exploration into web archives. Warcbase is an “open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge.” While I’m not terribly fluent at these platforms – getting better at them is one of the goals of my upcoming junior sabbatical – Jeremy Wiebe, who’s been working with me on this project, is. He wrote a tutorial, “Building and Running Warcbase under OS X” for me.

Using Wiebe’s tutorial, you can ingest data into HBase and do a variety of things with it. For this task, I focused on the last section labelled ‘Pig Integration.’ If you want to ingest WARC files rather than ARC files, you need to call “WarcLoader” instead of “ArcLoader” on line 3. You then set your directory with WARCs on line 6, and your output directory on line 12.