On-the-Fly WARC Finding Aids with Mathematica and Stanford NER

A sample of the output, full version posted below.
I’ve had an entire day today to mostly play with my research (which is pretty rare in November). I had a chance to update a script I worked on a few months ago. It required some tinkering to get up and running under OS X 10.9 and, unfortunately, still requires a Mathematica license at this point, but as a concept I hope some might find it interesting.

Everything is up on my GitHub here.

Basically, with the prerequisites installed, this is a script that runs with one command:

sh ./WARC-to-Analysis-ner.sh ianmilligan.warc "history"

Here ianmilligan.warc can be replaced by your own WARC file, and "history" by a keyword that you are particularly interested in viewing in context.

It does the following:

  1. Turns your raw WARC file into a full-text searchable repository using WARC Tools;
  2. Uses MALLET to topic model your full text;
  3. Generates a PDF, using Mathematica, that can serve as a rudimentary overview of what that web archive contains. By default, you get the following:
    1. Web URL
    2. Date of WARC Scrape
    3. Short Text Preview
    4. Simple Word Cloud of Frequency
    5. Keyword in Context of the selected keyword
    6. Top 50 extracted people names
    7. Top 50 extracted location names
    8. Top 50 extracted organization names
  4. MALLET Topics, arranged with sparklines so that you can see how each topic is distributed throughout the archive: is it a widely distributed topic, or is it present in only a few parts of the archive?
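
To make the "Keyword in Context" item a bit more concrete, here is a minimal Python sketch of the general idea. This is not the Mathematica code the script actually uses; the function name keyword_in_context and the width parameter are hypothetical, picked only for illustration, and the sample text is made up.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return (left, match, right) tuples for each whole-word hit.

    Hypothetical helper for illustration only; matching is
    case-insensitive and `width` is the number of characters of
    context kept on each side of the keyword.
    """
    # Collapse runs of whitespace so newlines do not break the context window.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


if __name__ == "__main__":
    # Made-up sample text standing in for the full text extracted from a WARC.
    sample = (
        "Digital history is changing quickly. Web archives will matter "
        "for anyone writing the history of the 1990s."
    )
    for left, word, right in keyword_in_context(sample, "history", width=20):
        # Right-align the left context so the keyword lines up in a column.
        print(f"{left:>20} [{word}] {right}")
```

Lining the keyword up in a column like this is what makes a keyword-in-context listing easy to scan: you can read down the page and see at a glance what kinds of phrases surround the term in the archive.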

I am planning to test it out on the 80TB Wide Scrape as well as my GeoCities archive, so hopefully we’ll get some ‘in the field’ responses about whether this is useful or not! Edited to quickly note that the NER results are really, really messy – but I think at a glance they give you a sense of what an archive contains. Am looking forward to playing with this some more though.

Example output below:


