Here is the rough text of what I’ll be presenting at the CHA. I tend to ad-lib a bit, but this should give you a sense of what I’m presenting on. It’s a twenty-minute timeslot in front of a general audience. As I noted earlier, there is a full-text paper that drives the presentation. If you want it, drop me a line.

CHA 2013.001

Hi everybody and thank you for coming to my talk, “The Internet Archive and Social Historians: Challenge and Potential Amidst the WebARChive files.” I want to begin with something that I think we’ll all find familiar.
CHA 2013.002
Archival boxes come in different shapes and sizes, but are a familiar sight to historians. Generations of thought and design have gone into these boxes: they are specifically designed to protect documents over a long time period and reduce acidity, and they can withstand considerable physical wear and tear so that they rarely need replacing. No fewer than six International Organization for Standardization (ISO) specifications go into the creation and maintenance of physical archives. Historians played a large role in the establishment of the archival profession, a voice that has been supplanted in recent years by the rise of library and information schools. (more…)

Next Monday, June 3rd, at the University of Victoria as part of the Canadian Historical Association’s Annual Meeting [PDF program here], I’ll be presenting a twenty-minute paper entitled “The Internet Archive and Social Historians: Challenge and Potential Amidst the WebARChive Files.” It’ll be a hopefully entertaining (at the very least provocative) romp through the history of the Internet Archive and its promises and potentials (pausing to highlight issues of scope, technical limitations, and ethical considerations), before taking us through two workflows to open our own web archival files.

As I’ll note there, and have elsewhere, this is my attempt to bring a historical end user’s perspective to bear on this issue. There’s a decently-sized conversation about web archives out there, and I want more historians to be at the table to chat about it.

I won’t be reading my paper, as I don’t like that as an audience member, and figured about a year and a half ago that I’d stop going with the pack. The presentation will be up at some point in mid-June when I am back from conference travelling.

However, even if I won’t be reading the paper, you can. If you’re a CHA member, it’ll be up on their website. If not, shoot me an e-mail or Twitter @ or DM and I’ll send you a copy (with the usual provisos that it’s a draft, work-in-progress, don’t cite, will be considerably re-worked, etc. etc.).

Output

After a great day yesterday at Code4Lib North, and playing with some of Nick Ruest’s WARC files from the ‘Free Dale Askey’ collection, I’ve put everything together with a bash script and a Mathematica script.

I’ll be playing with some of this stuff in two weeks at the Canadian Historical Association’s annual meeting in Victoria, in my presentation entitled “The Internet Archive and Social Historians: Challenge and Potential Amidst the WebARChive Files.” Slides and text’ll be up later, and I will also make the full text paper available to interested parties.

I see a potential use for this in collections of WARC files where there are no finding aids, just a bunch of files, which I think can be relatively common in ‘just-in-time’ grabs or in other case studies. In the best case, I think the script used for the Dale Askey collection is great – each file has an attached screenshot and PDF in addition to the WARC – but this has a role for cases where little information is attached (i.e. more useful for cases like this one than this one with more data).

Here’s how it works. Unfortunately, you need Mathematica.

Two files:
All-Together-MMA.sh: Takes a WARC file passed to it from the command line, generates full-text, topic models it, and then invokes…
WARC-to-Analysis-single-file.m: a Mathematica script that generates the PDF file discussed in the last post.

How to run it:

On the command line, once the script is made executable (otherwise prepend sh):
./all-together-mma.sh 00016-2013_02_23.warc

This takes the WARC file, one of the Dale Askey collection, and runs it through the script. With proper directories set in the files, it generates output as above in one step. A big benefit of this is that I can now automate this across a ton of WARC files.
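The automation mentioned above could look something like the following. This is only a minimal sketch in Python (not part of the original bash and Mathematica workflow), assuming the script sits in the current directory under the name used in the command above and that every WARC file in a folder ends in .warc. The function names build_commands and run_all are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(warc_dir, script="./all-together-mma.sh"):
    """Return one command (as an argv list) per .warc file, in sorted order."""
    return [[script, str(path)] for path in sorted(Path(warc_dir).glob("*.warc"))]


def run_all(warc_dir, script="./all-together-mma.sh"):
    """Run the script once per WARC file and return the files whose run failed."""
    failed = []
    for argv in build_commands(warc_dir, script):
        result = subprocess.run(argv)  # blocks until this file is processed
        if result.returncode != 0:
            failed.append(argv[1])
    return failed
```

Calling run_all on a folder of WARC files would then process them one after another and hand back a list of any that need a second look.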

Work to do:

– Need to refine stop words
– Topic models are set up for large corpuses, so running 50 topics on a single page is overkill.
– Sparklines are set up for large corpuses as well, so output is weird on one page. It can still be moderately useful, though.
– Integrate with AlchemyAPI for sentiment analysis? Multiple KWICs?
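To illustrate the sparkline point in the list above, here is a rough Python sketch (the real output comes from Mathematica, so this is an assumption-laden stand-in, and the sparkline function is hypothetical). It draws one bar per document for a topic's weight, which shows why a single page yields a single, uninformative bar.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Map a sequence of numbers onto eight bar heights, one character each."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # No variation (including the one-document case): a flat line.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    return "".join(BARS[int((v - lo) / span * top)] for v in values)
```

With many documents the bars rise and fall as a topic comes and goes; with one document there is nothing to compare against, so the line is flat.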

This program creates a PDF like this from a WARC file, building on previous work.
My earlier workflow took a WebARChive (WARC) file (or generated one based on a website) and, using an older version of WARC-Tools, generated a full text index. It finished by generating an output in the Stanford Termite browser. Given the size of these archives, and the sheer amount of text within them, I believe that these visualizations help us ‘see through the box,’ as it were, and ascertain their relevancy for our research topics.

I also have the ulterior motive of getting historians more involved in this topic (there are certainly a few already working in the field) – we were around to design the first generations of archival boxes (historians were fundamental in the early days of the archival profession), and I want us around as we tackle the second generation of the web archive box.

Today, using Mathematica, I developed my script further. It can be called from the command line if you have Mathematica, and so can be built into the aforementioned workflow. It does the following:

In Mathematica, you can change the search word and see KWIC change dynamically. The final PDF output requires some pre-defined keywords, however.

– takes the fulltext file and topic modelling data generated by the previous script;
– generates word frequency information and displays it in a word cloud;
– provides Keyword-in-Context of specific words (or, if run from Mathematica, can provide dynamic information as seen at right);
– and visualizes the topic models, in declining order of overall prominence within the WARC file, using sparklines to demonstrate whether each topic is evenly spread throughout the WARC file or concentrated in just a few of the files inside it.
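The word frequency and Keyword-in-Context steps in the list above can be sketched briefly. The following is a minimal illustration in Python rather than the Mathematica the script actually uses; the stop word list is a tiny placeholder, and the names tokenize, word_frequencies and kwic are hypothetical.

```python
import re
from collections import Counter

# Illustrative only: a real run would need a much fuller stop word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count every token that is not a stop word."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


def kwic(text, keyword, width=3):
    """Return each hit for keyword with up to `width` words on either side."""
    tokens = tokenize(text)
    keyword = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines
```

The counts from word_frequencies are what a word cloud would be drawn from, and kwic gives the lines of context for each pre-defined keyword.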

It then generates a long PDF file that you could store alongside the file, or use – with the specific keywords that you’re using – in an attempt to ascertain whether a given WARC file is handy for your research. It also shows how we have moved from large, ungainly WARC files into an area where we can apply text mining tools to them. Click through to see the final output: (more…)

Put this in the “another reason why it’s good to blog” pile. My blog post from last month, which argued that Yahoo! Messages was consciously destroying a fifteen-year swath of history, led Nature magazine to ask if I would submit it in condensed format to their Correspondence section (behind a paywall, unfortunately, but you’ll be able to read it if you’re on an institutional IP connection). Click here to view it.

I submitted it, and after further pruning by their staff (an aside to this aside: great editorial staff, from the contact people to the copy editors), a very short version appeared in today’s issue.

It’s almost certainly the shortest thing I’ve ever written, but I like to think that it’ll get some readers and hopefully encourage more researchers to hop on the digital preservation bandwagon.

Thanks to all my readers who reblogged and retweeted my earlier post, which helped spread it around the Internet.

Time to share some tinkering, in the hopes that somebody, somewhere, might find it helpful. 🙂

Regular readers will know that I’m interested in how historians can approach web archives, as discussed in a three part series in late 2012 (see part one, two, and three). As I’ve stressed, in both tweets and in some draft writing: Historians need to understand web archives, however, as we will be professional end users of these archives. We played a critical role in shaping the modern practice of traditional archiving. Let us make sure that historians are present for the next step. There’s a conversation, but it’s largely amongst people involved in web archiving as creators rather than as users.

[if you want to skip to my code, it’s here]

So here’s me positing a problem: Some web archives do not have descriptions, so you aren’t sure what you’re going to find inside. This includes some just-in-time web saves, like this mirror of the Montreal Mirror’s website. There’s always an item listing, automatically generated, that lets you know what exactly is in the website. When dealing with wide web crawl data, part of that massive 80TB dataset, this is a life saver. Very briefly: Web ARChive files are complicated containers of multiple files – ActiveHistory.ca, for example, is made up of over 18,000 files. That’d choke a file system, but you can turn it into one Web ARChive that you can play with later.
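To make the “container of multiple files” idea concrete, here is a minimal Python sketch that walks through the records in a WARC file and tallies what is inside. It is not the WARC-Tools workflow; it assumes an uncompressed .warc file that follows the standard record layout (a WARC/ version line, named header fields, a blank line, then Content-Length bytes of content), and the function names are hypothetical.

```python
from collections import Counter


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly the declared number of bytes, whatever they contain.
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload


def summarise_warc(stream):
    """Count records by WARC-Type and collect the target URIs in order."""
    types = Counter()
    uris = []
    for headers, _payload in iter_warc_records(stream):
        types[headers.get("WARC-Type", "unknown")] += 1
        if "WARC-Target-URI" in headers:
            uris.append(headers["WARC-Target-URI"])
    return types, uris
```

Opening a file in binary mode, for example open("00016-2013_02_23.warc", "rb"), and passing it to summarise_warc would give a quick count of record types and a list of the URLs captured, which is roughly the kind of item listing a finding aid would otherwise supply.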

Furthermore, these WARC files are too big. Wouldn’t it be nice if you could, at a glance, see what ianmilligan.ca is about without having to read what I’m writing here? (yes, but also imagine if you were just looking at a bunch of visualizations – would be invaluable in a research project)

But for the historian, it’s not terribly useful in and of itself. What if we had a lot of these files, how could we quickly see what related to our topic, and what didn’t? Would we be able to automate it?

Here’s my idea, which I cooked up as a way to learn some more technical skills and invent a tool to help me in my workflow. What if we could take a Web ARChive file (WARC) and then hook it up to a topic modelling visualizer, like Stanford’s Termite [see their paper here]? (more…)