WARC Analysis Using Mathematica

This program creates a PDF like this from a WARC file, building on previous work.

My earlier workflow took a WebARChive (WARC) file (or generated one based on a website) and, using an older version of WARC-Tools, generated a full-text index. It finished by generating an output in the Stanford Termite browser. Given the size of these archives, and the sheer amount of text within them, I believe that these visualizations help us ‘see through the box,’ as it were, and ascertain their relevance to our research topics.

I also have the ulterior motive of getting historians more involved in this topic (there are certainly a few already working in the field). We were around to design the first generations of archival boxes (historians were fundamental in the early days of the archival profession), and I want us around as we tackle the second generation: the web archive box.

Today, using Mathematica, I developed my script further. It can be called from the command line if you have Mathematica, and so can be built into the aforementioned workflow. It does the following:

In Mathematica, you can change the search word and see KWIC change dynamically. The final PDF output requires some pre-defined keywords, however.

– takes the full-text file and topic modelling data generated by the previous script;
– generates word frequency information and displays it in a word cloud;
– provides Keyword-in-Context (KWIC) for specific words (or, if run from within Mathematica, can provide dynamic information as seen at right);
– and visualizes the topic models, in declining order of overall prominence within the WARC file, using sparklines to show whether each topic is evenly spread throughout the archive or concentrated in just a few files.
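
For readers without Mathematica, the steps above can be sketched roughly in Python. This is only an illustration of the techniques (word counts, Keyword-in-Context, and a text sparkline), not the Mathematica script itself. The function names, the sample text, and the per-file topic weights are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text, top_n=10, stopwords=frozenset()):
    """Return the top_n (word, count) pairs, ignoring stopwords.

    These counts are the raw material a word cloud would be drawn from.
    """
    tokens = [t for t in tokenize(text) if t not in stopwords]
    return Counter(tokens).most_common(top_n)


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    `width` is the number of words kept on each side of the keyword.
    """
    tokens = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def sparkline(values):
    """Render a sequence of non-negative weights as a one-line text sparkline."""
    bars = "▁▂▃▄▅▆▇█"
    top = max(values)
    if top <= 0:
        return bars[0] * len(values)
    # Scale each value to an index between 0 and 7 (truncating).
    return "".join(bars[int(v / top * (len(bars) - 1))] for v in values)


if __name__ == "__main__":
    # Hypothetical stand-in for the full-text file extracted from a WARC.
    sample = "The archive holds web pages. The web archive is large."
    print(word_frequencies(sample, top_n=3, stopwords={"the", "is"}))
    for left, word, right in kwic(sample, "web", width=2):
        print(f"{left:>20} [{word}] {right}")
    # Hypothetical weights of one topic across five files in the archive.
    print(sparkline([0.1, 0.8, 0.05, 0.0, 0.4]))
```

A flat sparkline suggests a topic that runs through the whole archive, while a few tall bars suggest it is confined to a handful of files.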

It then generates a long PDF file that you could store alongside the WARC file, or use, with your own keywords, to ascertain whether a given WARC file is handy for your research. It also shows how we have moved from large, ungainly WARC files into an area where we can apply text mining tools to them. Click through to see the final output:

Download trial-1 here, or see below for a graphical version. The code is on GitHub:

trial-1

Work still remains to be done, of course, especially on cleaning up the text that’s going into the system. That’s for tomorrow, however.

If anybody has anything they think would be a good addition, or other things you would like to see, feel free to comment, e-mail, or tweet me.
