After a great day yesterday at Code4Lib North, and after playing with some of Nick Ruest’s WARC files from the ‘Free Dale Askey’ collection, I’ve put everything together into a pair of bash and Mathematica scripts.
I’ll be playing with some of this stuff in two weeks at the Canadian Historical Association’s annual meeting in Victoria, in my presentation entitled “The Internet Archive and Social Historians: Challenge and Potential Amidst the WebARChive Files.” Slides and text’ll be up later, and I will also make the full-text paper available to interested parties.
I see a potential use for this in collections of WARC files where there are no finding aids – just a bunch of files – which I think is relatively common in ‘just-in-time’ grabs and other case studies. In the best case, I think the script used for the Dale Askey collection is great – each file has an attached screenshot and PDF in addition to the WARC – but this approach has a role in cases where little information is attached (i.e. it is more useful for cases like this one than this one, which has more data).
Here’s how it works. Unfortunately, you need Mathematica.
– All-Together-MMA.sh: Takes a WARC file passed to it on the command line, generates the full text, topic-models it, and then invokes…
– WARC-to-Analysis-single-file.m: a Mathematica script that generates the PDF file discussed in the last post.
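As a rough sketch of how the driver hangs together – the three stage functions below are echo stubs I’ve made up, standing in for the actual extraction, topic-modelling, and Mathematica commands, which will vary by setup:

```shell
#!/bin/bash
# Sketch of the All-Together-MMA.sh stages. All three functions are echo
# stubs standing in for the real commands.
extract_text()    { echo "extracting full text from $1"; }
topic_model()     { echo "topic modelling $1"; }
run_mathematica() { echo "building PDF report for $1"; }

WARC="${1:-example.warc.gz}"   # placeholder filename, not a real collection file
BASE="${WARC%.warc.gz}"

extract_text "$WARC"           # 1. WARC -> plain text
topic_model "$BASE.txt"        # 2. plain text -> topics
run_mathematica "$BASE"        # 3. text + topics -> PDF via WARC-to-Analysis-single-file.m
```

The point is just the shape: each stage writes something the next stage reads, and the Mathematica script comes last.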
How to run it:
On the command line, once the script has been made executable (otherwise prepend sh):
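Something like this – the WARC filename is a placeholder, and the first two lines just create a stub stand-in for the script so the invocation is self-contained here (in real use the actual All-Together-MMA.sh is already in the working directory):

```shell
# Stub stand-in so the call below runs; skip these two lines in real use.
printf '#!/bin/bash\necho "running pipeline on $1"\n' > All-Together-MMA.sh
chmod +x All-Together-MMA.sh

./All-Together-MMA.sh example.warc.gz   # "example.warc.gz" is a placeholder filename
```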
This takes the WARC file – one from the Dale Askey collection – and runs it through the script. With the proper directories set in the files, it generates the output shown above in one step. A big benefit is that I can now automate this across a ton of WARC files.
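Since the whole pipeline hangs off one script, batching it is just a loop over a directory. A sketch – the `touch` lines create empty placeholder files so the glob has something to match, and the real invocation is commented out in favour of an echo so nothing here assumes Mathematica is installed:

```shell
#!/bin/bash
# Placeholder files standing in for a directory of collection WARCs.
touch sample-a.warc.gz sample-b.warc.gz

for warc in *.warc.gz; do
    echo "would run: ./All-Together-MMA.sh $warc"
    # Real use: ./All-Together-MMA.sh "$warc"
done
```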
Work to do:
– Need to refine stop words
– Topic models are set up for large corpora, so running 50 topics on a single page is overkill.
– Sparklines are set up for large corpora as well, so the output looks odd on a single page – but it can still be moderately useful.
– Integrate with AlchemyAPI for sentiment analysis? Multiple KWICs?