Our project team uses a number of languages: Scala with warcbase, lots of shell commands when manipulating and analyzing textual data (especially social media, as Nick and I wrote about here), and Mathematica when we want to leverage the power and relative simplicity of that language.
William J. Turkel and I have been working a bit on getting WARC files to play with Mathematica. For larger numbers of files, warcbase is still the solution. But for a small collection – say a few WARCs created with webrecorder.io – this might be a lighter-weight approach. Indeed, I can see myself doing this if I went out around the web with WebRecorder, grabbed some sites (say public history sites or the like), and wanted to do some analysis on it.
Bill and I developed this together: he cooked up the record to association bit (which is really the core of this code), and I worked on getting us to be able to process entire WARCs and generate some basic analysis. It was also fun getting back into Mathematica, after living in Scala and Bash. Continue reading “Reading WARC Records with Mathematica”