My earlier workflow took a WebARChive (WARC) file (or generated one based on a website) and, using an older version of WARC-Tools, generated a full-text index. It finished by generating an output in the Stanford Termite browser. Given the size of these archives, and the sheer amount of text within them, I believe that these visualizations help us ‘see through the box,’ as it were, and ascertain their relevance for our research topics.
I also have the ulterior motive of getting historians more involved in this topic (there are certainly a few already working in the field) – we were around to design the first generations of archival boxes (historians were fundamental in the early days of the archival profession), and I want us around as we tackle the second generation: the web archive box.
The new script:
– takes the full-text file and topic modelling data generated by the previous script;
– generates word frequency information and displays it in a word cloud;
– provides Keyword-in-Context views of specific words (or, if run from Mathematica, can provide dynamic information as seen at right);
– and visualizes the topic models, in declining order of overall prominence within the WARC file, using sparklines to show whether each topic is evenly spread throughout the WARC file or concentrated in just a few of its files.
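To make these steps concrete, here is a minimal Python sketch of the word-frequency, Keyword-in-Context, and sparkline ideas. It is not the script described in this post, which runs in Mathematica; the function names, the simple regex tokenizer, and the text-based sparkline are all assumptions made for illustration, and it draws no word cloud.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed, naive tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(text, top_n=10):
    """Return the top_n most common tokens with their counts."""
    return Counter(tokenize(text)).most_common(top_n)


def keyword_in_context(text, keyword, width=2):
    """Return each occurrence of keyword with `width` tokens on either side."""
    tokens = tokenize(text)
    keyword = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [token] + right))
    return hits


def sparkline(weights):
    """Map non-negative weights (e.g. one topic's share per file) onto block characters."""
    bars = "▁▂▃▄▅▆▇█"
    top = max(weights, default=0)
    if top <= 0:
        # All-zero (or empty) input: draw a flat line of the lowest bar.
        return bars[0] * len(weights)
    # Scale each weight relative to the largest one.
    return "".join(bars[int(w / top * (len(bars) - 1))] for w in weights)
```

Given the full text as a string, `word_frequencies(text)` would supply the counts behind a word cloud, `keyword_in_context(text, "archive")` would list each hit with its neighbouring words, and `sparkline` applied to one topic's per-file weights gives a quick sense of whether that topic is spread evenly or bunched up in a few files.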
It then generates a long PDF file that you could store alongside the WARC file, or use – with the specific keywords that you’re interested in – to ascertain whether a given WARC file is handy for your research. It also shows how we have moved from large, ungainly WARC files to a point where we can apply text mining tools to them. Click through to see the final output:
Work still remains to be done, of course, especially on cleaning up the text that’s going into the system. That’s for tomorrow, however.