This is the third post in a series that runs from an introduction to WARC files and some early work in Mathematica, through a discussion of WARC Tools, to this one, which builds a full-text index. It’s all part of my attempt to understand WARC files, which I think will be the archive boxes for historians within 10 or 15 years. Edited to add: please refer to part two to see some changes to the code.
It took a bit of time, but I now have a full-text searchable index of a WARC file as well as a finding aid. It’s starting to look feasible to take Internet Archive material and begin running text mining on large quantities of WARC files, without the delays inherent in using Mathematica with WARC files (plus, the full text can then be put into programs as diverse as Mathematica, SEASR, or even incorporated into my Zotero database).
In the last post, we created filtered record files of our WARC files and began to look at the WARC indexes. In this post, I want to bring us forward to the final product. How can we do this? In short, by following the directions here under “Full-text search index.” (Edited to add: I spoke with Thomas Edward Figg via Twitter, who informs me that Mandel is now seeking to incorporate this functionality into their suite!)
I ran into a few issues. First, you need to download Lynx to format the full-text archive. Lynx is an old-school text-only browser, which is great for nostalgia trips – it took me back to Grade 7 library sessions from 1994. Lynx might present a few problems for OS X users: the place one might think to download it (apple.com) actually hosts a version that is only suitable for the old PowerPC architecture. If you accidentally installed that version at some point, as I did, you’ll have to remove it. In your terminal,
which lynx
will give you the path of its install, and
rm [that path]
will delete it (you may need to prepend sudo to give adequate permission). Instead, I used LynxLet.
Download LynxLet and follow the install directions. Then, in your terminal window, make sure that the lynx command calls the program by typing
alias lynx='/Applications/Lynxlet.app/Contents/Resources/lynx/bin/lynx'
Double check that it works.
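One thing worth knowing here: a shell alias only exists in the interactive shell where you define it, so a script cannot see it. As a rough sketch of a check from Python's side, the function below looks for lynx on your PATH and then falls back to the LynxLet location used in the alias above. The name find_lynx and the fallback list are my own illustration, not part of the WARC tools.

```python
import os
import shutil

# Assumed install location, copied from the alias above; adjust if yours differs.
LYNXLET_PATH = "/Applications/Lynxlet.app/Contents/Resources/lynx/bin/lynx"


def find_lynx(extra_candidates=(LYNXLET_PATH,)):
    """Return the path of a usable lynx executable, or None if none is found."""
    # shutil.which searches PATH only; it does not know about shell aliases.
    on_path = shutil.which("lynx")
    if on_path:
        return on_path
    # Fall back to explicit locations such as the LynxLet bundle.
    for candidate in extra_candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Calling find_lynx() and printing the result should tell you which copy of Lynx, if any, a script would be able to reach.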
With Lynx installed, we’re almost ready to build a full-text index. Using the Mandel tools, re-run the commands we did yesterday on the im.warc file (or whatever you’re using):
warcfilter -T response im.warc > filtered.warc
warchtmlindex.py filtered.warc > index.html
Now let’s go a bit further. First, create a directory labeled html in the directory you’re working in. You can do this in the terminal with
mkdir html
or in the Finder if you happen to be in it.
As is, with LynxLet, the default filesdump.py program won’t work. By following the error output, I found the issue. Open record.py, which is in the /warctools subfolder of the Mandel tools, using your Python editor of choice (I use Komodo Edit). Create a backup. Scroll down to line 140, and there you’ll see the call to Lynx. It includes the option
-unique_url
which LynxLet does not support. Once you eliminate that, we should be good to go.
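If your copy of record.py does not match mine line for line, something like the following might help you make the backup and find the right spot. It is only a sketch: the function name backup_and_find is made up for illustration, and it assumes the option appears in the file as the literal text -unique_url. It changes nothing in the file itself, so the actual edit is still yours to do by hand.

```python
import shutil


def backup_and_find(path, needle="-unique_url"):
    """Copy path to path + '.bak', then list the lines that mention needle.

    Returns a list of (line_number, line_text) pairs, numbered from 1.
    """
    # Keep an untouched copy next to the original before any editing.
    shutil.copy2(path, path + ".bak")
    hits = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits


# Example (hypothetical path; point it at your own copy of record.py):
# for number, line in backup_and_find("warctools/record.py"):
#     print(number, line)
```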
Save, return to your terminal, and run the following command (replacing my path below with wherever you’ve got your WARC tools).
python /USER/warc/warc-tools-mandel/filesdump.py filtered.warc
If it works, you’ll see a long list of html files similar to left. These are the local files of your WARC archive. You can now open up two main files that you’ll see. First, you can open up a file named fulltext.html – this has the FULL TEXT of your WARC archive. From here, you can do your usual textual analysis stuff – whether it’s in Python a la the Programming Historian 2, Mathematica, SEASR, whatever. It’s all together and we can start learning neat things about the past.
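As a small taste of the "usual textual analysis stuff," here is a minimal word-frequency sketch in Python. It assumes fulltext.html sits in your working directory and is mostly plain Lynx-formatted text. The function name top_words and its cutoffs are my own choices for illustration, not anything the WARC tools provide.

```python
import re
from collections import Counter


def top_words(text, n=10, min_length=4):
    """Return the n most common words in text that have at least min_length letters."""
    # Lowercase everything and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Dropping very short words is a crude stand-in for a stop-word list.
    counts = Counter(word for word in words if len(word) >= min_length)
    return counts.most_common(n)


# Example (assumes fulltext.html is in the current directory):
# with open("fulltext.html", encoding="utf-8", errors="replace") as handle:
#     print(top_words(handle.read()))
```

A real analysis would want a proper stop-word list and better tokenizing, but this is enough to see what a collection is talking about.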
Additionally, that index.html file from earlier now works. Click on any of the files and you’ll be brought to a full text version of the page. They’re all formatted like they would be in Lynx, which is actually very readable.
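Since the pages are now just readable text, searching them takes only a few lines. The sketch below assumes the dumped pages ended up in the html directory created earlier (check where yours actually landed). The function name search_pages is hypothetical, and the search is a plain case-insensitive substring match rather than anything clever.

```python
import os


def search_pages(directory, term):
    """Return the names of files in directory whose text contains term."""
    term = term.lower()
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip any subdirectories
        with open(path, encoding="utf-8", errors="replace") as handle:
            if term in handle.read().lower():
                matches.append(name)
    return matches


# Example (assumes the pages live in ./html):
# print(search_pages("html", "archive"))
```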
So there you have it. From WARC file to full text searchable index. Good luck! I put all this up in case somebody else wants to do it and runs into the same issues that I did.