WARC Files, Part Three: Building a Full-Text Index

This is the third post in a series that began with an introduction to WARC files and some early work in Mathematica, continued with a discussion of WARC Tools, and now turns to building a full-text index. It’s all part of my attempt to understand WARC files, which I think will be the archive boxes for historians within 10 or 15 years. Edited to add: please refer to part two to see some changes to the code.

A Full Text WARC file, formatted in Lynx, with good metadata.

It took a bit of time, but I now have a full-text searchable index of a WARC file as well as a finding aid. It’s starting to look feasible to take Internet Archive material and begin running text mining on large quantities of WARC files, without the delays inherent in using Mathematica with WARC files (plus, the full text can then be put into programs as diverse as Mathematica, SEASR, or even incorporated into my Zotero database).

In the last post, we created filtered record files of our WARC files and began to look at the WARC indexes. In this post, I want to bring us forward to the final product. How can we do this? In short, by following the directions here under “Full-text search index.” (Edited to add: I spoke with Thomas Edward Figg via Twitter, who informs me that Mandel is now seeking to incorporate this functionality into their suite!)

I ran into a few issues. First, you need to download Lynx to format the full-text archive. Lynx is an old-school text-only browser, which is great for nostalgia trips; it took me back to Grade 7 library sessions from 1994. Lynx might present a few problems for OS X users: the place one might think to download it (apple.com) actually hosts a version that is only suitable for the old PowerPC architecture. If you did accidentally install it at any point, as I did, you’ll have to remove it. In your terminal, which lynx will give you the path of its install, and rm [that path] will delete it; you may need to prepend sudo for adequate permission. Instead, I used LynxLet.

Download LynxLet and follow the install directions. In your terminal window, make sure that the command lynx calls the program by typing alias lynx='/Applications/Lynxlet.app/Contents/Resources/lynx/bin/lynx'. Double-check that it works.
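If you would rather double-check from Python than by eye, here is a minimal sketch. The helper name find_lynx is my own invention, and the fallback path is simply the one from the alias above. A shell alias only exists inside the shell where it was typed, so this looks for a lynx executable on the PATH first and then at the LynxLet location.

```python
import os
import shutil

# Path copied from the alias above; adjust it if LynxLet lives elsewhere.
LYNXLET_BINARY = "/Applications/Lynxlet.app/Contents/Resources/lynx/bin/lynx"


def find_lynx(candidates=(LYNXLET_BINARY,)):
    """Return a path to a lynx executable, or None if none is found.

    Hypothetical helper: checks the PATH first, then each candidate path.
    """
    on_path = shutil.which("lynx")
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print(find_lynx() or "lynx not found")
```

If this prints "lynx not found", the alias may still work in your interactive terminal, but it is worth knowing that nothing outside that shell session can see it.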

With Lynx installed, we’re almost ready to build a full-text index. Using the Mandel tools, re-run the commands we ran yesterday on the im.warc file (or whatever you’re using):

warcfilter -T response im.warc > filtered.warc
warchtmlindex.py filtered.warc > index.html
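To make the filtering step less of a black box, here is a rough sketch of what keeping only response records means. I am assuming that -T response selects records whose WARC-Type header is response. This is not how warcfilter is implemented; it is a toy reader of my own (iter_warc_records, only_type and make_record are made-up names) that only handles uncompressed WARC data, where each record is a WARC/ version line, some header lines, a blank line, and then a body of Content-Length bytes.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or end of file) ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        body = stream.read(int(headers.get("content-length", 0)))
        yield headers, body


def only_type(records, wanted="response"):
    """Keep the records whose WARC-Type header equals `wanted`."""
    return [(h, b) for h, b in records if h.get("warc-type") == wanted]


def make_record(warc_type, body):
    """Build a tiny, hand-made WARC record for demonstration purposes."""
    head = "WARC/1.0\r\nWARC-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        warc_type, len(body))
    return head.encode("ascii") + body + b"\r\n\r\n"


sample = (make_record("request", b"GET / HTTP/1.1")
          + make_record("response", b"HTTP/1.1 200 OK"))
kept = only_type(iter_warc_records(io.BytesIO(sample)))
print(len(kept), kept[0][0]["warc-type"])
```

The sample data here is hand-made, not taken from im.warc. For real archives, stick with warcfilter.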

Now let’s go a bit further. First, create a directory named html in the directory you’re working in. You can do this in the terminal with mkdir html, or in Finder if you happen to be in it.

As is, with LynxLet, the default filesdump.py program won’t work. By following the error output, I found the issue. Open record.py, which is in the /warctools subfolder of the Mandel tools, using your Python editor of choice (I use Komodo Edit). Create a backup. Scroll down to line 140, and there you’ll see the call to Lynx. It includes the option -unique_url, which LynxLet does not support. Once you eliminate that, we should be good to go.
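If you want a safety net before editing, the following sketch makes the backup for you and reports every line that mentions the option, so you can confirm you are looking at the right spot. The function name backup_and_find and the .bak suffix are my own choices, not anything from the tools. It deliberately does not edit the file, since I don't want to guess at exactly how the option is written on that line.

```python
import shutil


def backup_and_find(path, needle="-unique_url"):
    """Copy `path` to `path + '.bak'`, then list the lines that contain `needle`.

    Returns a list of (line_number, line_text) pairs; the file itself is not changed.
    """
    shutil.copy2(path, path + ".bak")
    hits = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

Call it with the path to your own copy of record.py, for example backup_and_find("warctools/record.py") if you are sitting in the tools folder, and then remove the option by hand at the line numbers it reports.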

Save, return to your terminal, and run the following command (replacing my path below with wherever you’ve got your WARC tools).

python /USER/warc/warc-tools-mandel/filesdump.py filtered.warc
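If you end up repeating these steps for many WARC files, the commands in this post can be strung together from Python. This is only a sketch: build_steps and run_steps are names I made up, the default paths are the ones used in this post, and it assumes warcfilter and warchtmlindex.py can be called from your terminal exactly as written above.

```python
import os
import subprocess


def build_steps(warc="im.warc",
                filesdump="/USER/warc/warc-tools-mandel/filesdump.py"):
    """Return (argv, stdout_file) pairs mirroring the commands in this post."""
    return [
        (["warcfilter", "-T", "response", warc], "filtered.warc"),
        (["warchtmlindex.py", "filtered.warc"], "index.html"),
        (["python", filesdump, "filtered.warc"], None),
    ]


def run_steps(steps):
    """Create the html directory, then run each step, stopping on the first failure."""
    os.makedirs("html", exist_ok=True)  # same as the mkdir html step
    for argv, stdout_file in steps:
        if stdout_file is None:
            subprocess.run(argv, check=True)
        else:
            # Equivalent of the shell's "> file" redirection.
            with open(stdout_file, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Calling run_steps(build_steps("im.warc")) in your working directory should then do the same thing as typing the commands by hand.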

If it works, you’ll see a long list of html files. These are the local files of your WARC archive. There are two main files you can now open. First, there is a file named fulltext.html, which has the FULL TEXT of your WARC archive. From here, you can do your usual textual analysis stuff, whether it’s in Python a la the Programming Historian 2, Mathematica, SEASR, whatever. It’s all together and we can start learning neat things about the past.
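As a first taste of that textual analysis, here is a small word-frequency sketch in Python. The function names (word_counts, top_words) are mine, and it assumes fulltext.html is mostly Lynx-rendered plain text; any markup left in the file would simply be counted as words, so treat the results as a rough first look.

```python
import re
from collections import Counter


def word_counts(text, min_length=3):
    """Count lowercase words of at least `min_length` letters in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if len(w) >= min_length)


def top_words(path="fulltext.html", n=20):
    """Return the `n` most common words in the file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return word_counts(handle.read()).most_common(n)
```

Running top_words() in the directory that holds fulltext.html gives a list of (word, count) pairs. A stop-word list would be the obvious next refinement.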

Additionally, that index.html file from earlier now works. Click on any of the files and you’ll be brought to a full text version of the page. They’re all formatted like they would be in Lynx, which is actually very readable.

So there you have it: from WARC file to full-text searchable index. Good luck! I put all this up in case somebody else wants to do it and runs into the same issues that I did.
