This is the second part in a series dealing with Internet Archive WARC files. The first introduced WARC files and why I think they matter. This post introduces WARC Tools, and the third moves from this into a discussion of how to create a full-text searchable database.
Edited: WARC Tools has changed since I wrote this, so the post now works off of the version of WARC Tools hosted on my GitHub at https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel. I would download that (you can clone my repo if you want and pick that folder out). When using this version, I just point to the scripts manually, so commands look like:
python /Users/ianmilligan1/desktop/historian-warc-1/warc/warc-tools-mandel/filesdump.py filtered.warc
Any questions, don’t hesitate to let me know.
If anybody doubts the merits of blogging your ongoing research, this morning I woke up to an e-mail from the lead programmer of warc-tools, a Python library for working with WARC files. And with that, I now have another tool for my digital toolkit. In this quick post, I want to introduce the toolkit, how to quickly install it, and show off some of the things it can do.
Again, remember, I’m advancing the thesis that historians, sooner than we know it, will probably have to start dealing with WARC files as our new archives. Born-digital sources are incredible resources, but right now we can’t access them easily (well, I’m sure I could go down to the library and beg for help, but again, I want to conserve that for when I really hit a wall).
Setting it up
Firstly, you’ll need Python 2.6 or later in the 2.x line (I’m using Python 2.7, but as with many DH tools I suspect the main thing is not to use Python 3). Luckily, the good folks (well, I’m now an editor-at-large there, so I guess it’s us now?) over at the Programming Historian 2 have a lesson on setting it up and everything you’ll need to get started on that front. If you’re on a newer Mac at least, you should already have Python set up, which you can double-check from your terminal window.
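If you would rather check from inside Python itself, here is a minimal sketch. The function name looks_compatible is my own invention, and the "2.6 or 2.7" rule is just my reading of the requirement above, not something taken from the WARC Tools documentation.

```python
import sys


def looks_compatible(version_info):
    """Return True for Python 2.6 or 2.7, the range this post assumes."""
    major, minor = version_info[0], version_info[1]
    return major == 2 and minor >= 6


if __name__ == "__main__":
    print("Running Python %d.%d" % (sys.version_info[0], sys.version_info[1]))
    if looks_compatible(sys.version_info):
        print("This interpreter should be fine for WARC Tools.")
    else:
        print("WARC Tools may not run here; look for a Python 2.6 or 2.7 install.")
```

Save it under any name you like and run it with the same python command you plan to use for the tools.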
Download the WARC Tools here, by clicking the download tab on the right-hand side. Some documentation is here at the wiki, but in short, the default install on my new laptop already had the necessary dependencies. Put the libraries into a folder, navigate to it in your terminal, and run the following commands:
./setup.py build
sudo ./setup.py install
You’ll need to provide your password after running the second command (sudo asks for it), to confirm that you’re authorized to install this.
Already, you’ve got a toolkit of commands that’ll work from anywhere!
Now, I actually found that I wanted a few more commands, so I headed over to MandalWeb and found a “slightly extended version” of the commands here. They have one command that I find especially useful, which I’ll show you in a second. To install these, the process is similar: build, install (with sudo) and you’ll be good to go. It all seems to be working now, although my bash-fu is a bit limited (I’m one of those StackExchange kind of self-taught type people).
In my previous post, I was using Mathematica to briefly unpack and look at WARC files. I found this useful, but last night was running into problems of scaling: WARC files are big, and it takes Mathematica a fairly substantial amount of time to even extract the plaintext from them. For a bigger project, we’d really need to find the specifically relevant information quickly. For this, these tools will be invaluable.
Let’s see what we can do here.
In your terminal, first of all, try typing warcindex on its own, with no arguments:
You should see this:
Usage: warcindex [options] warc warc warc
warcindex: error: no imput warc file(s)
This shows that you’re good to go. So let’s try this out on a WARC file. Using wget, I created a few different WARC archives (see my previous post). Let’s try it out on my own website:
Even for this small website, your terminal window will quickly fill up! But you’re seeing everything that’s inside this WARC package. Very handy for getting a sense of what’s there. Luckily we can format this a little bit better! If we wanted to access it in CSV, we can do so:
warcindex im.warc > index.csv
Still not the most useful, but we’re now getting a record of what we have here.
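To get a feel for what an index like this is summarizing, here is a rough sketch that walks a WARC file by hand and lists each record’s type and URL. This is not how warcindex works internally; it is my own minimal reading of the WARC format, and it assumes an uncompressed file with simple one-line headers. The names iter_warc_headers and make_record are hypothetical, and the sample records are made up so the example runs on its own.

```python
import io


def iter_warc_headers(stream):
    """Yield one dict of header fields per record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of file
        if not line.strip():
            continue                    # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % (line,))
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break                   # a blank line (or end of file) ends the headers
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the record body using its declared length.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def make_record(warc_type, uri, body):
    """Build a tiny made-up WARC record, only so the example is self-contained."""
    head = ("WARC/1.0\r\n"
            "WARC-Type: %s\r\n"
            "WARC-Target-URI: %s\r\n"
            "Content-Length: %d\r\n"
            "\r\n") % (warc_type, uri, len(body))
    return head.encode("utf-8") + body + b"\r\n\r\n"


sample = (make_record("request", "http://example.org/", b"GET / HTTP/1.1\r\n\r\n")
          + make_record("response", "http://example.org/",
                        b"HTTP/1.1 200 OK\r\n\r\n<html></html>"))

# For a real file you would pass open("im.warc", "rb") instead of the BytesIO.
records = list(iter_warc_headers(io.BytesIO(sample)))
for headers in records:
    print("%s %s" % (headers["WARC-Type"], headers.get("WARC-Target-URI", "-")))
```

Because it only reads headers and skips each body by its Content-Length, a loop like this stays quick even on a large archive, which is the same reason a dedicated index is so much faster than unpacking everything.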
Luckily, with the extended tools we have some cool help to draw on. Let’s make an HTML index of the documents.
warcfilter -T response im.warc > filtered.warc
This creates a filtered file, getting rid of the records created by the web crawler (which explains why when I was doing file dumps of the WARC, we were getting a lot of wget data). Then, here’s a good command:
warchtmlindex.py filtered.warc > index.html
When we open up this HTML file, we then have an index of the WARC file.
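If you are curious what building such a page involves, here is a small sketch of the general idea: take a list of URLs and content types and write them out as an HTML list of links. It is not the code inside warchtmlindex.py. The function build_html_index and the two sample entries are assumptions of mine for illustration.

```python
try:
    from html import escape          # Python 3
except ImportError:
    from cgi import escape           # Python 2


def build_html_index(entries, title="WARC index"):
    """Return an HTML page listing (uri, content_type) pairs as links."""
    rows = []
    for uri, content_type in entries:
        safe_uri = escape(uri, True)  # True also escapes double quotes, needed inside href
        rows.append('<li><a href="%s">%s</a> (%s)</li>'
                    % (safe_uri, safe_uri, escape(content_type, True)))
    return ("<html><head><title>%s</title></head><body>\n<ul>\n%s\n</ul>\n</body></html>"
            % (escape(title, True), "\n".join(rows)))


# Made-up entries; in practice these would come from the records in the WARC file.
entries = [
    ("http://example.org/", "text/html"),
    ("http://example.org/search?q=a&b=1", "text/html"),
]
page = build_html_index(entries)
print(page)
```

Escaping the URLs matters more than it looks: archived pages are full of ampersands and angle brackets, and an unescaped one can quietly break the index page.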
Note that there are hyperlinks to each file – currently they go nowhere. In the next post, I will hopefully take us through the full-text indexing – which would let us go right from WARC to individual files! Right now, it’s been taking the better part of an hour to get this working and random end-of-term work is calling.