Finding Specific Websites and Creating Full-Text Indexes Using WARC Tools

Some of my posts, like this one, are going to be a bit ‘boring.’ That’s all part of working in public: a lot of what historians do is actually kind of dull. But these are baby steps towards something, and hopefully, if somebody wanted to, they could follow in my footsteps or build upon this.

Over the weekend, I downloaded ~100GB of specific WARC files pertaining to the Canadian (dot ca) World Wide Web. I decided to pick the “top ten” CDX files, and each of those CDX files covered ten individual compressed WARC files amounting to ~1GB each.

The web archive data, however, was not aggregated by domain (or by any other single characteristic). As an Internet Archive staffer noted to me in an e-mail, these “WARCs contain data exactly as it was crawled.” Even in the WARC files containing the largest percentage of .ca websites, those sites may still account for only roughly 20-30% of the container. We want to get rid of the other stuff (a fair amount of spam and pornography, if my initial MALLET test was accurate).

I’m mindful, however, that I want to keep everything. Who knows if I may want to expand things down the line? So what I’m doing right now is running this ~100GB scrape of the World Wide Web through WARC Tools. On a MacBook Retina it should take about a day and a half, if my estimates are correct.

For each file then, I use WARC Tools (which I’ve written about before) to: [bash code here]
– create a filtered WARC file;
– create a searchable full-text index of the entire WARC file;
– and then create an aggregate full-text file of the entire archive, broken down by one website per line (some 45k per 1GB file).

The next step is to take that last file and, using Mathematica, go through it and select all domains that contain .ca. In this way, we begin to get a somewhat manageable dataset. A 1GB WARC reduces to ~155MB of plain text after being run through Lynx, and reduces even further, to around 60-70MB, once only the .ca domains are kept. We can then run it through Mathematica and use our regular bag of tricks to find out what we’ve got in these containers!
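I do this selection in Mathematica, but for anybody following along without it, here is a rough Python sketch of the same idea. It assumes that each line of the aggregate file begins with the site’s URL, followed by whitespace and then the page text; that layout is my assumption for illustration, so check it against your own output. The function names are made up for this example. It also tests whether the hostname ends in .ca rather than whether the line merely contains “.ca”, so that something like example.com/canada.ca.html is not picked up by mistake.

```python
from urllib.parse import urlsplit


def is_dot_ca(url: str) -> bool:
    """Return True when the URL's hostname ends in the .ca top-level domain."""
    # urlsplit only recognises a hostname when a scheme or '//' is present,
    # so bare entries such as 'example.ca/page' get a '//' prefix first.
    if "//" not in url:
        url = "//" + url
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        # Malformed URLs (e.g. broken IPv6 brackets) are simply skipped.
        return False
    return host.rstrip(".").endswith(".ca")


def filter_dot_ca(lines):
    """Yield only the lines whose first whitespace-separated field is a .ca URL.

    Assumed (hypothetical) line layout: '<url> <page text...>'.
    """
    for line in lines:
        fields = line.split(None, 1)
        if fields and is_dot_ca(fields[0]):
            yield line


if __name__ == "__main__":
    # Made-up sample lines standing in for the aggregate full-text file.
    sample = [
        "http://www.example.ca/about Some page text here",
        "http://example.com/canada.ca.html Not a Canadian domain",
        "http://shop.example.ca:8080/ More text",
    ]
    for kept in filter_dot_ca(sample):
        print(kept)
```

Because `filter_dot_ca` is a generator, you can pass it an open file handle and write the kept lines out one at a time, which matters when the plain-text file runs to ~155MB.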
