Search Engine on the WARC Collection: An Update on my Project

Well, just in time for vacation to start this evening (I will be off work until the following Monday, 12 August – not that I have a good track record of actually taking time completely off), I managed to get a Solr search engine up and running on my collection of somewhere between 3% and 5% of the Canadian internet (the range is wide for reasons I’ll get into later).

In this post, I want to show what we can do, and then briefly document how I got here.

Again, this is my first pass through the data. I’m going to do it all again, with lessons learned, later this month. It’s also purely exploratory at this point: what can a historian do with a massive Big Data set like the 80TB Wide Scrape (discussed in my last blog post)?

[Screenshot: Solr running on WARC plain text files]

What can we do now?

We can submit simple queries through the standard Solr dashboard, as pictured above. Right now we’re limited to just URL names and text, but as I’m working out of XML files there’s the possibility of setting up regular expressions to extract other standardized data from them.

[Screenshot: Snippet from the Carrot interface, query ‘Waterloo’]

We can also submit queries through the Carrot workbench. Clustering isn’t up and running yet (that will have to wait until I’m back in the office), but this lets us click right through to the full text document and see what it contains.

Searching for ‘Waterloo’ brings up job advertisements, math contests, and essay mills. If we had multiple scrapes, we could establish change over time, especially with respect to the economy.

So in short, we’ve taken massive WARC.GZ files and turned them into something we can play with. The biggest shortcoming of the Internet Archive for some of my work is that it doesn’t have full text search, and you can’t do that much data mining on it. Well, through these files – you can!

This is a good way to kick off the vacation, knowing that in a few months the dataset will really start to come together. And that’s before spending more than $1,000 of a (secret and still unannounced, although I have the money) grant on this stuff!

How did I get here?

I’ve been documenting my research flow, so why not put it all up here in one place.

Step One: Downloaded the Digital Finding Aid, Which Was a Set of CDX Files

Basically, I ran this command, where wide00002.txt refers to a list of all the Wide Scrape files found on the Internet Archive (see the bulk downloading with wget lesson on the IA blog for more info):

wget -r -H -nc -np -nH --cut-dirs=2 -A .cdx.gz -e robots=off -l1 -i ./wide00002.txt -B 'http://archive.org/download/'

This resulted in 111,690 files and 349GB of plain text, a record of every URL contained within the 80TB wide scrape.

Step Two: Found the Canadian Domains

I blogged about this here, but as a refresher, the CDX files were full of entries like this:

za,co,ireneguesthouse)/robots.txt 20110815151754 http://www.ireneguesthouse.co.za/robots.txt text/html 404 AL3MTEPBMLJDH2EWR6LJ5XKO3HSY6XSQ - - 952 47026507 WIDE-20110815150840-crawl415/WIDE-20110815150840-00471.warc.gz

This is a .za (South African) domain. So I decided to count how many entries in each CDX file fell under .ca.
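For anyone curious, the counting itself is only a few lines. Here is a rough sketch rather than my exact program; it assumes the .cdx.gz files sit under the current directory and writes the tallies out as a CSV (Python):

import csv
import gzip
import os

# Walk the downloaded directories, tally the .ca entries in each CDX file,
# and write "path,count" rows to a CSV we can sort later.
with open("ca_counts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for root, dirs, files in os.walk("."):
        for name in files:
            if not name.endswith(".cdx.gz"):
                continue
            path = os.path.join(root, name)
            count = 0
            with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as cdx:
                for line in cdx:
                    # The first field is the SURT-ordered URL, e.g.
                    # "ca,uwaterloo)/about" -- Canadian entries start with "ca,".
                    if line.split(" ", 1)[0].startswith("ca,"):
                        count += 1
            writer.writerow([path, count])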

The output from my own program looked like this:

WIDE-20110309002853-crawl337/WIDE-20110309002853-crawl337.cdx.gz,1313
WIDE-20110309005125-crawl338/WIDE-20110309005125-crawl338.cdx.gz,1084
WIDE-20110309035435-crawl338/WIDE-20110309035435-crawl338.cdx.gz,657
WIDE-20110311214209-crawl338/WIDE-20110311214209-crawl338.cdx.gz,184
WIDE-20110311214225-crawl337/WIDE-20110311214225-crawl337.cdx.gz,2111

And so on. I sorted this list by the second number (the count of .ca entries), so I could establish a good case study from these files. They corresponded to WARC files, so I could again turn to wget and grab them.

Step Three: Downloaded WARC Files with Generous Amounts of Canadian Content

So with the file identifiers in hand, I could run wget again and pull down about 100GB of data. That’s still too much for me to handle in raw form, especially as a (for now) one-person research team working on a laptop. Plus, I’m a text analysis scholar.
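If it helps, here is a sketch of how the sorted counts could be turned back into an input list for wget (the file names are hypothetical; the item identifiers come from the directory part of the CDX paths shown above):

import csv

# Read the "path,count" rows from the counting step, keep the items with the
# most .ca entries, and write archive.org download URLs for wget's -i flag.
with open("ca_counts.csv") as f:
    rows = [(path, int(count)) for path, count in csv.reader(f)]

rows.sort(key=lambda row: row[1], reverse=True)
# The item identifier is the directory holding each CDX file.
top_items = [path.split("/")[-2] for path, _ in rows[:20]]

with open("warc_items.txt", "w") as out:
    for item in top_items:
        out.write("http://archive.org/download/%s/\n" % item)

# Then something like the Step One command, swapping .cdx.gz for .warc.gz:
#   wget -r -H -nc -np -nH --cut-dirs=2 -A .warc.gz -e robots=off -l1 -i ./warc_items.txt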

Step Four: Plain Text

Remember WARC Tools? I’ve blogged about it before and put some stuff up on GitHub.

Basically, I took these WARCs and ran them through Lynx (via those tools). A 1GB WARC file, the basic building block of this scrape, reduces on average to around 150MB when it’s turned into just the readable text.

This reduces our 100GB collection to 15.93 GB. Becoming manageable now!

This is still mostly non-Canadian content, though. So the next step was to use a regular expression to go through every URL in these 15.93GB of text files and keep only the Canadian domains. As the URLs are extracted in a regular format, this isn’t too difficult.
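Here is the flavour of that regular expression, as a sketch; it assumes each record’s URL sits on its own line:

import re

# Match an http(s) URL whose host ends in .ca, optionally followed by a port
# and a path. This is the test applied to each record's URL line.
CA_URL = re.compile(r"^https?://[^/\s]+\.ca(?::\d+)?(?:/|$)", re.IGNORECASE)

def is_canadian(url_line):
    """Return True if the line looks like a URL on a .ca domain."""
    return bool(CA_URL.match(url_line.strip()))

print(is_canadian("http://uwaterloo.ca/history/"))       # True
print(is_canadian("http://www.ireneguesthouse.co.za/"))  # False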

Step Five: Burst Back to Individual Websites

Now that we have big chunks of Canadian websites, the next step was to burst these back out into individual website files, so that http://uottawa.ca/blah/blah becomes something like xxxx.txt. I’m not too concerned about the file names at this point, but I might refine that if I do this again.
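A sketch of that bursting step could look like the following; it assumes each record in a combined plain-text file begins with a line holding its URL, and it simply numbers the output files:

import os
import re

URL_LINE = re.compile(r"^https?://\S+$")

def burst(combined_path, out_dir):
    """Split a combined plain-text file into one numbered file per website."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    current = None
    with open(combined_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if URL_LINE.match(line.strip()):
                # A new URL line marks the start of the next website record.
                if current:
                    current.close()
                count += 1
                current = open(os.path.join(out_dir, "%04d.txt" % count),
                               "w", encoding="utf-8")
            if current:
                current.write(line)
    if current:
        current.close()

burst("crawl337-canadian.txt", "bursted")  # hypothetical file names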

Step Six: Convert these to XML

Solr wants fields. So as a test, I turned each website into a VERY simple XML file:

<website>
<title>http://test.ca/index.html</title>
<date>2011,09,08,07,05,04,01</date>
<body>Blah blah blah</body>
</website>

I will add more fields the next time through, but this was a starter.
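Generating a record like that is mostly a matter of escaping the text and filling in a template. A sketch (the values here are made up, and the real date would come from the WARC or CDX timestamp):

from xml.sax.saxutils import escape

def write_record(path, url, date, body):
    """Write one website out as the simple XML record shown above."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("<website>\n")
        out.write("<title>%s</title>\n" % escape(url))
        out.write("<date>%s</date>\n" % escape(date))
        out.write("<body>%s</body>\n" % escape(body))
        out.write("</website>\n")

write_record("0001.xml", "http://test.ca/index.html",
             "2011,09,08,07,05,04,01", "Blah blah blah")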

Step Seven: Ingest into Solr

Now we can search, and we have some fields.
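For what it’s worth, one simple way to push records like these into Solr (a sketch, not necessarily how you’d want to do it at scale; it assumes a local Solr reachable at http://localhost:8983/solr with title, date, and body fields in the schema) is to reshape each one into Solr’s <add><doc> update format and POST it:

import urllib.request
import xml.etree.ElementTree as ET

SOLR_UPDATE = "http://localhost:8983/solr/update?commit=true"  # assumed local setup

def ingest(website_xml_path):
    """Reshape one <website> record into Solr's update XML and POST it."""
    record = ET.parse(website_xml_path).getroot()
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for tag in ("title", "date", "body"):
        field = ET.SubElement(doc, "field", name=tag)
        field.text = record.findtext(tag, default="")
    req = urllib.request.Request(SOLR_UPDATE, data=ET.tostring(add),
                                 headers={"Content-Type": "text/xml"})
    urllib.request.urlopen(req)

ingest("0001.xml")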

There’s work to be done, but for the first time through, this has potential. We can run text analysis on these files now, do some searching, and I think we’re getting to the point where we can have some fun with this over the Fall semester!

4 thoughts on “Search Engine on the WARC Collection: An Update on my Project”

  1. christophermfryer says:

    Really fascinating work Ian. Will be interesting to see how this develops. The future uses for this are really exciting – just think what you can do with all that data and analysis!

    Chris

    • Ian Milligan says:

      Thanks Chris, much appreciated. I’m hoping to develop it further over the next few months to get a better sense of what potential we can unlock in these files.

      Hope you have a great weekend!
