Working Clustered Search on Wide Scrape WARC Collection

Things are coming together on my big project on working with Internet Archive holdings. In my last post, I walked through how I moved from CDX files to downloading a sample of the Canadian Internet, ending with my ingest into Solr. After a few dumb mistakes that slowed me down, the collection is now fully working in the Carrot2 workbench (which I’ve discussed before). Thanks to Nick Ruest (@ruebot) at York University for helping me through some of the specifics of Solr. Hopefully I’ll be able to move on from here to a more streamlined and efficient workflow soon.

To give you a sense of where I’m at now, let’s look at a few queries and their results. Let’s try labour.

Two visualizations of the ‘Labour’ search in the Canadian Wide Scrape subset.

Pretty cool stuff: we can get a sense of what is in these WARC files. Remember, one of the goals of this project was to take these fairly opaque containers and get them to the point where we can monkey around with them using textual analysis.

Even more meaningfully, each of the clusters at right can be broken down into the individual pages it contains: the 123 pages in ‘Studies at York University,’ for example, or the 170 that make up ‘Employment Assistance Services.’

Breaking Clusters down into individual files.

And from there, right down to the individual item – an individual webpage. It isn’t always that readable, as the output from WARC Tools doesn’t contain line breaks, but we’re getting there.

Not too pretty, I’ll concede, but we’re getting to info quickly.

Basically, I’ve got a decently sophisticated search engine up and running on these WARC files.
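
To make the query side a bit more concrete: a search like the ‘labour’ one can be issued as a plain HTTP request to Solr’s standard select handler. Below is a minimal sketch of building such a request URL in Python. The host, port, and core name (`warc`) are assumptions for illustration only, not the actual setup behind this post, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical location of the Solr select handler; adjust host, port,
# and core name ("warc" is a placeholder) to match your own install.
DEFAULT_SELECT_URL = "http://localhost:8983/solr/warc/select"


def build_solr_query(term, base_url=DEFAULT_SELECT_URL, rows=100):
    """Return a Solr select URL asking for JSON results for `term`."""
    params = {
        "q": term,     # the search term, e.g. "labour"
        "rows": rows,  # how many documents to return
        "wt": "json",  # response format
    }
    # urlencode escapes spaces and punctuation in the search term
    return base_url + "?" + urlencode(params)


print(build_solr_query("labour"))
```

Pasting the resulting URL into a browser (or fetching it with any HTTP client) should return the matching documents as JSON, which is handy for checking the index without opening the workbench.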

Next Steps:

  • Mahout K-means clustering on the entire scrape text: took some wrangling with heap sizes and the like, but I think I’ve got this ready.
  • Rudimentary word and phrase computation for further reference
  • Preliminary explorations
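
As a rough idea of what the rudimentary word and phrase computation could look like, here is a minimal sketch in Python using only the standard library. It is an illustration under simple assumptions (lowercased text, tokens split on non-word characters, phrases counted as two-word sequences), not the workflow I’ll necessarily end up with; the function names and the sample sentence are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def count_words_and_phrases(text, n=2):
    """Count single words and n-word phrases (bigrams by default)."""
    tokens = tokenize(text)
    words = Counter(tokens)
    # Slide a window of n tokens across the text to build phrases
    phrases = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return words, phrases


# Made-up sample text, standing in for the plain text of one page
sample = "Employment assistance services and employment assistance programs"
words, phrases = count_words_and_phrases(sample)
print(words.most_common(3))
print(phrases.most_common(3))
```

Because the extracted text has no line breaks, treating each page as one long string like this is actually convenient: the sliding window never has to worry about phrases split across lines.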
