Well, things are coming together on my big project about working with Internet Archive holdings. In my last post, I showed how I moved from CDX files to downloading a sample of the Canadian Internet, ending with the ingest into Solr. After a few dumb mistakes that slowed me down, it’s now fully working in the Carrot2 workbench (which I’ve discussed before). Thanks to Nick Ruest (@ruebot) at York University for helping me through some of the specifics of Solr; from there, I hope to move on to a more streamlined and efficient workflow soon.
To give you a sense of where I’m at now, let’s look at a few queries and their results, starting with ‘labour.’
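Under the hood, a query like this is just a request to Solr’s select handler. Here’s a minimal sketch of what that looks like; the core name (`warc_text`) and field name (`content`) are my assumptions, not necessarily what the actual index uses.

```python
# Hypothetical sketch of querying a Solr index built from WARC-derived text.
# The core name "warc_text" and the "content" field are assumptions.
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr/warc_text/select"  # assumed core name


def build_query(term, rows=10):
    """Build a Solr select URL for a full-text search on the content field."""
    params = {
        "q": f"content:{term}",  # field-scoped query, e.g. content:labour
        "rows": rows,            # number of documents to return
        "wt": "json",            # ask for a JSON response
    }
    return SOLR_BASE + "?" + urlencode(params)


print(build_query("labour"))
```

The resulting URL can be fetched with any HTTP client, and the JSON response carries the matching documents that tools like the Carrot2 workbench then cluster.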
Pretty cool stuff: we can get a sense of what is in these WARC files. Again, remember that one of the goals of this was to take these fairly opaque containers and get them to a position where we can monkey around with them using textual analysis.
Even more meaningfully, each of those clusters at right can be broken down into the individual pages it contains: the 123 in ‘Studies at York University,’ for example, or the 170 that make up “Employment Assistance Services.”
And from there, right down to an individual item – a single webpage. It isn’t always that readable, as the output from WARC Tools doesn’t contain line breaks, but we’re getting there.
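The missing line breaks are easy enough to patch over for reading purposes. A quick sketch with the standard library (the sample text is just a stand-in for a page of extracted WARC text):

```python
# Re-wrap the unbroken runs of text that the extraction step emits,
# so an individual page is readable in the terminal.
import textwrap


def rewrap(text, width=80):
    """Collapse whitespace, then insert line breaks at the given width."""
    return textwrap.fill(" ".join(text.split()), width=width)


sample = "one long unbroken run of extracted page text " * 10  # stand-in text
print(rewrap(sample))
```

This doesn’t restore the page’s original paragraph structure, of course, but it makes skimming individual items far less painful.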
Basically, I’ve got a decently sophisticated search engine up and running on these WARC files.
- Mahout K-means clustering on the entire scrape text: took some wrangling with heap sizes and the like, but I think I’ve got this ready.
- Rudimentary word and phrase computation for further reference
- Preliminary explorations
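To show what the clustering step is doing conceptually, here is a toy k-means in plain Python. Mahout does this at scale over sparse term vectors generated from the scrape text; the tiny two-dimensional points below are stand-ins for those document vectors, and everything here is illustrative rather than the actual pipeline.

```python
# Toy k-means sketch: what Mahout's clustering does, in miniature.
# Points stand in for document term vectors; real runs use sparse vectors.
import random


def kmeans(points, k, iters=20, seed=42):
    """Cluster points into k groups by iterative centroid refinement."""
    random.seed(seed)
    centroids = random.sample(points, k)  # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Two obvious groups of "documents" for illustration.
points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)]
cents, clus = kmeans(points, k=2)
```

On real collections the heap-size wrangling mentioned above comes from holding those vectors and centroids in memory, which is exactly the part Mahout distributes.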