It Works! Cluster Results from a Million-Document Database

There was a fair amount of interest in my Carrot2 post, so this is a bit of a follow up. In short: it works on big datasets! On a trial collection of 1,023,042 URLs/documents, ingested from a large collection of websites, Solr indexed it in an hour and forty minutes. Carrot2 is working swimmingly (I increased memory in ~/solr/carrot2-workbench-3.7.1/carrot2-workbench.app/Contents/MacOS/carrot2-workbench.ini to approximately 7gb) and generates clusters from search queries.

In the first little bit that I’ve been playing with it, I’ve been impressed. To me, this really reinforces that I think this is a key tool that digital humanists might want to consider (and hopefully some people looking for similar tools stumble upon this blog post).

To give you a sense of how it works in practice, here’s a set of visualizations for the simple query history.

Screen Shot 2013-07-04 at 10.51.57 AM

We’ve got some great results: we can see the interest in family history, individual nationalities (American and Irish), the links to research and educational life (students), and other things in the top 100 results that I’ve pulled as a trial. “Links” appear as a topic, perhaps indicative of the development of a community linked to each other.

If we scale it up to 10,000 results we’re getting even more information, but still some distortions.

Screen Shot 2013-07-04 at 11.32.06 AM

It’s diverse, it’s picking up common separate ‘footer’ pages across the collection and clustering them, but we’ve got some gems: again, family history is key, but so are distinct themes: “This Day in History” appearing across many websites, little Factoids and stuff, lists, food, but then descending into individual topics, from American, to Church, to French, etc.

Anyways, it’s all early days and this has only been running for an hour or so.

3 thoughts on “It Works! Cluster Results from a Million-Document Database

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s