Switching Gears into Writing Mode: A Short Update

The blog has been a bit quiet lately, and for good reason: I’ve been writing! This has been thanks to a pretty good workflow that’s been helping me organize research findings into discrete chunks. I wanted to surface briefly, both to keep this blog going and to let people know what I’ve been up to. I’ll focus on methodology here, although the manuscript I’m writing right now is entirely content-driven.

It comes as no surprise that I’ve been working with ever bigger datasets: the current one is around 900GB, and the subset I’m drawing on for a chapter consists of 3,466,714 individual files. It’s an absurd amount of information.

Here, Carrot2 Workbench has been key. After ingesting sources with Solr, I’ve used the ResourceField field to populate search results. So I can now click on a link that Carrot2 surfaces through clustering and be brought right to the source.
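For anyone curious what the Solr side of this might look like outside the Workbench, here is a minimal sketch rather than my exact setup. It builds a query URL for Solr's standard /select handler and pulls a link field out of a JSON response. The host, the core name ("webarchive"), and the idea that a stored field called ResourceField holds the link back to the source are all assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, query, rows=100,
                         fields=("id", "ResourceField")):
    """Build a URL for Solr's /select handler that returns JSON.

    The core name and field names are hypothetical placeholders.
    """
    params = {
        "q": query,               # the search terms
        "rows": rows,             # how many results to return
        "fl": ",".join(fields),   # which stored fields to include
        "wt": "json",             # ask for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def source_links(solr_response, link_field="ResourceField"):
    """Collect the link field from each document in a parsed Solr JSON response."""
    docs = solr_response.get("response", {}).get("docs", [])
    return [doc[link_field] for doc in docs if link_field in doc]


if __name__ == "__main__":
    # Hypothetical local Solr instance and core name.
    print(build_solr_query_url("http://localhost:8983/solr", "webarchive",
                               "community leader", rows=10))
```

The point of keeping the link in its own field is that whatever sits on top of Solr, a clustering tool or a quick script, can hand back something clickable rather than just an internal document ID.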

I’m able to trace out outlines of community relationships – in the below case, specifically the “community leader” apparatus that tied this web community together – and note overlaps between sources. I can’t read every single page, by any stretch of the imagination, but I can run these sorts of analytics.

This sort of clustering isn’t the be-all and end-all. I’ve been using LDA topic modelling, via MALLET (check out our Programming Historian 2 lesson), to get overviews. Even though I know it’s not magic, the results can be stunning. A topic model of an “entertainment” subset from the late 1990s came up with a cluster “joey rachel ross monica chandler” and a “children” subset came up with “pooh friends tigger winnie christopher color piglet.”
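If you want to skim MALLET's output programmatically, a small sketch like the following can help. It assumes the topic keys file written by MALLET's --output-topic-keys option, where each line holds a topic number, a weight, and the topic's top words, separated by tabs. The function names are my own for illustration, and the topic numbers and weights in the sample are made up; only the words come from the results quoted above.

```python
def parse_topic_keys(lines):
    """Parse MALLET topic-keys lines into {topic_id: [top words]}.

    Each line is expected to look like: "<id>\t<weight>\t<word word word ...>".
    """
    topics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic_id, _weight, words = line.split("\t", 2)
        topics[int(topic_id)] = words.split()
    return topics


def topics_containing(topics, term):
    """Return the sorted IDs of topics whose top words include the given term."""
    return sorted(tid for tid, words in topics.items() if term in words)


if __name__ == "__main__":
    # Topic IDs and weights here are invented placeholders.
    sample = [
        "0\t0.5\tjoey rachel ross monica chandler",
        "1\t0.5\tpooh friends tigger winnie christopher color piglet",
    ]
    topics = parse_topic_keys(sample)
    print(topics_containing(topics, "tigger"))
```

With real output you would read the lines from the keys file instead of a hard-coded list, then look up a term to see which topics are worth a closer read.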

The “Community” Cluster, based on plotting 10,000 ranking results w/ Aduna Cluster visualization. It is of fairly minimal use as a visualization, but shows how a community leader apparatus existed. It also helps me find good, relevant pages.
Connective Tissue
