The blog has been a bit quiet lately, and for good reason: I’ve been writing! That’s thanks to a pretty good workflow that’s been helping me organize research findings into discrete chunks. I wanted to surface briefly, both to keep this blog going and to let people know what I’ve been up to. I’ll focus on methodology here, although the manuscript I’m writing right now is entirely content-driven.
It comes as no surprise that I’ve been working with ever bigger datasets: the current one is around 900GB, and the subset I’m building a chapter from consists of 3,466,714 individual files. It’s an absurd amount of information.
Here, Carrot2 Workbench has been key. After ingesting sources with Solr, I’ve used the ResourceField field to populate search results. So we can now click on a link that Carrot2 surfaces through clustering and be brought right to the source.
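For anyone who wants to poke at the same Solr index outside of Carrot2, here is a minimal Python sketch of that search-result-to-source step. It is not my actual setup: the core name `webarchive`, the field names `title` and `url`, and the localhost address are all assumptions for illustration. It only builds a standard Solr `select` URL and pulls links out of the usual JSON response shape.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, query, fields=("id", "title", "url"), rows=100):
    """Build a Solr /select URL asking for JSON results.

    base_url, core and the field names are hypothetical placeholders;
    substitute whatever your own Solr schema uses.
    """
    params = {
        "q": query,               # the search terms
        "fl": ",".join(fields),   # which stored fields to return
        "rows": rows,             # how many documents to return
        "wt": "json",             # response format
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def extract_links(solr_response, title_field="title", url_field="url"):
    """Return (title, url) pairs from a parsed Solr JSON response.

    Documents without the URL field are skipped, since there is
    nothing to click through to.
    """
    docs = solr_response.get("response", {}).get("docs", [])
    return [(doc.get(title_field, ""), doc[url_field]) for doc in docs if url_field in doc]
```

You would fetch the URL with whatever HTTP client you like, parse the JSON, and hand the resulting dictionary to `extract_links`. Keeping the URL-building separate from the fetching makes it easy to check the query before pointing it at a 900GB index.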
I’m able to trace out outlines of community relationships – in the below case, specifically the “community leader” apparatus that tied this web community together – and note overlaps between sources. I can’t read every single page, by any stretch of the imagination, but I can run these sorts of analytics.
This sort of clustering isn’t the be-all and end-all. I’ve also been using LDA topic modelling, via MALLET (check out our Programming Historian 2 lesson), to get overviews. Even though I know it’s not magic, the results can be stunning. A topic model of an “entertainment” subset from the late 1990s came up with a cluster “joey rachel ross monica chandler” and a “children” subset came up with “pooh friends tigger winnie christopher color piglet.”
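If you want to skim topics like these programmatically, here is a small Python sketch that reads MALLET’s topic-keys output (the file written by `--output-topic-keys`), where each line holds a topic number, a weight, and the top words, separated by tabs. The function name and the sample weights below are made up for illustration; only the topic words come from the results above.

```python
def parse_topic_keys(text):
    """Parse MALLET topic-keys text into {topic_id: (weight, [words])}.

    Each non-empty line is expected to look like:
        <topic id> TAB <weight> TAB <space-separated top words>
    """
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics


# Illustrative input: the weights (0.05) are placeholders, not real output.
sample = (
    "0\t0.05\tjoey rachel ross monica chandler\n"
    "1\t0.05\tpooh friends tigger winnie christopher color piglet\n"
)

for topic_id, (weight, words) in parse_topic_keys(sample).items():
    print(topic_id, weight, " ".join(words))
```

From there it is a short step to filtering topics by a keyword or dumping them into a spreadsheet for closer reading.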