Part of the fun of being on vacation is being able to tinker around, guiltlessly, with whatever I want. So between sticky buns in Alma, New Brunswick and ‘arduous’ hikes with the right supplies (nuts and fiction books), I was able to take a quick break to play with Mahout. I decided to play with it on the ‘Nashville’ neighbourhood of GeoCities: a country and western themed place.
This is just an introduction to what I’ve done with it, and takes us quickly from installation through to creating your first clusters on a collection of text documents. The tutorials I link to are far more exhaustive. However, I think that this has the potential to be part of my toolkit going forward (alongside MALLET), and I wanted to share the joy.
What does Mahout do?
I’ve been really interested in using clustering for text analysis, and have previously implemented it with Solr and the Carrot2 Workbench to really, really good results. Indeed, the Workbench has become my de facto research manager for a big data project I’m working on right now. The downside, however, is that it’s a GUI: great for specific things, but since so much of my workflow involves the command line, a bit more customization would be good.
So to this end, enter Apache Mahout. It does four things: recommendation (taking what you’ve done before and extrapolating future interests), clustering (grouping similar documents together), classification (i.e. correctly tagging your unlabelled documents), and frequent itemset mining (looking at items and seeing which ones frequently appear together).
Installing Mahout can be a bit persnickety, but the instructions here are quite good. In short, a few commands can get this going if you’ve got SVN installed. On a Mac, it’s built into Xcode’s Command Line Tools suite (which you’d need for a compiler and a host of other things anyway). You will also need Apache Maven up and running, which can be installed relatively quickly. In short:
svn co http://svn.apache.org/repos/asf/mahout/trunk
does the download. Then, from the trunk directory, run:
mvn install
Next, in the core directory:
mvn compile
mvn install
And finally, in the examples directory:
mvn compile
The tests take a long time, and the tutorials have workarounds for that, but if you’re patient it should be all good.
On my system, I have a bunch of text files from GeoCities: they were all HTML, but I’ve extracted them into plain text for experimentation purposes with Mathematica (although this could probably also be done with textutil in OS X or html2text).
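If you’d rather not reach for Mathematica, the HTML-to-text step can be sketched with nothing but Python’s standard library. This is just an illustrative sketch, not what I actually ran; the sample page is made up:

```python
# A minimal sketch of HTML-to-plain-text extraction using only Python's
# standard library (an alternative to Mathematica, textutil, or html2text).
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, skipping scripts/styles."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# A made-up page in the spirit of the corpus:
page = "<html><body><h1>Nashville</h1><p>Country &amp; western links.</p></body></html>"
print(html_to_text(page))  # Nashville Country & western links.
```

Looped over a directory of .html files, this gets you the plain-text corpus that the Mahout commands below expect.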
Drawing on this tutorial, “Quick Tour of Text Analysis using the Mahout Command Line,” I was able to get K-Means clustering up and running. This run produces 20 clusters. I’m just providing the commands from the tutorial here, so please do check it out. You can also set it up to run on a Hadoop cluster, which I did afterwards. It’s been less successful, as I’ve only got the recommender to work, so I’ll leave that for after vacation.
By putting them here, however, I think you can see how relatively painless the process can be:
./bin/mahout seqdirectory -c UTF-8 -i /users/ianmilligan1/desktop/nashville/ -o nashville-seqfiles
This ingests the files in the ‘nashville’ neighbourhood of GeoCities, which I’ve plucked out and put onto my desktop. It then outputs them in the SequenceFile format that Mahout expects.
./bin/mahout seq2sparse -i nashville-seqfiles/ -o nashville-vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv
This generates the vectors, weighted with TF-IDF (term frequency–inverse document frequency) – a staple of textual analysis.
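To make that weighting concrete, here’s a toy Python sketch of TF-IDF. Note it is a simplification, not Mahout’s exact implementation – seq2sparse also applies the normalization and pruning controlled by the flags above – and the three little documents are invented:

```python
# A toy illustration of TF-IDF weighting: terms that appear in every
# document get an idf of zero, while rarer terms are weighted up.
# This is a sketch of the idea, not Mahout's exact formula.
import math

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * idf."""
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            tf = doc.count(term) / len(doc)
            idf = math.log(n / df[term])
            w[term] = tf * idf
        weights.append(w)
    return weights

# Invented mini-corpus in the spirit of the neighbourhood:
docs = [["woody", "guthrie", "song"],
        ["song", "lyrics", "seeger"],
        ["banjo", "banjo", "song"]]
w = tf_idf(docs)
# "song" appears in every document, so its idf -- and weight -- is zero:
print(w[0]["song"])  # 0.0
```

The upshot: common filler words fall away, and distinctive terms like ‘guthrie’ dominate a document’s vector – which is exactly what makes the clusters below interpretable.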
./bin/mahout kmeans -i nashville-vectors/tfidf-vectors/ -c nashville-kmeans-centroids -cl -o nashville-kmeans-clusters -k 20 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
Now we’re getting somewhere – this generates the clusters. I’ve picked 20 here (the -k flag), but you could obviously go higher or lower.
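Under the hood this is Lloyd’s k-means algorithm with the cosine distance measure named in the command. A bare-bones Python sketch, on toy two-dimensional vectors rather than the sparse TF-IDF vectors Mahout actually uses:

```python
# A bare-bones sketch of k-means with cosine distance: assign each point
# to its nearest centroid, move each centroid to its cluster's mean, repeat.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def kmeans(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: cosine_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to its cluster's mean.
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two obvious groups of toy vectors:
points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
clusters = kmeans(points, centroids=[[1, 0], [0, 1]])
print(clusters)  # [[[1, 0], [0.9, 0.1]], [[0, 1], [0.1, 0.9]]]
```

Cosine distance matters for text: it compares the direction of the TF-IDF vectors rather than their magnitude, so a long page and a short page about the same topic still land in the same cluster.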
./bin/mahout clusterdump -d nashville-vectors/dictionary.file-0 -dt sequencefile -i nashville-kmeans-clusters/clusters-2-final -n 20 -b 100 -o cdump.txt -p nashville-kmeans-clusters/clusteredPoints/
We’ve got clusters, but we want to do something with them. So this goes through and dumps them into a file, cdump.txt [read here for more on clusterdumper].
Once we look at cdump.txt, we can see a ton of detail. Each document is broken down into different weights on words, and assigned to a cluster. To get a sense of the clusters, you can search for ‘Top Terms:’ and get the list of words, and then go into documents more in detail.
woody => 0.06504105230999421
guthrie => 0.06345693675681902
page => 0.04458249647461463
song => 0.04078701608223392
analysis => 0.04062143099383295
critical => 0.04060892694443723
almanac => 0.0405323956250233
study => 0.0399896335907644
purpose => 0.03898064585582698
review => 0.03817140674218612
starting => 0.03763161793353704
he => 0.03571745245053116
seeger => 0.03538629134007021
lyrics => 0.03409052496017037
singers => 0.03305807787915658
top => 0.03239995279291896
songs => 0.03234542755300314
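Recovering a ‘Top Terms’ list like the one above from a cluster’s term weights is just a sort. A quick sketch, with the weights copied from the dump:

```python
# Sorting a {term: weight} map to recover the "Top Terms" list that
# clusterdump prints; weights copied from the cluster dump above.
weights = {
    "woody": 0.06504105230999421,
    "guthrie": 0.06345693675681902,
    "page": 0.04458249647461463,
    "song": 0.04078701608223392,
}
top_terms = sorted(weights, key=weights.get, reverse=True)
print(top_terms[:2])  # ['woody', 'guthrie']
```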
These terms, as the top ones occurring in a cluster, can give you a sense of what to name it. This cluster represents something a bit surprising in this corpus: websites relating to Woody Guthrie, Pete Seeger, and critical analyses of their lyrics. Given the country and western nature of this neighbourhood, the fact that one of the 20 clusters reflects this helps complicate our understanding of what we might find within.
Just imagine – you can take a large corpus, cluster it, and voilà, you have some natural structure emerging in your work already. It’s an interesting way to think about starting a large Big Data project.