Fun with Mahout Clustering (No, Really)

Part of the Mahout Cluster output of the ‘Nashville’ neighbourhood of GeoCities.

Part of the fun of being on vacation is being able to tinker around, guiltlessly, with whatever I want. So between sticky buns in Alma, New Brunswick and ‘arduous’ hikes with the right supplies (nuts and fiction books), I was able to take a quick break to play with Mahout. I decided to play with it on the ‘Nashville’ neighbourhood of GeoCities: a country and western themed place.

This is just an introduction to what I’ve done with it, and takes us quickly from installation through to creating your first clusters on a collection of text documents. The tutorials I link to are far more exhaustive. However, I think that this has the potential to be part of my toolkit going forward (alongside MALLET), and I wanted to share the joy.

What does Mahout do?

I’ve been really interested in using clustering for text analysis, and have previously implemented this with Solr and the Carrot2 workbench to really, really good results. Indeed, the workbench has become my de facto research manager for a big data project I’m working on right now. The downside, however, is that it’s a GUI: great for specific things, but since so much of my workflow involves the command line, I wanted something more customizable.

So to this end, enter Apache Mahout. It does four things: recommendation (taking what you’ve done before and extrapolating future interests), clustering, classification (i.e. tagging your unlabelled documents correctly), and frequent itemset mining (looking at items and seeing which ones frequently appear together).

Installing

Installing Mahout can be a bit persnickety, but the instructions here are quite good. A few commands can get this going if you’ve got SVN installed. On a Mac, it’s built into Xcode’s Command Line Tools suite (which you’d need for a compiler and a host of other things anyway). You will also need Apache Maven up and running, which can be installed relatively quickly. In short:

svn co http://svn.apache.org/repos/asf/mahout/trunk does the download.

Then, in trunk, run mvn install; in the core directory, run mvn compile and mvn install; and in the examples directory, run mvn compile. The tests take a long time, and the tutorials have workarounds for that, but if you’re patient it should be all good.

Getting Started

On my system, I have a bunch of text files from GeoCities: they were all HTML, but I’ve extracted them into plain text for experimentation purposes with Mathematica (although this could probably also be done with textutil in OS X or html2text).
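For anyone without Mathematica, here is a minimal sketch of the same HTML-to-plain-text step using only Python's standard library. It is not what I actually ran; the class name TextExtractor and the function html_to_text are made up for illustration, and real GeoCities pages are messy enough that a dedicated tool may do better.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><head><title>Woody</title><style>b{color:red}</style></head>"
    "<body><p>This land is <b>your</b> land</p></body></html>"
)
text = html_to_text(sample)
```

Looping this over a directory and writing one .txt file per page would give the kind of folder that the seqdirectory command below expects.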

Drawing on this tutorial, “Quick Tour of Text Analysis using the Mahout Command Line,” I was able to get K-Means clustering up and running, here with 20 clusters. I’m just providing the commands from the tutorial here, so please do check it out. You can also set it up to run on a Hadoop cluster, which I tried afterwards. That’s been less successful, as I’ve only got the recommender to work, so I’ll leave it for after vacation.

By putting them here, however, I think you can see how relatively painless the process can be:

./bin/mahout seqdirectory -c UTF-8 -i /users/ianmilligan1/desktop/nashville/ -o nashville-seqfiles

This ingests the files in the ‘nashville’ neighbourhood of GeoCities, which I’ve plucked out and put on my desktop. It then outputs them in the seqfile format for Mahout.

./bin/mahout seq2sparse -i nashville-seqfiles/ -o nashville-vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv

This generates the vectors, which are weighted by TF-IDF (term frequency-inverse document frequency), a staple of textual analysis.
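To make TF-IDF a bit more concrete, here is a small Python sketch of the textbook version: a term's count in a document multiplied by the log of (number of documents / number of documents containing the term), then L2-normalised. This is an illustration under my own assumptions, not Mahout's code; Mahout's exact weighting formula and tokenizer may well differ, and the function name tfidf and the three toy documents are invented.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (textbook tf * log(N/df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        # L2-normalise so that long and short pages are comparable.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


# Toy documents, invented for illustration.
docs = [
    "woody guthrie song lyrics",
    "pete seeger song lyrics",
    "garth brooks tour dates",
]
vectors = tfidf(docs)
```

In the first toy document, "woody" ends up with a higher weight than "song", because "song" also appears in the second document and is therefore less distinctive. That is the whole point of the weighting: words that show up everywhere count for less.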

./bin/mahout kmeans -i nashville-vectors/tfidf-vectors/ -c nashville-kmeans-centroids -cl -o nashville-kmeans-clusters -k 20 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure

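Now we’re getting somewhere: this generates the clusters. I’ve picked 20 here (the -k flag), but you could obviously pick higher or lower. Because -k is given, Mahout chooses 20 random vectors as the starting centroids and writes them to the -c path, so you don’t need to prepare that file yourself.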
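If you are curious what k-means with a cosine distance measure is doing under the hood, here is a rough Python sketch on toy dense vectors. It is not Mahout's implementation: the function names, the seed parameter, and the four sample points are all my own inventions, and max_iter=10 simply mirrors the -x 10 in the command above.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, k, max_iter=10, seed=0):
    """Return a cluster label (0..k-1) for each point."""
    rng = random.Random(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Nothing moved, so we have converged.
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Two obvious groups: vectors pointing along x, and vectors pointing along y.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = kmeans(points, k=2)
```

The first two points should land in one cluster and the last two in the other; which cluster gets which number depends on the random start. Cosine distance looks at the direction of a vector rather than its length, which suits text: a long page and a short page about the same topic point the same way.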

./bin/mahout clusterdump -d nashville-vectors/dictionary.file-0 -dt sequencefile -i nashville-kmeans-clusters/clusters-2-final -n 20 -b 100 -o cdump.txt -p nashville-kmeans-clusters/clusteredPoints/

We’ve got clusters, but we want to do something with them. So this goes through and dumps them into a file, cdump.txt [read here for more on clusterdumper].

Once we look at cdump.txt, we can see a ton of detail. Each document is broken down into weighted words and assigned to a cluster. To get a sense of the clusters, you can search for ‘Top Terms:’ to get the list of words, and then look at individual documents in more detail.

For example,
woody => 0.06504105230999421
guthrie => 0.06345693675681902
page => 0.04458249647461463
song => 0.04078701608223392
analysis => 0.04062143099383295
critical => 0.04060892694443723
almanac => 0.0405323956250233
study => 0.0399896335907644
purpose => 0.03898064585582698
review => 0.03817140674218612
starting => 0.03763161793353704
copyrighted => 0.037198304307685746
material => 0.035867691515149236
he => 0.03571745245053116
seeger => 0.03538629134007021
lyrics => 0.03409052496017037
york => 0.033244908470550234
singers => 0.03305807787915658
top => 0.03239995279291896
songs => 0.03234542755300314

These terms, as the top ones occurring in a cluster, can give you a sense of what to name it. This cluster represents something a bit surprising in this corpus: websites relating to Woody Guthrie, Pete Seeger, and critical analyses of their lyrics. Given the country and western theme of this neighbourhood, the fact that one of the 20 clusters centres on Guthrie and Seeger helps complicate our understanding of what we might find within.
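If you want to work with these term lists programmatically rather than by eye, a short Python sketch like the following can pull the "term => weight" pairs out of a chunk of cdump.txt. It assumes the pairs look like the Woody Guthrie example above; the function name parse_top_terms is made up, and I have not checked it against every layout clusterdump can produce.

```python
import re

# Matches "term => weight", tolerating missing spaces around the arrow.
PAIR = re.compile(r"(\S+?)\s*=>\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def parse_top_terms(text):
    """Return a list of (term, weight) tuples in the order they appear."""
    return [(term, float(weight)) for term, weight in PAIR.findall(text)]


# A few pairs copied from the cluster shown above.
sample = (
    "woody => 0.06504105230999421 "
    "guthrie => 0.06345693675681902 "
    "copyrighted =>0.037198304307685746"
)
pairs = parse_top_terms(sample)
top_words = [term for term, weight in pairs]
```

From there it is a small step to, say, label each cluster with its three heaviest words, or to load the weights into a spreadsheet.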

Just imagine: you can take a large corpus, cluster it, and voilà, you have some natural structure emerging in your work already. It’s an interesting way to think about starting a large Big Data project.

2 thoughts on “Fun with Mahout Clustering (No, Really)”

Fauzy Che Yayah says:

8 February 2014 at 1:52 am

Hi , can i have example of your input file ? or links. Thanks.

Yogesh says:

25 July 2014 at 7:13 am

Hi, thanks for this. I was wondering if you could share details of how the centroid file (nashville-kmeans-centroids) was generated i.e. the format of the file, the logic for creating it and perhaps a sample as well.
Appreciate your help.

Ian Milligan
