I’ve been playing a lot with MALLET (MAchine Learning for LanguagE Toolkit), a command-line program developed at UMass Amherst. Combining it with my Top 40 Lyrics DB, which I’ve discussed elsewhere, I’ve been able to pick out frequently occurring clusters of words (or topics – hence “topic modelling”). With this corpus, after some experimentation, I began by picking out the top 50 topics that appeared.
As I spend the last pre-teaching month of the summer trying to program at least half a day every day (the other half is book and article writing/revising time), I’ve been having a lot of fun tinkering with this material. Topic modelling is proving even more fruitful than keyword searching, mainly because the data comes to me rather than the other way around.
The only downside of MALLET is that the output can be a bit opaque without putting it into another environment. Shawn Graham has a great series on using the Gephi GUI to process it (if you want to use MALLET yourself, his how-to guide is an amazing resource; we have a forthcoming piece in the Programming Historian 2 that will also help new users). I’ve been importing it into Mathematica, my own programming platform of choice. Below is my first level of visualizing, a series of sparklines with topics. After this, I can take the number, plug it into another Mathematica cell, and look at the findings in a bit more detail.
Click on the following picture for more detail if you want to see changes in pop music lyrics between 1964 and 1989. The first figure shows just the topics that have significant shifts, followed by all topics:
And if we pick a promising one to zoom in on, we learn that “tonight night give wanna fire rock inside lover body light burning crazy dream hot” would be a great formula for a successful 1980s pop song. Not so much in the 1960s, though….
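For anyone who wants to try this kind of year-by-year view without Mathematica, here is a rough Python sketch of the idea. It is not the code behind the figures above, and it rests on a few assumptions: that MALLET's document-topics file uses the older layout of one line per document (an index, a source name, then alternating topic number and proportion), that file names contain no spaces, and that each file name includes the song's four-digit year. The function names and the sample file names are made up for illustration. Newer MALLET releases may write this file differently, so check your own output first.

```python
import re
from collections import defaultdict


def parse_doc_topics(lines):
    """Map each document name to a {topic_id: proportion} dict.

    Assumes lines of the form:
        <index> <source> <topic> <proportion> <topic> <proportion> ...
    Lines starting with '#' are treated as headers and skipped.
    """
    docs = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        name, rest = fields[1], fields[2:]
        # Pair up alternating topic ids and proportions.
        docs[name] = {int(t): float(p) for t, p in zip(rest[0::2], rest[1::2])}
    return docs


def year_from_name(name):
    """Pull the first four-digit year out of a file name (hypothetical naming scheme)."""
    match = re.search(r"(19|20)\d{2}", name)
    return int(match.group(0)) if match else None


def yearly_topic_means(docs, topic):
    """Average one topic's proportion across all documents from each year."""
    by_year = defaultdict(list)
    for name, proportions in docs.items():
        year = year_from_name(name)
        if year is not None:
            by_year[year].append(proportions.get(topic, 0.0))
    return {year: sum(vals) / len(vals) for year, vals in sorted(by_year.items())}


# Invented sample lines, standing in for a real MALLET output file.
sample = [
    "#doc source topic proportion ...",
    "0 file:/lyrics/1964_song_a.txt 7 0.10 3 0.90",
    "1 file:/lyrics/1964_song_b.txt 3 0.70 7 0.30",
    "2 file:/lyrics/1985_song_c.txt 7 0.80 3 0.20",
]

docs = parse_doc_topics(sample)
print(yearly_topic_means(docs, topic=7))
```

The dictionary that comes back, one average per year for the chosen topic, is the series a sparkline would draw. Running it once per topic gives the full set.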