Mining the TV News Archives

(I’ve been AWOL for a long time now – I’ve been settling in at UW and, between teaching and working on some manuscripts, have found some time to tinker.)

A few weeks ago, the Internet Archive announced its newest collection: a massively searchable database of news clips. Each clip has its closed-captioning transcription available for searching, and a thirty-second blurb from it can be played under fair use. A frustrating thing for a postwar historian is how inaccessible TV has traditionally been as a source: costly, no searching function, and time-consuming. As a result, historians have usually fallen back on newspapers and so forth.

Check it out yourself: it’s addictive.

One downside of this collection, however, is that it’s just so big. “Canada” alone pulls up over 47,000 hits since 2009. Computational methods are going to be required. But think of the potential: we could harness the power of TV news during this time to get a sense of how a given topic came up in the eternal 24-hour news cycle.

What can we do?

Step Aside, iTunes: Using MALLET to Create Playlists

If only there were some way to make a playlist of “love” songs, I thought, without turning to iTunes’ Genius mixes (or, heaven forbid, finding the songs manually).

As a series of posts in August mentioned, I’ve been using MALLET to topic model Top 40 music from 1964 to 1989. To get a sense of what my topics mean, I’ve been pulling up the top 20 songs that relate to a given topic. This is invaluable: you can get some obscure topics that don’t make sense at first glance, but by looking at the top documents you can begin to make sense of them. All of this also helped to confirm that MALLET is indeed working.
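For the curious, the "top 20 songs for a topic" step can be sketched in a few lines of Python. This is only a rough illustration, not the code I actually ran: it assumes the per-song topic proportions have already been loaded into a dictionary, and the function name `top_documents`, the song titles, and the numbers are all made up for the example.

```python
from typing import Dict, List, Tuple


def top_documents(
    doc_topics: Dict[str, List[float]], topic: int, n: int = 20
) -> List[Tuple[str, float]]:
    """Return the n documents with the highest proportion of one topic."""
    # Pair each title with its weight for the requested topic.
    pairs = ((title, weights[topic]) for title, weights in doc_topics.items())
    # Sort from highest weight to lowest and keep the first n.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical proportions for three songs across three topics (not real data).
doc_topics = {
    "Song A": [0.70, 0.20, 0.10],
    "Song B": [0.10, 0.80, 0.10],
    "Song C": [0.45, 0.05, 0.50],
}

playlist = top_documents(doc_topics, topic=0, n=2)
for title, weight in playlist:
    print(f"{title}: {weight:.2f}")
```

Swap in a real table of topic proportions and a larger `n`, and the returned titles are the playlist.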

With the semester, however, I’m looking to refresh some of my playlists. As a busy person, like any good digital humanist, I realized this cried out for an automated solution*!

Take that, iTunes Genius!

Topic Modelling in the Lyrics Database, Part Three: Talking to Wolfram Alpha

(and here my series continues… I’m blogging through August mainly to keep the work going when it can be so easy to sneak away, and this is more of an internal diary than anything else!)

Importing economic data into Mathematica – it really is this easy…

Mathematica is made by the same company, Wolfram Research, that brings us Wolfram Alpha – the computational knowledge engine that powers parts of Siri, as well as being an overall fun resource to use as historians, tinkerers, or well, anybody (I’ve written about it before). As a diversion, I thought I would start comparing economic data to the topics that I am finding through MALLET.

Using free-form input, let’s get annual figures for unemployment in the US, 1964-1989.

With that done, we can then manipulate our data – getting them into comparable datasets – and begin to run correlations. Let’s see if we can find correlations in topic occurrences against the unemployment rate…
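I do this step in Mathematica, but the idea carries over to any language. Here is a minimal Python sketch of a Pearson correlation between two yearly series. The helper name `pearson` and both series are placeholders I invented for illustration; they are not actual unemployment figures or topic counts.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder yearly values, one entry per year (invented, not real data).
unemployment_rate = [5.0, 4.5, 6.0, 7.5, 7.0]
topic_share = [0.02, 0.02, 0.03, 0.05, 0.04]

r = pearson(unemployment_rate, topic_share)
print(f"r = {r:.3f}")
```

A value near 1 or -1 suggests the two series move together or in opposition; with only a couple of dozen yearly points, anything in between deserves a sceptical look.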

Topic Modelling in the Lyrics Database, Part Two: Finding Trends

A bit of a mess of a visualization, but here’s what we get if we plot all the different topics against each other. Using tooltips, we can figure out which are the most common throughout – and turn them on and off.

In yesterday’s post, I introduced some of the work I’ve been doing with MALLET and provided a list of topics, sparklines, etc. Today I wanted to pull some of that data out and see what trends we could find in the database. Many of the topics found had simple little spikes: a single year where the topic was significant, but not part of a broader trend. Twenty, however, had either rises or falls, and looked like they were worth more investigation.

I divide them into four groups: (1) those that became more prominent from 1964 to 1989; (2) those that became less prominent; (3) those that were always prominent; and (4) those that displayed other sorts of statistical behaviour. Let’s take a closer look.
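I sorted the topics into these groups by eye, but a rough automated version is easy to imagine. The Python sketch below fits a least-squares slope to each topic's yearly values and applies two cutoffs. The function names and both cutoff values are arbitrary choices of mine for illustration, not thresholds used in this analysis.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify(
    values: Sequence[float],
    slope_cutoff: float = 0.002,      # assumed: minimum yearly change to call a trend
    prominent_cutoff: float = 0.05,   # assumed: share a topic must hold every year
) -> str:
    """Assign a yearly topic series to one of the four groups."""
    if len(values) < 2:
        raise ValueError("need at least two yearly values")
    s = slope(values)
    if s > slope_cutoff:
        return "rising"
    if s < -slope_cutoff:
        return "falling"
    if min(values) >= prominent_cutoff:
        return "always prominent"
    return "other"


# Invented yearly shares for four hypothetical topics.
print(classify([0.01, 0.02, 0.03, 0.04, 0.05]))   # steadily climbing
print(classify([0.05, 0.04, 0.03, 0.02, 0.01]))   # steadily dropping
print(classify([0.08, 0.08, 0.08, 0.08]))         # flat and high
print(classify([0.01, 0.01, 0.20, 0.01, 0.01]))   # a single-year spike
```

A single-year spike has almost no overall slope and a low minimum, so it falls into the "other" group, which matches the intuition that spikes are not broader trends.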

In a future post, I will be going into detail with individual songs to find exemplars of these topics. This post is more speculation about what might be happening, and about how these topics could help us in our scholarship…

Topic Modelling in the Lyrics Database, Part One: Checking Out Topics

I’ve been playing a lot with MALLET (MAchine Learning for LanguagE Toolkit), a command-line program developed at UMass Amherst. Combining it with my Top 40 Lyrics DB, which I’ve discussed elsewhere, I’ve been able to pick out frequently occurring clusters of words (or topics – hence “topic modelling”). With this corpus, after some experimentation, I began with picking out the top 50 topics that appeared.

As I spend the last pre-teaching month of the summer trying to program at least half a day every day (the other half is book and article writing/revising time), I’ve been having a lot of fun tinkering with this material. Topic modelling is proving even more fruitful than keyword searching, mainly as the data comes to me rather than the other way around.

The only downside of MALLET is that the output can be a bit opaque without putting it into another environment. Shawn Graham has a great series on using the Gephi GUI to process it (if you want to use MALLET yourself, his how-to guide is an amazing resource; we have a forthcoming piece in the Programming Historian 2 that will also help new users). I’ve been importing it into Mathematica, my own programming platform of choice. Below is my first level of visualizing, a series of sparklines with topics. After this, I can take the number, plug it into another Mathematica cell, and look at the findings in a bit more detail.
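My sparklines come out of Mathematica, but if you want a quick text-only equivalent, here is a small Python sketch that draws one with Unicode block characters. The `sparkline` function and the sample values are my own illustration and assume you already have one number per year for a topic.

```python
from typing import Sequence

BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values: Sequence[float]) -> str:
    """Render a series as a one-line text sparkline."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no shape to show, so draw the lowest bar throughout.
        return BARS[0] * len(values)
    # Scale each value to an index between 0 and len(BARS) - 1.
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values)


# Invented yearly shares for one hypothetical topic.
print(sparkline([0.01, 0.02, 0.02, 0.05, 0.09, 0.04, 0.02]))
```

Printing one line per topic next to its number gives a crude but serviceable overview before digging into any single topic.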