Mining the TV News Archives

(I’ve been AWOL for a long time now: I’ve been settling in at UW and, between teaching and working on some manuscripts, have found some time to tinker.)

A few weeks ago, the Internet Archive announced its newest collection: a massive, searchable database of news clips. Each clip’s closed-captioning transcript can be searched, and a thirty-second excerpt from it can be played under fair use. A frustrating thing for a postwar historian is how inaccessible TV has traditionally been as a source: costly, unsearchable, and time-consuming to work with. As a result, historians have usually fallen back on newspapers and similar sources.

Check it out yourself: it’s addictive.

One downside of this collection, however, is that it’s just so big. “Canada” alone pulls up over 47,000 hits since 2009. Computational methods are going to be required. But think of the potential: we could harness the power of TV news during this time to get a sense of how a given topic came up in the eternal 24-hour news cycle.

What can we do?

Extracting Weather Data for All Dates in a Document: Adding Context to a Document

The weather in Ottawa for Wednesday, November 25th, 1981 (date extracted from Trudeau-Lévesque correspondence). What if we could get this data for every date in a document?

Building on my theme of using Wolfram|Alpha to figure out things about the past, I wondered if I could write a program that could take a document or secondary source, extract all of the dates, and then let you know about that day’s weather. I envision this as eventually being a reader that pops up weather data as you’re reading something, giving you some added context about the date being mentioned. It would have to be automated, though, otherwise you might not want to use it!

Luckily, Mathematica has date recognition functions built in. As a proof of concept, I decided to test it out on the Wikipedia page for Stephen Harper (Canada’s current Prime Minister). We’ll try it out on primary documents below. Some tweaking is required depending on the date format, but I could easily write a function that extracts all of the dates.
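To make the date-extraction step concrete outside Mathematica, here is a minimal Python sketch. It assumes dates are written in the “Month Day, Year” style (with optional ordinal suffixes), which is only one of the formats a real document might use. The function name `extract_dates` and the sample sentence are made up for this example, and the weather lookup itself is left out.

```python
import re
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Matches e.g. "November 25th, 1981" or "December 2, 1981" (assumed format)
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b"
)

def extract_dates(text):
    """Return a datetime.date for every 'Month Day, Year' match, in order."""
    found = []
    for month_name, day, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), MONTHS.index(month_name) + 1, int(day)))
        except ValueError:
            # Skip impossible dates such as "February 30, 1981"
            continue
    return found

# Hypothetical sample text, written for this example
sample = ("The letter was dated Wednesday, November 25th, 1981, "
          "and a reply followed on December 2, 1981.")
dates = extract_dates(sample)
for d in dates:
    print(d.isoformat())
```

Each extracted date could then be handed to whatever weather source is available; other date formats would need their own patterns.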

Using Stanford NLP and Wolfram|Alpha to Help Date a Document?

The popularity of the name ‘Daniel.’ Full disclosure: this idea began when I was running spurious comparisons between lyrics data and names.

Yesterday, I spent most of the morning playing around with Wolfram|Alpha and integrating it with some of my topic models. As a brief diversion (based on a comment William Turkel made at one of his Going Digital in Two Hours seminars a few years ago), I wondered if I could try to learn more about a document using extracted names and birth popularity.

This is just playing around with the concept.

Here is what I would do:

(1) Extract all of the Proper Nouns, Singular (NNP), which should hopefully contain most of the names in the document. I would do this using Stanford NLP, which is part of the SEASR MEANDRE workbench.
(2) Import the data into Mathematica, creating a list of names.
(3) Using Mathematica, query Wolfram|Alpha, extracting the relative popularity of the name as a percentage of births.
(4) Plot the data.
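Steps (1) and (2) above can be sketched in a few lines of Python. This is only an illustration, not the SEASR/Mathematica workflow itself: it assumes the tagger output has already been saved as whitespace-separated `word_TAG` tokens, and the function name `proper_nouns` and the sample string are made up for the example. The Wolfram|Alpha query and the plot in steps (3) and (4) are left out.

```python
from collections import Counter

def proper_nouns(tagged_text, separator="_"):
    """Collect tokens tagged NNP from whitespace-separated word<sep>TAG text."""
    names = []
    for token in tagged_text.split():
        # rpartition splits on the LAST separator, so words that
        # themselves contain the separator stay intact
        word, sep, tag = token.rpartition(separator)
        if sep and tag == "NNP":
            names.append(word)
    return names

# Hypothetical tagger output, written by hand for this example
tagged = ("Daniel_NNP wrote_VBD to_TO Pierre_NNP about_IN "
          "the_DT letter_NN Daniel_NNP sent_VBD ._.")

name_counts = Counter(proper_nouns(tagged))
print(name_counts.most_common())
```

Not every NNP is a personal name (places and organizations get the same tag), so the resulting list would still need filtering before any popularity lookup.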

Topic Modelling in the Lyrics Database, Part Three: Talking to Wolfram Alpha

(and here my series continues… I’m blogging through August mainly to keep the work going when it can be so easy to sneak away, and this is more of an internal diary than anything else!)

Importing economic data into Mathematica – it really is this easy…

Mathematica is made by Wolfram Research, the same company that brings us Wolfram|Alpha: the computational knowledge engine that powers parts of Siri and is an all-around fun resource for historians, tinkerers, or, well, anybody (I’ve written about it before). As a diversion, I thought I would start comparing economic data to the topics that I am finding through MALLET.

Using free-form input, let’s get annual figures for unemployment in the US, 1964-1989.

With that done, we can then manipulate our data, putting it into comparable datasets, and begin to run correlations. Let’s see if we can find correlations between topic occurrences and the unemployment rate…
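For readers who want to see what “comparable datasets” and a correlation amount to, here is a minimal Python sketch, separate from the Mathematica workflow in the post. It aligns two year-keyed series on the years they share and computes a Pearson correlation coefficient. The function names `align` and `pearson` are made up for the example, and the numbers are placeholders only, not real unemployment figures or topic weights.

```python
from math import sqrt

def align(series_a, series_b):
    """Keep only the years present in both series, in chronological order."""
    years = sorted(set(series_a) & set(series_b))
    return [series_a[y] for y in years], [series_b[y] for y in years]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# PLACEHOLDER values only: not real unemployment rates or topic weights
unemployment = {1964: 1.0, 1965: 2.0, 1966: 3.0, 1967: 4.0, 1968: 5.0}
topic_weight = {1965: 0.4, 1966: 0.6, 1967: 0.8, 1968: 1.0, 1969: 1.2}

xs, ys = align(unemployment, topic_weight)
r = pearson(xs, ys)
print(round(r, 3))
```

A high coefficient on its own says nothing about cause, which matters when comparing song lyrics with economic indicators.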

Topic Modelling in the Lyrics Database, Part Two: Finding Trends

A bit of a mess of a visualization, but here’s what we get if we plot all the different topics against each other. Using tooltips, we can figure out which are the most common throughout, and turn them on and off.

In yesterday’s post, I introduced some of the work I’ve been doing with MALLET and provided a list of topics, sparklines, etc. Today I wanted to pull some of that data out and see what trends we could find in the database. Many of the topics found had simple little spikes: a single year where the topic was significant, but not part of a broader trend. Twenty, however, showed either rises or falls, and looked like they were worth more investigation.

I divide them into four groups: (1) those that became more prominent between 1964 and 1989; (2) those that became less prominent; (3) those that were always prominent; and (4) those that displayed other sorts of statistical behaviour. Let’s take a closer look.
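One way such a four-way split could be automated is sketched below in Python. This is not how the grouping in this post was done; it is a rough illustration that fits a least-squares line to each topic’s yearly weights and labels the topic by the slope. The function names and both cutoff values are arbitrary assumptions chosen for the example.

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two yearly values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def classify(values, slope_cutoff=0.01, prominence_cutoff=0.5):
    """Label a yearly series; both cutoffs are assumptions, not tuned values."""
    s = slope(values)
    if s > slope_cutoff:
        return "rising"
    if s < -slope_cutoff:
        return "falling"
    if min(values) >= prominence_cutoff:
        return "always prominent"
    return "other"

# Toy series, invented for this example
print(classify([0.1, 0.2, 0.3, 0.4]))
print(classify([0.4, 0.3, 0.2, 0.1]))
print(classify([0.6, 0.6, 0.6, 0.6]))
print(classify([0.1, 0.4, 0.4, 0.1]))
```

A single slope misses topics that rise and then fall, which is why the last toy series lands in the catch-all group; eyeballing the sparklines remains a useful check.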

In a future post, I will go into detail with individual songs to find exemplars of these topics. This post is more speculation about what might be happening, and about how these topics could help us in our scholarship…

Topic Modelling in the Lyrics Database, Part One: Checking Out Topics

I’ve been playing a lot with MALLET (MAchine Learning for LanguagE Toolkit), a command-line program developed at UMass Amherst. Combining it with my Top 40 Lyrics DB, which I’ve discussed elsewhere, I’ve been able to pick out frequently occurring clusters of words (or topics, hence “topic modelling”). With this corpus, after some experimentation, I began by picking out the top 50 topics that appeared.

As I spend the last pre-teaching month of the summer trying to program at least half a day every day (the other half is book and article writing/revising time), I’ve been having a lot of fun tinkering with this material. Topic modelling is proving even more fruitful than keyword searching, mainly because the data comes to me rather than the other way around.

The only downside of MALLET is that the output can be a bit opaque without putting it into another environment. Shawn Graham has a great series on using the Gephi GUI to process it (if you want to use MALLET yourself, his how-to guide is an amazing resource; we have a forthcoming piece in the Programming Historian 2 that will also help new users). I’ve been importing it into Mathematica, my own programming platform of choice. Below is my first pass at visualization, a series of sparklines with topics. After this, I can take a topic’s number, plug it into another Mathematica cell, and look at the findings in a bit more detail.