(I’ve been AWOL for a long time now: I’ve been settling in at UW and, between teaching and working on some manuscripts, have found some time to tinker.)
A few weeks ago, the Internet Archive announced its newest collection: a massively searchable database of news clips. Each clip’s closed-captioning transcript is available for searching, and a thirty-second blurb from it can be played under fair use. A frustrating thing for a postwar historian is how inaccessible TV has traditionally been as a source: costly, unsearchable, and time-consuming to work through. As a result, historians have usually fallen back on newspapers and the like.
Check it out yourself: it’s addictive.
One downside of this collection, however, is that it’s just so big. “Canada” alone pulls up over 47,000 hits since 2009. Computational methods are going to be required. But think of the potential: we could harness the power of TV news during this time to get a sense of how a given topic came up in the eternal 24-hour news cycle.
What can we do?
I decided to do a trial study and use Mathematica to read some of the transcripts for me. Luckily, it can extract the plain text from each page. With some tweaking, I had Mathematica following the “next” results link on each page. I decided to start out small, with around 1,500 transcripts, and elected to go back only as far as February 2012. That is still a large amount of information: some 8.1MB of plain text. With the transcripts in hand, I could begin crunching the data.
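For readers who would rather not use Mathematica, the “follow the next link” step can be sketched in Python. This is only a rough illustration, not the code I ran: the `NEXT_LINK` pattern assumes the results page marks its pagination with an anchor whose visible text is “next”, the URLs are placeholders, and `fetch` is a hypothetical stand-in for whatever actually downloads a page.

```python
import re
from urllib.parse import urljoin

# Assumption: the "next" link is an <a> element whose visible text is "next".
# The real results pages may mark pagination differently.
NEXT_LINK = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>\s*next\s*</a>', re.IGNORECASE)


def find_next(html, base_url):
    """Return the absolute URL of the "next" link, or None if there is none."""
    match = NEXT_LINK.search(html)
    return urljoin(base_url, match.group(1)) if match else None


def crawl(start_url, fetch, max_pages=100):
    """Follow "next" links from start_url and return the HTML of each page visited."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        url = find_next(html, url)
    return pages


# A fake three-page site stands in for real HTTP requests (placeholder URLs).
fake_site = {
    "http://example.org/search?page=1": '<p>one</p><a href="?page=2">next</a>',
    "http://example.org/search?page=2": '<p>two</p><a class="n" href="/search?page=3">Next</a>',
    "http://example.org/search?page=3": "<p>three</p>",
}
pages = crawl("http://example.org/search?page=1", fake_site.__getitem__)
print(len(pages), "pages collected")
```

Passing `fetch` in as an argument keeps the crawling logic separate from the downloading, so the loop can be tried out on canned pages before it is pointed at a live site.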
Word frequency is interesting. Some of the common words, such as “president,” “morning,” and “because,” hint at major themes in the American news cycle, the urge to explain why something happens, and the news cycle itself. It’s fun to play with, and with enough data you could take the Culturomics approach of watching issues rise and fall over time (with n-grams and the like).
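A word count like this takes only a few lines. Here is a minimal Python sketch rather than the Mathematica I actually used: the stopword list and the two sample sentences are made up for illustration, and a real run would read in the scraped transcripts instead.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not the list used for this post).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with"}


def word_frequencies(transcripts, stopwords=STOPWORDS):
    """Count lowercase word tokens across all transcripts, skipping stopwords."""
    counts = Counter()
    for text in transcripts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


# Made-up sample sentences standing in for real transcripts.
sample = [
    "The president spoke this morning because of the storm.",
    "This morning the president visited Canada.",
]
freqs = word_frequencies(sample)
print(freqs.most_common(5))
```

Note that a standard stopword list would probably strip out “because” as well, so whether it shows up among the common words depends on which list you pick.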
Given my current interests in topic modelling, I was curious what would appear if I ran this archive through MALLET. Sure enough, some major themes appear, under two main headings. First, we see those topics that appear consistently with mentions of Canada: weather (“back high today weather morning west low winds pressure sunday tonight newsline south cool forecast chance dry”: that cold front that Canadians keep sending America, it appears); the general language of the news (“morning city hour report eyewitness american coming weekend story called end”); and the consistent state-to-state relations (e.g. “canada time america country states washington,” etc.).
Second, there are topics that revolve around individual events that receive a ton of attention: an incident involving a flight between Holland and Toronto, a derailment, mentions around the Olympics, and even a topic that includes a story that must have broken about Justin Bieber!
I’ve put the topics below in order of importance: topic 32, at the top, was the most frequent topic overall (averaged across all documents). The sparkline at right shows how each topic’s frequency has ebbed and flowed between February 2012 and today (October 3rd).
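The ordering itself is simple arithmetic: average each topic’s share over every document, then sort. Here is a minimal Python sketch with invented numbers. It assumes the topic model’s output has already been loaded as one list of proportions per document, and `rank_topics` is just an illustrative name.

```python
from statistics import mean


def rank_topics(doc_topic_weights):
    """Rank topics by mean proportion across documents, highest first.

    doc_topic_weights is a list with one entry per document; each entry is a
    list in which position k holds the proportion of topic k in that document.
    """
    n_topics = len(doc_topic_weights[0])
    means = {k: mean(doc[k] for doc in doc_topic_weights) for k in range(n_topics)}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Invented proportions for three documents and three topics.
weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.7, 0.1],
]
ranking = rank_topics(weights)
for topic, avg in ranking:
    print(f"topic {topic}: mean weight {avg:.3f}")
```

Grouping the documents by month before averaging would give the kind of series that a sparkline plots.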