(I’ve been AWOL for a long time now – I’ve been settling in at UW, and between teaching and working on some manuscripts, I’ve finally found some time to tinker.)
A few weeks ago, the Internet Archive announced its newest collection: a massively searchable database of news clips. Each clip has its closed-captioning transcription available for searching, and a thirty-second blurb from it can be played under fair use. A frustrating thing for a postwar historian is how inaccessible TV has traditionally been as a source: costly, unsearchable, and time-consuming to work through. As a result, historians have usually fallen back on newspapers and the like.
Check it out yourself: it’s addictive.
One downside of this collection, however, is that it’s just so big. “Canada” alone pulls up over 47,000 hits since 2009. Computational methods are going to be required. But think of the potential: we could harness the power of TV news to get a sense of how a given topic came up in the eternal 24-hour news cycle.
Building on my theme of using Wolfram|Alpha to figure out things about the past, I wondered if I could write a program that could take a document or secondary source, extract all of the dates, and then let you know about that day’s weather. I envision this eventually becoming a reader that pops up weather data as you’re reading something, giving you some added context about the date being mentioned. It would have to be automated, though; otherwise you might not want to use it!
Luckily, Mathematica has date recognition functions built in. As a proof of concept, I decided to test it out on the Wikipedia page for Stephen Harper (Canada’s current Prime Minister). We’ll try it out on primary documents below. Some jigging is required depending on the date format, but I could easily write a function that grabs all of the dates. (more…)
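Mathematica’s built-in date recognition does the heavy lifting in the post itself; as a rough illustration of the idea (not the post’s actual code), here is a minimal Python sketch that pulls one common date format out of running text. The sample sentence and the single regex pattern are mine, for demonstration only — a real document would need many more patterns.

```python
import re
from datetime import datetime

# A crude stand-in for Mathematica's built-in date recognition:
# find "Month Day, Year" style dates with a regex, then parse
# them into datetime objects. Only one format is handled here.
DATE_PATTERN = re.compile(
    r"\b(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+(\d{1,2}),\s+(\d{4})\b"
)

def extract_dates(text):
    """Return a list of datetime objects for every date found in text."""
    dates = []
    for month, day, year in DATE_PATTERN.findall(text):
        dates.append(datetime.strptime(f"{month} {day} {year}", "%B %d %Y"))
    return dates

sample = ("Stephen Harper was born on April 30, 1959, and became "
          "Prime Minister on February 6, 2006.")
print(extract_dates(sample))
```

With a list like this in hand, each date could then be fed onward to a weather lookup, which is where the Wolfram|Alpha querying comes in.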
(1) Extract all of the Proper Nouns, Singular (NNP), which should hopefully contain most of the names in the document. I would do this using Stanford NLP, which is part of the SEASR MEANDRE workbench.
(2) Import the data into Mathematica, creating a list of names.
(3) Using Mathematica, query Wolfram|Alpha, extracting the relative popularity of the name as a percentage of births.
(4) Plot the data. (more…)
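The first step above relies on Stanford NLP’s part-of-speech tagging; as a toy stand-in (again in Python, not the workflow’s actual tools), here is a crude heuristic that treats capitalized words not at a sentence start as candidate names. The sample text is invented. Steps (3) and (4) would then happen in Mathematica, which has a built-in WolframAlpha function for querying the engine.

```python
import re
from collections import Counter

def crude_proper_nouns(text):
    """Very rough stand-in for an NNP tagger: count capitalized
    words that do not begin a sentence. A real pipeline would use
    a proper part-of-speech tagger instead."""
    tokens = re.findall(r"[A-Za-z]+|[.!?]", text)
    names = Counter()
    sentence_start = True
    for tok in tokens:
        if tok in ".!?":
            sentence_start = True
            continue
        if tok[0].isupper() and not sentence_start:
            names[tok] += 1
        sentence_start = False
    return names

text = ("The archive mentions Harper often. It also mentions "
        "Trudeau and Harper again.")
print(crude_proper_nouns(text))
```

The resulting counts are exactly the kind of list-of-names structure that step (2) imports for the popularity queries.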
(and here my series continues… I’m blogging through August mainly to keep the work going when it can be so easy to sneak away, and this is more of an internal diary than anything else!)
Mathematica is made by the same company, Wolfram Research, that brings us Wolfram|Alpha – the computational knowledge engine that powers parts of Siri, as well as an all-around fun resource for historians, tinkerers, or, well, anybody (I’ve written about it before). As a diversion, I thought I would start comparing economic data to the topics that I am finding through MALLET.
Using free-form input, let’s get annual figures for unemployment in the US, 1964-1989.
With that done, we can then manipulate our data – getting them into comparable datasets – and begin to run correlations. Let’s see if we can find correlations in topic occurrences against the unemployment rate… (more…)
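The comparison itself is just a correlation between two annual series. Mathematica has a Correlation function for this; as a dependency-free illustration, here is a hand-rolled Pearson correlation in Python. The topic-share and unemployment numbers below are invented toy values, purely to show the shape of the comparison.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand so the
    sketch needs no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up toy numbers: annual topic proportion vs. unemployment rate.
topic_share  = [0.02, 0.03, 0.05, 0.08, 0.07]
unemployment = [3.5, 4.4, 5.5, 7.0, 6.1]
print(round(pearson(topic_share, unemployment), 3))
```

A value near +1 would suggest the topic tracks unemployment year by year; near 0, no linear relationship. Correlation is only a starting point, of course – it says nothing about causation.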
In yesterday’s post, I introduced some of the work I’ve been doing with MALLET and provided a list of topics, sparklines, etc. Today I wanted to pull some of that data out and see what trends we could find in the database. Many of the topics found had simple little spikes: a single year where the topic was significant, but not part of a broader trend. Twenty, however, either rose or fell over time, and looked like they were worth more investigation.
I divide them into four groups: (1) those that became more prominent between 1964 and 1989; (2) those that became less prominent; (3) those that were always prominent; and (4) those that displayed other sorts of statistical behaviour. Let’s take a closer look.
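One simple way to sort topics into groups like these is by the slope of a least-squares line fitted to each topic’s yearly share. The sketch below (mine, not the post’s method – the post eyeballs the sparklines) uses arbitrary placeholder cutoffs for what counts as “rising” or “prominent”.

```python
def classify_topic(series, slope_cutoff=0.001, high_mean=0.05):
    """Crude four-way classification of a topic's yearly share:
    'rising', 'falling', 'always prominent' (flat but high mean),
    or 'other'. Cutoff values are arbitrary placeholders."""
    n = len(series)
    mx = (n - 1) / 2                      # mean of x = 0..n-1
    my = sum(series) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(series))
    den = sum((x - mx) ** 2 for x in range(n))
    slope = num / den                     # least-squares slope
    if slope > slope_cutoff:
        return "rising"
    if slope < -slope_cutoff:
        return "falling"
    if my >= high_mean:
        return "always prominent"
    return "other"

print(classify_topic([0.01, 0.02, 0.03, 0.04]))
```

Thresholds like these would need tuning against the real topic proportions; the point is only that the four groups can be made operational rather than judged by eye.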
In a future post, I will go into detail with individual songs to find exemplars of these topics. This post is more speculation about what might be happening, helping us think about how these topics could aid our scholarship… (more…)
As I spend the last pre-teaching month of the summer trying to program at least half a day every day (the other half is book and article writing/revising time), I’ve been having a lot of fun tinkering with this material. Topic modelling is proving even more fruitful than keyword searching, mainly because the data comes to me rather than the other way around.
The only downside of MALLET is that the output can be a bit opaque without putting it into another environment. Shawn Graham has a great series on using the Gephi GUI to process it (if you want to use MALLET yourself, his how-to guide is an amazing resource; we have a forthcoming piece in the Programming Historian 2 that will also help new users). I’ve been importing it into Mathematica, my own programming platform of choice. Below is my first level of visualization, a series of sparklines with topics. After this, I can take the number, plug it into another Mathematica cell, and look at the findings in a bit more detail. (more…)
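For readers who don’t have Mathematica, the same sparkline idea can be faked in plain text. This is not the post’s code – just a sketch assuming you have already averaged MALLET’s per-document topic proportions into one number per year (the sample values below are invented).

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Render a list of numbers as a unicode sparkline: a plain-text
    stand-in for the Mathematica sparklines described above."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0          # avoid dividing by zero on flat series
    return "".join(
        BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values
    )

# Toy topic-by-year proportions, invented for illustration.
print(sparkline([0.01, 0.02, 0.05, 0.09, 0.04, 0.02]))
```

One line per topic and the spikes, rises, and falls discussed above jump out at a glance, which is the whole appeal of the sparkline view.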