Using Stanford NLP and Wolfram|Alpha to Help Date a Document?

The popularity of the name ‘Daniel.’ Full disclosure: this idea began when I was running spurious comparisons between lyrics data and names.

Yesterday, I spent most of the morning playing around with Wolfram|Alpha and integrating it with some of my topic models. As a brief diversion (based on a comment William Turkel made at one of his Going Digital in Two Hours seminars a few years ago), I wondered if I could try to learn more about a document using extracted names and birth popularity.

This is just playing around with the concept.

The plan was as follows:

(1) Extract all of the Proper Nouns, Singular (NNP), which should hopefully contain most of the names in the document. I would do this using Stanford NLP, which is part of the SEASR MEANDRE workbench.
(2) Import the data into Mathematica, creating a list of names.
(3) Using Mathematica, query Wolfram|Alpha, extracting the relative popularity of the name as a percentage of births.
(4) Plot the data.

It wouldn’t be perfect. Some names would be missed, and the data covers only births, so it can’t date a document by the current ages of the people in it; at best it gives a sense of when they might have been born. There are a host of other issues, too. But at a minimum, if you had a document and had no clue of its provenance, this might give you one more bit of information to go on.

So, let’s see how it works in practice:

The MEANDRE workflow for extracting names from a document.

(1) I took the following article from the front page of the Toronto Star today: “Mayor Rob Ford caught reading while driving, says ‘I’m a busy man’”. Buffoonery aside, there would be names here: the mayor, the reporter, his brother, a local radio host, etc. It was short and topical.

Putting it into SEASR, I modified one of the basic flows (Demo POS) to just include NNPs. You can get a list of all the different POSes that it can extract here.

I then had a list of results.

(2) Rather awkwardly (in terms of the process), I imported it into Mathematica. A regular expression, [a-zA-Z]+, extracted just the words; I dropped the NNP tags that were attached, which left a list of names like so:

{"Daniel", "Dale", "Urban", "Affairs", "Reporter", "Mayor", "Rob", "Ford", "Gardiner", "Expressway", "Tuesday", "Twitter", "Ford", "Yeah", "Yeah", "Gardiner", "I", "m", "Chicago", "September", "Ford", "Ford", "Ford", "Ford", "Ford", "Ford", "Cadillac", "Escalade", "Ryan", "Haughton", "Haughton", "Dean", "Blundell", "Ford", "DeanBlundell", "Ford", "Gardner", "Haughton", "Haughton", "Adrienne", "Batra", "Ford", "s", "Toronto", "Sun", "Police", "Const", "Clint", "Stibbe", "Stibbe", "Ford", "Councillor", "Doug", "Ford", "Doug", "Ford", "July", "Ottilie", "Mason", "Rob", "Ford", "Ford", "June", "Ford", "Ford", "DON", "BOSCO", "July", "SUV", "Ford", "Tuesday"}
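The import step can be sketched roughly as follows. This is a minimal illustration, not the exact code used: it assumes the tagger output arrives as plain text with each word followed by its NNP tag, and `raw` is a made-up sample because the real SEASR output format isn’t shown here.

```mathematica
(* Hypothetical sample of tagged output; the real SEASR format may differ. *)
raw = "Daniel NNP\nDale NNP\nUrban NNP\nAffairs NNP\nRob NNP\nFord NNP";

(* Pull out every run of ASCII letters, as with the [a-zA-Z]+ expression. *)
tokens = StringCases[raw, RegularExpression["[a-zA-Z]+"]];

(* Drop the tag itself so only the tagged words remain. *)
results = DeleteCases[tokens, "NNP"]
```

A letters-only pattern splits on apostrophes and other punctuation, which is probably why fragments such as "I", "m" and "s" show up in the list above.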

Then, with a simple loop, I queried Wolfram|Alpha for each of them:

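bucket = {};
(* Ask Wolfram|Alpha for each name's popularity time series *)
Do[
 query = "popularity of name " <> name;
 result = WolframAlpha[
   query, {{"History:GivenNameData", 1}, "TimeSeriesData"}];
 AppendTo[bucket, result];
 , {name, results}];
(* Words that are not given names come back as Missing[...]; drop them *)
cleaned = DeleteCases[bucket, _Missing];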

Many of the above come back empty, as words like “Affairs” are unsurprisingly not common names. But once we get those out, we can plot the results:
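Because the same token (Ford, for instance) appears many times in the list, one could also trim it before querying to avoid repeated lookups. This is a small sketch, not something that was run for this post. The two-character cut-off is an arbitrary assumption meant to drop fragments such as "I", "m" and "s", and `sample` and `candidates` are illustrative names.

```mathematica
(* A short excerpt of the extracted list, used here only for illustration. *)
sample = {"Ford", "Rob", "Ford", "I", "m", "Doug", "Ford", "s"};

(* Remove repeats, then drop fragments of one or two letters. *)
candidates = Select[DeleteDuplicates[sample], StringLength[#] > 2 &]
```

For this excerpt, three distinct tokens are left to look up instead of eight.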

DateListPlot[Tooltip[cleaned], Joined -> True, PlotRange -> All]

As I noted above, what you’re mostly seeing here is the dates of birth of the various actors, and even then, the problem of outliers is a considerable one. In this case, it’s a bit disappointing: you might date most of the actors to the 1980s, when they were all born before that period. But the sharp uptick just after 1970 is a bit promising… it turns out that the youngest name (the peak) is the reporter’s first name, Daniel: a peak name in 1984, declining thereafter.
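Rather than reading the peak off the plot, the peak year of a single series could be computed directly. The sketch below assumes a series is a list of {year, value} pairs; the data Wolfram|Alpha actually returns may use full date lists instead, and the numbers here are placeholders, not real popularity figures.

```mathematica
(* Placeholder values only, not real popularity figures. *)
toySeries = {{1950, 0.4}, {1960, 0.9}, {1970, 1.6}, {1980, 1.1}};

(* Pick the {year, value} pair with the largest value. *)
peak = First[MaximalBy[toySeries, Last]];
peakYear = First[peak]
```

Mapping something like this over every cleaned series would give one peak year per name, which makes the outliers easier to spot than overlapping curves do.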

Aha – there’s the problem! Last names are throwing this off: many of these last names have also been used as first names, though they would have seemed odd as such. I don’t yet know of an automated way to separate them, so let’s manually pick out the first names:

names = {"daniel", "rob", "ryan", "dean", "adrienne", "clint", "doug",

And… (the x-axis is years; the y-axis is the percentage of babies given each name in a given year)

A bit better! We can now see when the actors named in this article were probably born.

It’s still too messy. We have the problem of first names versus last names, but we’re close to getting this fairly automated…

Still, it’s a neat trick that I hope to use on some digitized documents soon.
