Using Extracted Names to Explore Web Archives

It should be no surprise that in 2009, the prominent NDP leader Jack Layton was the person most frequently mentioned on the party’s site.

Yesterday morning, I used Mathematica’s new geographic functionality with our Warcbase NER output to generate maps based on the locations mentioned in web archives. I then wondered: what could we do with the frequency of individuals?

One challenge I have often encountered with NER output is the “what now?” question. We generate fantastic lists of frequently appearing people, organizations, and locations, but apart from exploring tables or feeding them into network analysis, I haven’t found the results terribly compelling (one exception is the Trading Consequences ‘location cloud’ visualization, which we’re currently trying to borrow from).

Mathematica has powerful integration with Wolfram Alpha’s ever-growing databases, which contain large amounts of information on influential and even not-so-influential people: the sorts of people who are likely to show up in political web archives like the ones I am using. Consider Prime Minister Stephen Harper’s page and the sheer amount of data on it, or even the page for the perhaps less internationally notable politician Rona Ambrose. I wondered if we could connect our NER frequency output with this database to find interesting ways to visualize the frequency with which people appear.

We can, of course. In a nutshell, the command:

Interpreter["Person"]["Stephen Harper"]

will result in the “entity” Stephen Harper (in raw input form: “StephenHarper::54v64”), with a lengthy list of properties:

[Screenshot: property list for the Stephen Harper entity]

One of these is ‘Image’, which can be accessed with this command:

PersonData[Entity["Person", "StephenHarper::54v64"], "Image"]

The idea was then to extract these images and size them according to the frequency with which each person appears.

It isn’t perfect: some of the extracted entities have clear errors (e.g. Stephen HarperLaureen, which has somehow conjoined our PM’s name with the first name of his wife), and others won’t appear in the knowledge database. If we wanted to use this on a more traditional social or cultural historical archive, which would have many people not in the database, we’d either have to rely on Google Image Search or, more likely, do more human intervention.
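One way to keep track of the names that fail to resolve is to split the results into resolved and unresolved lists, so the latter can be reviewed by hand. This is a minimal sketch, not the notebook's code: the frequencies, the second entity identifier, and the Failure entry are made-up stand-ins for what Interpreter might return.

```mathematica
(* Illustrative {result, frequency} pairs. The identifiers and counts are
   placeholders; Failure[...] stands in for a name that could not be
   interpreted. *)
results = {
   {Entity["Person", "StephenHarper::54v64"], 40},
   {Failure["InterpretationFailure", <||>], 7},
   {Entity["Person", "HypotheticalPerson::xxxxx"], 3}};

(* Keep the pairs whose first element is an Entity *)
resolved = Cases[results, {_Entity, _}];

(* Everything else is set aside for manual review *)
unresolved = Cases[results, {Except[_Entity], _}];
```

Keeping the unresolved pairs, rather than discarding them, leaves a list that a person could work through when the collection contains many people the database does not know.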

But for collections with notable entities – e.g. in this case, the Canadian Political Party collection, or say the United States Governmental Web Archive – I think this is a fruitful approach. It also aggregates entities, which is handy (e.g. NER sometimes outputs the same name twice).
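If several NER strings resolve to the same entity, their counts could be combined before any scaling is done. The sketch below shows one way to do that, assuming a list of {entity, count} pairs; plain strings stand in for Entity objects, and the counts are invented for illustration.

```mathematica
(* Hypothetical {entity, count} pairs; strings stand in for Entity objects *)
pairs = {{"StephenHarper", 40}, {"JackLayton", 25}, {"StephenHarper", 12}};

(* Group by the first element and total the counts within each group *)
aggregated = GroupBy[pairs, First -> Last, Total];

(* Back to a list of {entity, total} pairs, largest total first *)
sorted = Reverse[SortBy[List @@@ Normal[aggregated], Last]];
```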

To scale images, we can use Mathematica’s ImageCollage[] command, which lets you scale images based on a frequency value. I’ll paste the code below, but the results were very promising.
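As a small illustration of how the weighting works, ImageCollage accepts a list of weight -> image rules and gives images with larger weights a larger share of the collage. In this sketch two solid-colour images stand in for portraits, and the weights are arbitrary.

```mathematica
(* Two solid-colour placeholder images stand in for portraits *)
big = ConstantImage[Red, {100, 100}];
small = ConstantImage[Blue, {100, 100}];

(* Each rule is weight -> image; the weights here are arbitrary *)
collage = ImageCollage[{30 -> big, 5 -> small},
  Background -> White, ImagePadding -> 1]
```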

Here are the people extracted from the New Democratic Party’s site from 2009:

[Screenshot: image collage of people extracted from the New Democratic Party’s site, 2009]

The low resolution will be fixed (dealing with an image processing issue right now). But in a nutshell, we see the large portrait of then party leader Jack Layton, the current PM (and opponent of this party) Stephen Harper, and other politicians.

Dion Francis DiMucci

There are some glaring errors. Stéphane Dion, the Liberal opposition leader between 2006 and 2008, isn’t in the database so Mathematica assigns the entity “Dion Francis DiMucci.” On the bright side, I now know about Dion Francis DiMucci. The database is improving daily, so let’s keep an eye on this for now.

As a front page of a finding aid, this might be useful, however.

Consider the Conservative Party of Canada, also from 2009:

[Screenshot: image collage of people extracted from the Conservative Party of Canada’s site, 2009]

Here we see their leader and Prime Minister, Stephen Harper, but also President Obama, then-finance minister Jim Flaherty, the opposition leader at the time, Michael Ignatieff, and so forth. I was surprised by how many even marginal Canadian politicians were in the database, which is really useful.

I also did the Green Party of Canada – unfortunately, Elizabeth May wasn’t in the database (Queen Elizabeth was), but it was notable to see just how much content about George Bush appeared in the crawl despite it being 2009. It appears that they hosted a blog for anybody to write on, which contained a lot of Bush-centered content. They also didn’t maintain their pages as much as the other two parties, which tend to not keep dated content around.

So what can we learn from this?

While it’s not perfect, we’re getting a quick and dirty sense of the relative frequency of prominent people in web archives. It’s the sort of thing that could adorn an overview aid, along with geographic visualizations, word clouds (if you would permit me), frequency diagrams, and network analyses.

More importantly, it connects us to the Wolfram ecosystem. I chose images because it’s quick and dirty. But I could have also decided to look at “Places of Birth” – what parties had prominent people born where? Or their occupations? (politician) Their genders? And so forth. I suspect more entities will be available soon, including their popularity on Wikipedia!
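A tally of any such property could be sketched along these lines. This is an illustration under assumptions, not something from the notebook: tallyProperty is a hypothetical helper, and its lookup argument exists only so the counting step can be tried without a connection to the Wolfram Knowledgebase.

```mathematica
(* Count how often each value of a property occurs across a list of entities.
   lookup defaults to EntityValue, which needs access to the Wolfram
   Knowledgebase; it is a parameter only so the tally can be tried offline. *)
tallyProperty[entities_List, prop_String, lookup_ : EntityValue] :=
  Counts[lookup[#, prop] & /@ entities]

(* Offline check with a stub lookup and placeholder names *)
stub = Function[{entity, property}, If[entity === "A", "Female", "Male"]];
tallyProperty[{"A", "B", "C"}, "Gender", stub]
```

Against real data, the analogous call would be something like tallyProperty[people[[All, 1]], "BirthPlace"], using the people list from the notebook at the end of this post. "BirthPlace", "Gender", and "Occupation" are plausible property names, but they should be checked against EntityProperties["Person"] before relying on them.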

It’s also relatively easy. A few lines of code, and we’re connecting web archives to another database.

Incidentally, I also took a look at organizations – but I think we’ll have to wait a few more weeks at least. I definitely put my e-mail down, though:

[Screenshot]

Here’s my code, which is also available on GitHub. It is messy, but if you advance through the notebook one line at a time you should be able to see what I am doing.

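(* Read the person-frequency file, one record per line *)
persfreqraw = 
  Import["~/dropbox/Warcbase-NER-Visualization/conservative-200902-\
pers-freq.txt", "Lines"];

(* Turn each line into a {name, frequency} pair *)
processedpers = {StringTrim[
       StringSplit[#, i : ("" | "$" ~~ NumberString) :> i]][[3]], 
     ToExpression[
      StringTrim[
        StringSplit[#, i : ("" | "$" ~~ NumberString) :> i]][[
       2]]]} & /@ persfreqraw;

persons = processedpers[[All, 1]];

(* Interpret the 150 most frequent names as Person entities *)
personentities = Interpreter["Person"][#] & /@ Take[persons, 150];

(* Pair each interpretation with its frequency
   (fixed: this previously referred to an undefined processedfreq) *)
personfreq = 
  Transpose@{personentities, Take[processedpers[[All, 2]], 150]};

(* Keep only the rows where the name resolved to an Entity *)
entitypos = Position[personfreq, _Entity, Infinity][[All, 1]];

people = personfreq[[entitypos]];

(* Look up an image for each entity; a Missing[...] is returned where none exists.
   An earlier filtered definition of images was dropped because this line
   overwrote it and filtering here would misalign images and frequencies. *)
images = PersonData[#, "Image"] & /@ people[[All, 1]];

imagefreq = Transpose@{images, people[[All, 2]]};

(* Drop the rows that have no image *)
delpos = Position[imagefreq, _Missing][[All, 1]];

justimages = Delete[imagefreq, Partition[delpos, 1]];

(* Build frequency -> image rules and size each image by its frequency *)
ImageCollage[Rule @@@ Reverse[justimages, 2], Background -> White, 
 ImagePadding -> 1]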
