Plotting and Comparing Locations Mentioned in a Web Archive: Warcbase, OpenRefine, and Google Fusion Tables


Locations mentioned in North America in the Canadian Political Party archive collected in November 2015.

As part of my McLuhan fellowship, I’ve been laying the groundwork for the work we’ll be doing over the fall and winter by generating sets of derivative datasets to run our analyses on. As noted, one of the main goals of the fellowship is to develop new approaches to comparing the content found within web archives (to see, for example, whether, and in what respects, a web archive created by curators differs from a web archive created by users on a hashtag).

I’m not providing much analysis here, since that’s what we’ll be doing over the year; this post is mostly focused on the data wrangling.

One of the approaches that I’m hoping to use is to compare the named entities present within web archives. So, for example, what locations are discussed in a web archive crawled from a seed list of URLs tweeted by everyday users on a hashtag versus the web archive crawled by a seed list manually generated by a librarian?

Named entity recognition (NER) comes with a host of disadvantages: it can be very messy, it requires lots of computational time, and once you begin to zoom into the results, interpreting them gets tricky. For example, when a Canadian web archive discusses Paris, it probably means Paris, France, not Paris, Ontario. But when it discusses Peterborough, it means Peterborough, Ontario, not Peterborough in the United Kingdom. So it’s not as simple as always favouring the Canadian location, and all of this needs to be kept in mind when we do this kind of work.

There are projects devoted to cleaning up NER output, especially around context, so perhaps later on we’ll look into implementing them. But I suspect that web archives are still an edge case for NER tools, so perhaps the messiness is an understandable artifact of the system.

The Three Case Studies

I used three web archives for this:

In a Joint Conference on Digital Libraries paper, Nick and I compared the URLs that appeared in these collections. You can read our paper there. What I’m hoping to do is extend that work to explore content differences as well.

How do we extract locations from Web archives?

There are many ways. Warcbase itself has a workflow that uses Stanford NER to extract entities from a web archive. Alternatively, you can generate the plain text and then use something like William Turkel’s named entity extraction walkthroughs.

In this case, I did the latter because I wanted to move through everything step by step. I generated the plain text using warcbase, followed Turkel’s NER walkthrough, and ended up with a set of frequency lists like the one below.

My code to do so resembled this:

# CPP 201508

# run Stanford NER over the plain text (ner.sh ships with the Stanford NER distribution)
stanford-ner/ner.sh cpp-fulltext-201508.txt > cpp-fulltext-201508-ner.txt
# strip the /O tags that Stanford NER attaches to untagged tokens
sed 's/\/O / /g' < cpp-fulltext-201508-ner.txt > cpp-fulltext-201508-ner_clean.txt

# extracting locations and counting them
egrep -o -f locpattr cpp-fulltext-201508-ner_clean.txt > cpp-fulltext-201508-ner_loc.txt
sed 's/\/LOCATION//g' cpp-fulltext-201508-ner_loc.txt | sort | uniq -c | sort -nr > all-cpp-fulltext-201508-ner_loc_freq.txt

Messy, but at every step of the way we have some files to play with.

The results then look like this (in the case of all-cpp-fulltext-201508-ner_loc_freq.txt):

140583 Canada
19065 Ontario
10429 Alberta
9728 Canada ,
9307 British Columbia
7843 Toronto
4890 Vancouver
4830 Russia
4572 US
4524 Ottawa

Note a few problems: “US” refers to the United States, there are two Canadas (one with a stray trailing comma), and eventually we’ll also see abbreviations (e.g. “Ont.” and “Man.” for Ontario and Manitoba) and alternate spellings (“United States” versus “United States of America”).
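A first pass at these variants can be sketched with sed, assuming one place name per line (as in the extracted location list with the /LOCATION tags stripped); the file name and the alias list here are illustrative, not exhaustive:

```shell
# normalize common variants before counting; locations.txt is a
# hypothetical file of place names, one per line, and the alias
# list below would need to be extended by hand
sed -e 's/ *,$//' \
    -e 's/^Ont\.$/Ontario/' \
    -e 's/^Man\.$/Manitoba/' \
    -e 's/^United States of America$/United States/' \
    -e 's/^US$/United States/' \
    locations.txt
```

Feeding the normalized list back through `sort | uniq -c | sort -nr` then yields a cleaner frequency table.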

To clean that up, my RAs (cheers to Patrick O’Leary and Katie MacKinnon) turned to OpenRefine. OpenRefine doesn’t handle exact duplicates (strangely enough), but its clustering does a good job of finding near-matches: combining “Canada” and “Canada ,”, “United States” and “United States of America”, and so forth. There are several good walkthroughs online showing how to do this.

Finally, to eliminate duplicates, I created a simple pivot table in Excel.
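For what it’s worth, the same merge-and-sum that the pivot table performs can be sketched on the command line, assuming a frequency file in the `uniq -c` style shown above (a count followed by a place name on each line):

```shell
# merge duplicate place names and sum their counts; expects lines in
# "count name" order, as produced by `sort | uniq -c`
awk '{
  count = $1
  $1 = ""                 # drop the count, leaving the place name in $0
  sub(/^ +/, "")          # strip the leading space left behind
  sub(/ *,$/, "")         # fold variants like "Canada ," into "Canada"
  totals[$0] += count
}
END {
  for (place in totals) print totals[place], place
}' all-cpp-fulltext-201508-ner_loc_freq.txt | sort -nr
```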

How do we plot locations?

I have never found a very simple way to plot locations. There’s QGIS, but it’s ungainly and not terribly user-friendly, and we ideally want something that easily publishes to the web. Many of the geocoding APIs seem to change regularly, and my rough sense is that the situation today is actually worse than it was three or four years ago in terms of free, easy-to-access geocoding APIs. Oh well.

I thought Google Fusion Tables might be a useful thing to try.

For the three web archives, we repeated the steps and imported the data into Google Fusion Tables. We used the geocoder built right into Fusion Tables and created plots for each of the web archives. There are certainly limitations to this:

  • As noted, the geocoder is limited;
  • The geocoder outputs locations as points, with only 9 different marker shapes and sizes to choose from, so it’s not the most user-friendly (I’m colour blind, so I feel this acutely); it would be better if we could use a gradient;
  • Heatmaps can’t be published, only viewed. That’s a shame (plus the heatmap is skewed by the outlier values).

They’re below. Please click on them if you want to explore.

Locations found in the CPP web crawl from August 2015. Click on the image to explore.

Locations found in the CPP web crawl from November 2015. Click on the image to explore.

Locations found in the ELXN42 web crawl. Click on the image to explore.

Next Steps?

A few things to think about:

  • First, we need to do the comparing. We’ll need to think about the best way to do that.
  • Second, overlaying all the locations on one map seems to make sense to me. But the trick will be to make that intelligible.
  • Third, it’s a good way to begin thinking about how to quickly visualize a web archive – could these serve as first passes?
  • Finally, I wonder if we could eventually consider linking the content of web archives to these maps. Imagine wanting all the content that discusses “Winnipeg” and being able to pull it up for further display.

More possibilities than answers at this point, but that’s the fun of exploratory research, right?

