As part of my McLuhan Fellowship, I’ve been laying the groundwork for the work we’ll be doing over the Fall and Winter by generating derivative datasets to run our analyses on. As noted, one of the main goals of the Fellowship is to develop new approaches to comparing the content found within web archives: to see, for example, whether a web archive created by curators differs, and in what respects, from a web archive created by users on a hashtag.
I’m not providing much analysis here because that’s what we’ll be doing over the year, so this is mostly just focused on the data wrangling.
One of the approaches that I’m hoping to use is to compare the named entities present within web archives. So, for example, what locations are discussed in a web archive crawled from a seed list of URLs tweeted by everyday users on a hashtag versus the web archive crawled by a seed list manually generated by a librarian?
Named entity recognition (NER) comes with a host of disadvantages: it can be very messy, it requires a lot of computational time, and once you begin to zoom in on results, much of it is tricky to interpret. For example, when a Canadian web archive discusses Paris, it probably means Paris, France, not Paris, Ontario. But when it discusses Peterborough, it means Peterborough, Ontario, not Peterborough in the United Kingdom (so it’s not as simple as always favouring the Canadian location). All of this needs to be kept in mind when we do this kind of work.
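The Paris/Peterborough point can be made concrete with a tiny sketch: a per-name preference table that records the most likely referent of each ambiguous name for this particular collection, since no blanket "prefer Canada" rule works. The file name and every entry below are entirely hypothetical:

```shell
# Toy per-name preference table (hypothetical values): for a Canadian
# collection, "Paris" most likely means the French city, while
# "Peterborough" most likely means the Ontario one.
printf 'Paris\tParis, France\nPeterborough\tPeterborough, Ontario\n' > preferred.tsv

# Resolve extracted names against the table, passing unknown names through.
printf 'Paris\nPeterborough\nToronto\n' \
  | awk -F'\t' 'NR==FNR { pref[$1] = $2; next }
                { if ($0 in pref) print pref[$0]; else print $0 }' preferred.tsv -
```

This is only an illustration: real disambiguation needs the surrounding context, which is what dedicated geoparsing projects attempt.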
There are projects devoted to cleaning up NER output, especially around context, so perhaps later on we’ll look into implementing them. But I suspect that web archives are still an edge case for NER tools, so perhaps the messiness is an understandable artifact of the system.
The Three Case Studies
I used three web archives for this:
- The #elxn42 crawl, a web archive created by Nick Ruest using the URLs tweeted by users on that hashtag. #elxn42 was the hashtag used for the 42nd Canadian federal election, held between August and October 2015. You can read more about the tweets and beyond in an article Nick and I published in the Code4Lib Journal.
- The Canadian Political Parties and Political Interest Groups collection, crawled in August 2015. If you’ve read my blog, you know all about this, but you can discover more by reading the about page at webarchives.ca.
- The Canadian Political Parties and Political Interest Groups collection, crawled November 2015. Same as above, just collected later.
In a Joint Conference on Digital Libraries paper, Nick and I compared the URLs that appeared in these collections – you can read our paper there. What I’m hoping to do is extend that work to explore content differences as well.
How do we extract locations from Web archives?
There are many ways. If you just want the NER results, warcbase has a workflow that uses Stanford NER to extract entities from a web archive. Alternatively, you can generate the plain text and use something like William Turkel’s named entity extraction walkthroughs.
In this case, I did the latter because I wanted to move through everything step by step: I generated the plain text using warcbase, used Turkel’s NER walkthrough, and ended up with a set of frequency lists.
My code to do so resembled this:
```shell
# CPP 201508
stanford-ner/ner.sh cpp-fulltext-201508.txt > cpp-fulltext-201508-ner.txt
sed 's/\/O / /g' < cpp-fulltext-201508-ner.txt > cpp-fulltext-201508-ner_clean.txt

# extracting locations and counting them
egrep -o -f locpattr cpp-fulltext-201508-ner_clean.txt > cpp-fulltext-201508-ner_loc.txt
cat cpp-fulltext-201508-ner_loc.txt | sed 's/\/LOCATION//g' | sort | uniq -c | sort -nr > all-cpp-fulltext-201508-ner_loc_freq.txt
```
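The `locpattr` pattern file isn’t shown above; following the shape of Turkel’s walkthrough, it would be a single egrep pattern along these lines (a reconstruction, not the original file):

```shell
# Hypothetical reconstruction of the 'locpattr' pattern file: match runs
# of tagged tokens ending in /LOCATION, so multi-word places such as
# "British Columbia" come out as one entity. (This also explains stray
# entries like "Canada ," -- a tagged comma gets swept into the run.)
cat > locpattr <<'EOF'
([[:alpha:],.-]+/LOCATION ?)+
EOF

# Quick check against a line of Stanford NER inline-tagged text:
echo "I visited British/LOCATION Columbia/LOCATION last year" | egrep -o -f locpattr
```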
Messy, but at every step of the way we have some files to play with.
The results then look like this (in the case of the August 2015 CPP crawl):
```
140583 Canada
 19065 Ontario
 10429 Alberta
  9728 Canada ,
  9307 British Columbia
  7843 Toronto
  4890 Vancouver
  4830 Russia
  4572 US
  4524 Ottawa
```
Note a few problems: “US” refers to the United States, there are two Canadas, and eventually we’ll also see abbreviations (e.g. “Ont.” and “Man.” for Ontario and Manitoba) and alternate spellings (“United States” and “United States of America”).
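The easy cases can be caught with a rule-based sed pass before any heavier cleanup. A sketch, where the sample input and the variant list are assumptions covering only the examples above:

```shell
# Small sample standing in for cpp-fulltext-201508-ner_loc.txt:
printf 'Canada/LOCATION\nCanada ,/LOCATION\nUS/LOCATION\nUnited States/LOCATION\nOnt./LOCATION\n' > sample_loc.txt

# Strip tags, drop trailing commas/whitespace, fold known variants, recount.
sed 's/\/LOCATION//g' sample_loc.txt \
  | sed -e 's/[[:space:]]*,[[:space:]]*$//' \
        -e 's/[[:space:]]*$//' \
        -e 's/^US$/United States/' \
        -e 's/^Ont\.$/Ontario/' \
  | sort | uniq -c | sort -nr
```

The two Canadas collapse into one and “US” joins “United States”; a real variant list would need to be built up by eyeballing the frequency files.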
To clean that up, my RAs (cheers to Patrick O’Leary and Katie MacKinnon) used OpenRefine. OpenRefine doesn’t collapse exact duplicates (strangely enough), but it does a good job of finding near-matches: combining “Canada” and “Canada ,”, “United States” and “United States of America”, and so forth. There are several good walkthroughs online that show you how to do that.
Finally, to eliminate duplicates, I created a simple pivot table in Excel.
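The pivot-table step can also be done at the command line: an awk sketch (the sample rows are illustrative) that sums the counts for rows sharing a location after cleaning:

```shell
# Sample frequency rows where cleaning has produced duplicate locations:
printf '140583 Canada\n9728 Canada\n4890 Vancouver\n' > sample_freq.txt

# Sum counts per location (awk's associative array plays the pivot role).
awk '{ count = $1; $1 = ""; sub(/^ /, "");
       totals[$0] += count }
     END { for (loc in totals) print totals[loc], loc }' sample_freq.txt \
  | sort -nr
# Output:
#   150311 Canada
#   4890 Vancouver
```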
How do we plot locations?
I have never found a very simple way to plot locations. There’s QGIS, but it’s ungainly and not terribly user friendly, and we ideally want something that publishes easily to the Web. Many geocoding APIs seem to change regularly, and my rough sense is that the situation today is actually worse than it was three or four years ago in terms of free, easy-to-access geocoding APIs. Oh well.
I thought Google Fusion Tables might be a useful thing to try.
For the three web archives, we repeated the steps and imported the data into Google Fusion Tables. We used the geocoder built right into Fusion Tables and created plots for each of the web archives. There are certainly limitations to this:
- As noted, the geocoder is limited;
- The geocoder outputs locations as points, which come in only nine different shapes and sizes – so it’s not the most user friendly (I’m colour blind, so I feel your pain). It would be better if we could use a gradient;
- Heatmaps can’t be published, only viewed. That’s a shame (plus the heatmap is skewed by the outlier values).
They’re below. Please click on them if you want to explore.
A few things to think about:
- First, we need to do the comparing. We’ll need to think about the best way to do that.
- Second, overlaying all the locations on one map seems to make sense to me. But the trick will be to make that intelligible.
- Third, it’s a good way to begin thinking about how to quickly visualize a web archive – could these serve as first passes?
- Finally, I wonder if we could eventually link the content of web archives to these maps. Imagine calling up all the pages that discuss “Winnipeg” and displaying them for further exploration.
More possibilities than answers at this point, but that’s the fun of exploratory research, right?