Press Release: Digital archive of political parties digs deep for Election 2015

(X-posted from the University of Waterloo’s Media Relations page)

If you ever suspected Canadian politicians flip-flopped on a specific issue, or wondered where they stand on another, a new online tool will help you easily find out for sure.

Professor Ian Milligan at the University of Waterloo is charting the content of millions of archived political web pages spanning the last decade, allowing the public to compare what Canadian political leaders and pundits said in the past with what they say now.

WebArchives.ca draws on collections that the University of Toronto Library has been building for a decade. Professor Milligan and his research team at Waterloo, as well as project collaborators from York University and Western University, made the data searchable and accessible, drawing on code that staff at the British Library developed.

“We’ve got access to a collection of 50 archived websites from political parties and interest groups, allowing you to search them back to 2005,” said Milligan, a professor in the Department of History at Waterloo. “It means, for example, that anyone can find out what parties and groups said about climate change or free trade in the 2008 or 2011 election, or at any point between elections.”

Quick Post: Web Archive Sentiment Analysis with Mathematica

The Green Party tends to be happy, I guess? At least in October 2005.

I’m continuing to be impressed with the new features found within Mathematica 10.2 (see my recent posts on geo-extraction and extracting person entities). Sentiment analysis is a snap, although the findings will probably need a bit more exploration. We’re trying it out on the Canadian Political Party and Political Interest Groups collection, which you can also play with at webarchives.ca.

In short, “positive” sentiment analysis within a political party tends to find happy taglines, advertisements for community meetings that really do stress “fun” and “entertainment,” and announcements referring to great “fanfare” and meetings at places. The Green Party, Canada’s smallest major party, had a lot of pretty casual content back in 2005 (pub nights, for example), which lends itself to this.

A positive example:

The Halton Federal EDA is hosting a Fall Fun Day Oct. 16th from 11:30am-3:30pm at Lowville Park on Guelph line just south of Derry Road. We will have pumpkin decorating, corn on the cob, scavenger hunt and back pack safety tips to save your posture and your back. We welcome greens to attend this fun afternoon with us. This is the first event for this newly formed association which is showing its strength with an event so soon after its assiciations creation on August 25 2005.

On the other hand, their negative content speaks to their frustration as they try to make their way as a party:

Many people are cynical about politics. They think that nothing will ever change. They say we can never have the government we really want. The Green Party understands that frustration – we are frustrated too. But cynicism and frustration will not solve our problem – we can curse the darkness, or light a candle. Will we ever have a better choice than the lesser of two evils? Yes, if we vote for a party who will not settle for the status quo. What will we do when our lands, our waters and our ecosystems can no longer support the demands we make of them? If we manage our resources well, they will sustain us. The greatest mistake we can make is to think that we have no power.

Unfortunately, the negative analysis also swept up a lot of “Internal Server Errors” – they are rather sad, indeed. By deleting duplicate text like that, we were able to get rid of those save for one.
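If you want to try the same clean-up step outside Mathematica, here is a minimal Python sketch of the idea. It is not the code used for this post: the `normalize` and `drop_duplicates` helpers and the sample `pages` list are hypothetical, and the sketch assumes each scraped page is already available as a plain string.

```python
import re

def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def drop_duplicates(passages):
    """Return passages with repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for passage in passages:
        key = normalize(passage)
        # Skip empty strings and anything whose normalized form was seen before.
        if key and key not in seen:
            seen.add(key)
            unique.append(passage)
    return unique

# Hypothetical sample input: three copies of an error page and one real passage.
pages = [
    "Internal Server Error",
    "Many people are cynical about politics.",
    "Internal  Server Error\n",
    "internal server error",
]
print(drop_duplicates(pages))
```

Note that this kind of de-duplication still leaves one copy of each repeated page behind, so a single error page can survive into the sentiment results.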

The timings on this were impressive: about 420 seconds each to process positive and negative sentiment for an annual scrape of the Green Party of Canada’s page.

Using Extracted Names to Explore Web Archives

It should be no surprise that in 2009, the prominent NDP leader Jack Layton was the most frequent person mentioned on their site.

Yesterday morning, I used Mathematica’s new geographic functions with our Warcbase NER output to generate maps of the locations mentioned in web archives. I then wondered: what could we do with the frequency of individuals?

One challenge I have often encountered with NER output is the “what now?” question. We generate fantastic lists of frequently appearing individuals, organizations, and locations, but apart from exploring tables or feeding them into network analysis, I haven’t found the results terribly compelling (one exception is the Trading Consequences ‘location cloud’ visualization, which we’re currently trying to borrow from).

Mathematica has powerful integration with Wolfram Alpha’s ever-growing databases, which contain large amounts of information on influential and even not-so-influential people: the sorts of people who are likely to show up in political web archives like the ones I am using. Consider Prime Minister Stephen Harper’s page and the sheer amount of data on it, or even the page for the perhaps less internationally notable politician Rona Ambrose. I wondered if we could connect our NER frequency output with this database to find interesting ways to visualize the frequency with which people appear.
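The first step in any of this is simply tallying how often each name appears. As a rough illustration, and not the Mathematica code behind the post, here is a small Python sketch. The `top_people` helper and the `mentions` list are hypothetical, and the sketch assumes the NER output has already been flattened into one name string per mention.

```python
from collections import Counter

def top_people(mentions, n=10):
    """Count how often each person name appears and return the n most frequent."""
    # Strip stray whitespace so " Jack Layton " and "Jack Layton" are counted together.
    cleaned = (name.strip() for name in mentions)
    counts = Counter(name for name in cleaned if name)
    return counts.most_common(n)

# Hypothetical sample of person entities, one string per mention.
mentions = [
    "Jack Layton",
    "Stephen Harper",
    "Jack Layton",
    " Jack Layton ",
    "Rona Ambrose",
    "Stephen Harper",
]
print(top_people(mentions, 2))
```

A frequency table like this is what would then be joined against an external database of people.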

Using Mathematica to Plot Locations Mentioned in Web Archives

We’ve been using warcbase to extract entities from different domains within the Canadian political parties and interest groups collection. While previously I’ve used the Google Maps API/Many Eyes to quickly visualize these things, this morning I wondered what we could do with Mathematica 10’s (relatively) new geographic visualization services.

The results were promising. Semantic Interpretation, in particular, is pretty good, although I do need to learn to tweak it a bit better. Consider the results here:

Each correctly isolated entity has quite a bit of information – Calgary for example is also attached to the administrative unit of Alberta and the country of Canada.

As a quick and dirty workaround to ambiguity, I used Wolfram|Alpha to grab the latitude and longitude of each point and then map them. The results, pictured here for Conservative.ca, were very promising:

Distribution of locations mentioned in the Conservative Party of Canada’s website from February 2009.

We can see, for example, the relatively high frequency of Calgary (home to the Conservative Party of Canada), and the extremely low frequency of the city of Toronto (Canada’s largest city, albeit not a large base for the party).
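To make the workaround concrete, here is a minimal Python sketch of the same idea: count each extracted place name, attach a latitude and longitude, and hand the result to whatever mapping tool you like. This is an illustration only. The `COORDINATES` table stands in for the Wolfram|Alpha lookup and holds approximate values, and the `frequency_points` helper and sample `mentions` are hypothetical.

```python
from collections import Counter

# Hypothetical lookup table standing in for an external geocoding service.
# Values are approximate (latitude, longitude) pairs.
COORDINATES = {
    "Calgary": (51.05, -114.07),
    "Toronto": (43.65, -79.38),
    "Ottawa": (45.42, -75.70),
}

def frequency_points(locations, coordinates=COORDINATES):
    """Return (place, lat, lon, count) tuples, most frequent first.

    Places missing from the lookup table are skipped rather than guessed.
    """
    counts = Counter(locations)
    points = []
    for place, count in counts.most_common():
        if place in coordinates:
            lat, lon = coordinates[place]
            points.append((place, lat, lon, count))
    return points

# Hypothetical NER location output, one string per mention.
mentions = ["Calgary", "Ottawa", "Calgary", "Toronto", "Calgary", "Ottawa", "Springfield"]
for point in frequency_points(mentions):
    print(point)
```

Skipping names that cannot be resolved is a deliberate choice here: an ambiguous place is better left off the map than plotted in the wrong country.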

We can also zoom in on different sections of the world.

Soft Launch of WebArchives.ca

Web archives have a lot of very useful information in them! As websites disappear every second on the Web, we need to save sites now. Luckily, we’ve been saving sites since 2005: even if they don’t exist on the live web today, we may have them saved for historical research.

This is where WebArchives.ca comes in, which we’ve been “softly” launching this week – a public kicking of the tires (tell your friends about us). This is hopefully the first of many portals that we’ll be putting up on this site, using different research tools. In a nutshell, we provide access to the University of Toronto’s Archive-It Collection of Canadian Political Parties and Political Interest Groups, which they have been collecting since late 2005. For information on what is within this collection, please see the University of Toronto’s page. This site uses the UK Web Archive’s shine interface, which they have made available here.

We can use web archives to see what this page used to contain!

For example, did you know that the Green Party of Canada ran a public blog on their website back in 2008, where anybody could write in? Today, if you try to visit it, you’ll receive a “403 Access Denied” error. Look for yourself: on our “advanced search” page, you can search “harper” and “fascist” with a proximity of “25” to see some provocative posts on this Green Party blog (results here). These are just a few random examples: you can certainly find hundreds more as you begin to explore through our portal.
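If you are wondering what a proximity search does, here is a minimal Python sketch. It treats proximity as “the two words occur within N word positions of each other,” which is a simplifying assumption and not necessarily how the portal’s search backend scores matches. The `tokenize` and `within_proximity` helpers and the sample text are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within_proximity(text, term_a, term_b, max_distance=25):
    """Return True if term_a and term_b occur within max_distance words of each other."""
    a, b = term_a.lower(), term_b.lower()
    last_a = last_b = None  # most recent position of each term seen so far
    for i, token in enumerate(tokenize(text)):
        if token == a:
            last_a = i
            if last_b is not None and i - last_b <= max_distance:
                return True
        if token == b:
            last_b = i
            if last_a is not None and i - last_a <= max_distance:
                return True
    return False

# Hypothetical sample text: "climate" and "trade" are four word positions apart.
sample = "Climate policy and free trade were both debated."
print(within_proximity(sample, "climate", "trade", max_distance=25))
print(within_proximity(sample, "climate", "trade", max_distance=2))
```

Tracking only the most recent position of each term keeps the check to a single pass over the text, which matters when you are scanning millions of pages.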

Relative trends of “recession” and “depression” in our political collection.

With literally millions of pages – there are 14,490,355 “documents” in the archive found here – you sometimes need to pull your gaze back to see how ideas have risen and fallen. For example, we can discover how terms like “depression” and “recession” rose and waned over time, through our trends view. We’ve tentatively found that left-wing groups tended to use the word “depression” more than centrists or right-wingers, who used “recession” more during the economic crisis. There is a treasure trove of stories to be found in these collections, limited only by your imagination.
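One simple way to think about a relative trend is the share of each year’s documents that mention a term. The Python sketch below illustrates that idea. It is an assumption about how such a view could be computed, not a description of how the portal’s trends view actually works, and the `yearly_share` helper and sample `documents` are hypothetical.

```python
from collections import defaultdict

def yearly_share(documents, term):
    """Return {year: fraction of that year's documents containing term}.

    documents is an iterable of (year, text) pairs. Matching is a
    case-insensitive substring test, so "recession" also matches "recessions".
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = term.lower()
    for year, text in documents:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Hypothetical sample corpus of (crawl year, page text) pairs.
documents = [
    (2007, "Growth is steady."),
    (2008, "A recession looms."),
    (2008, "Fear of a depression."),
    (2009, "The recession deepens."),
    (2009, "Recession and recovery."),
]
print(yearly_share(documents, "recession"))
print(yearly_share(documents, "depression"))
```

Dividing by the number of documents per year matters because the size of each crawl varies; raw counts would mostly reflect how much was collected rather than how often a word was used.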

Acknowledgement and Thanks to the Team

This has been a joint production! At Waterloo, I’ve been working with Shawn Dickinson and Danielle McDonald on implementing this portal (I have two other RAs – Dave Hussey and Jeremy Wiebe – who’ve been working on other projects related to digging into the WARCs themselves). Jimmy Lin, newly arrived at Waterloo, has been making it possible for us to index material – using warcbase and the UK Web Archive’s hadoop indexer – in less than a week per collection. At York University, where this server sits, Nick Ruest has been doing the heavy lifting to make this site a (pretty) reality. At Toronto, Nicholas Worby gave us access to these files. At the Internet Archive, Jefferson Bailey got the ball rolling with the Archive-It Research Services and by connecting us to the Toronto folks. Finally, at Western, Bill Turkel is also providing support and, soon, some cool Mathematica hacks.

And – of course – the UK Web Archive got the ball rolling with Shine!

Setting up the Termite Data Server: A New Walkthrough

A Termite Topic Model Visualization of the Green Party’s Website from September 2007.

We’ve been working on various visualizations for our web archives collections. One bottleneck was topic modeling using MALLET, both because of limits on how fast we can get it running and because of the difficulty of making the results usable for the average user.

Termite was one such option. While it has decent documentation, it can be difficult to munge data into it.

Shawn Dickinson, one of my RAs in the Web Archives for Historical Research Group, wrote up some great code that takes a directory of text files and prepares them for Termite.

As with all our walkthroughs, it is available in our GitHub repository as “Setting up Termite Visualizations on OS X”. Feedback is always appreciated, either here or by submitting a pull request.

Web Archive Legal Deposit: A Double-Edged Sword

I couldn’t take pictures of the web archive, but here’s the lineup to enter the archive at 9:30am!

I’ve heard so much about legal deposit in the context of web archiving, and have been enthralled with what it represents: a recognition that born-digital sources are today’s documentary record, the need to preserve it more, and the institutional and legal commitment to make sure that happens. If we’d had non-print legal deposit in 2008, historians might today be studying AOL Hometown, one of the early mass deletions on the Web.

But I knew that legal deposit came with some restrictions. In return for the legal authority for collecting libraries to collect all of this information, they were bound by many of the restrictions placed on print books: on-site consultation only, limitations on reproduction, and a maximum of one person at a time viewing a website.

I wondered how this would all work out, so on my way back from the Web Archives as Scholarly Sources conference in Denmark, I decided to make a quick two-day stop in London. There, I had the opportunity to stop by the UK Web Archive at the British Library. Helen Hockx-Yu, the Head of Web Archiving there, gave me a guided tour of the Web Archive and an opportunity to see both the user-facing interface and their back end. It helped complicate some of my views.

The User Experience: A Mixed Bag

If you want to view the UK’s legal deposit web archive, you need to physically go to one of their six legal deposit libraries: the British Library at King’s Cross in London, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries at Oxford University, the University Library of Cambridge University, or the Library of Trinity College, Dublin. Armed with a reader pass, you can go into one of the reading rooms and sit down at one of their reference terminals.