SSHRC Open Access Policy: It’s a Big Deal, But Don’t Worry – It’ll be OK

The long-awaited Tri-Agency Open Access Policy on Publications arrived today. Just before I went into a meeting, I decided to tweet a quick announcement about it, which I think was the first many Canadian scholars had heard of it!

This move will require all grant recipients funded by the Social Sciences and Humanities Research Council (SSHRC) or its sister agencies the Natural Sciences and Engineering Research Council (NSERC) or the Canadian Institutes of Health Research (CIHR) to make their peer-reviewed journal publications freely accessible within twelve months. I’ve been waiting for this news to break for months: it was apparently supposed to surface back in October 2014 during Open Access Week, but it’s been suspected that the dreadful shootings in Ottawa that month may have delayed it. That’s rumour, though, so don’t put too much stock in that.

Given the response to my little tweet, I thought a blog post might be useful. Bear in mind that this was written in roughly the hour following the announcement, so more details may emerge and I’m sure thoughts will evolve over the next few days and weeks.

I’m personally happy about this move for a number of reasons:

Topic Modeling Web Archive Modularity Classes

This is a brief follow-up to Tuesday’s post. By allowing some recursive downloading, I grabbed quick snapshots from the Wayback Machine of the sites that fell within both the Conservative and Liberal party websites in 2006 and 2014. After converting to text, the Mark I eyeball found some interesting things: economic development for the Conservatives, more social justice websites for the Liberals.

But with web archives, the Mark I eyeball isn’t enough. Topic modelling, however, turned up some interesting results. I’ve pasted some findings before, but here are some highlights:

  • Finding child care plans in the 2006 Liberal modularity class: a perennial promise of that party, this is a good thing to find;
  • Way more emphasis towards Aboriginals in the Conservative sphere from 2006. I’m not quite sure why, but it’s at least an area to dig into further;
  • Current data is very good: the Conservatives care about our economic action plan, and that appears in 2014;
  • The Liberals’ attachment to social movements comes through.

Using Modularity to Find and Explore Web Archived Communities

While playing with my WAT files of Canadian Political Parties, I wondered more about finding community and clusters of websites. Using Gephi’s functions, could we learn something about the websites that cluster around a specific political party?

This wasn’t the most successful experiment, but I think it lays the groundwork for some future explorations with metatext, and perhaps using Gephi’s command line functions to begin automating this sort of analysis. But let me show you what I found.

The initial impetus came from this:
Screen Shot 2015-02-03 at 10.29.09 AM

At left, we see the links that come out of and into the Conservative Party of Canada's website; at right, the links in and out of the Liberal Party of Canada's. Links themselves aren't the best example, but Gephi has a modularity function built in. And in that, there's potential to help us learn more about these massive dumps of data.
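For readers curious about what a modularity score actually measures, here is a minimal sketch in plain Python. It is not Gephi's implementation: Gephi searches for a good partition using the Louvain method, whereas this only scores a partition that is already given, and it treats links as undirected. The function name, the hostnames, and the toy graph are all hypothetical, made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Score a partition of an undirected graph with Newman's modularity Q.

    edges: list of (node, node) pairs, each link listed once.
    community: dict mapping every node to a community label.
    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, k in degree.items():
        degree_sum[community[node]] += k
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Hypothetical toy graph: two tight triangles joined by a single link.
toy_edges = [
    ("a.example", "b.example"), ("b.example", "c.example"), ("a.example", "c.example"),
    ("d.example", "e.example"), ("e.example", "f.example"), ("d.example", "f.example"),
    ("c.example", "d.example"),
]
two_groups = {"a.example": 0, "b.example": 0, "c.example": 0,
              "d.example": 1, "e.example": 1, "f.example": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(toy_edges, two_groups)
q_lumped = modularity(toy_edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the intuition behind using modularity classes to find clusters of sites around each party.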

AHA Talk: The Promise of WebARChive Files

This paper was given at the American Historical Association’s annual meeting in New York City on January 5th, 2015. It was part of the Text Analysis, Visualization, and Historical Interpretation panel. My thanks to my co-presenters and especially Micki Kaufman who organized the panel.

The text that follows may not be exactly what I said, but is based on my speaking notes with a bit of memory filling in here and there.

Hello everybody, I’d like to begin with a somewhat provocative opening:

I believe that historians are unprepared to engage with the quantity of digital sources that will fundamentally transform their trade. Web archives are going to transform the work we do for a few main reasons:

Guided Tour of the Canadian Political Party Web Archive

What on earth is this? Spaghetti and meatballs? Turns out it's full of useful information when we pry it apart.

Those who follow me on Twitter know that I’ve been playing with the WAT file format, and in particular have been undertaking a crash course in Gephi. It’s been really rewarding! It’s already led me to dig into the Wayback Machine and find out that the No Shari’a law campaign transitioned to an anti-Iranian embassy campaign around the same time that it became less relevant within the link structure of the Canadian political sphere.

But in short, this stuff matters because it takes us from a list of files that are opaque and hard to deal with – WARC and WAT files – and into something that we can now work with, and begin to ask research questions about.

If you’ve got a fantastic computer, you can try playing with the PDF export of this. I could make this prettier but I think there’s a limited audience for this.

My sincerest thanks to Micki Kaufman – her help (and willingness to walk me through some of the data wrangling on a screenshare) made this possible.

Some preliminary thoughts on the WAT file below and what we can learn from this sort of analysis. I’m not a YouTube personality, and Gephi can be occasionally persnickety – especially when one foolishly tries to work remotely in the morning on their laptop – but I think it showcases some of the possibilities. More soon.

Using Gephi to Explore Web Archive Structures

The evolution of inbound links, Canadian political movements, 2005-2014.

In my last post, I discussed how I could take WAT files and extract Gephi graphs from them. In this post, I want to show how I’ve moved past that and am now working with dynamic Gephi graphs. Some of the fruits of this can be seen in the animated GIF at left. It’s been two days of learning, which is among my favourite things to do! Again, this is super preliminary: mostly just liveblogging some of the questions that are popping into my head from these files.

To generate these, I did the following (mostly following the Import Dynamic Data tutorial, as well as getting helpful hints from Micki Kaufman):

  • Drew on a Gephi file generated for each year of these collections (thanks to my RA, Jeremy Wiebe);
  • Implemented a number of tweaks: ran a modularity detection algorithm and coloured clusters accordingly, played with the filter for ‘Topology -> In Degree Range’ and generated versions for 2 and 4 limits of in-bound links, and made sure to extract these to a new workbench;
  • Each workbench was then exported as a GEXF file;
  • I then started a new project, and opened each GEXF file in turn: making sure to select ‘time series’ and filling out the box asking for the date (I used the year value).
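The in-degree filtering step above was done by hand in Gephi, but the idea is easy to sketch in plain Python. This is only a rough analogue, assuming the filter keeps nodes at or above a minimum number of distinct inbound links and then keeps only the links between surviving nodes; the function name, hostnames, and per-year data are hypothetical.

```python
from collections import Counter


def filter_by_in_degree(links, min_in_degree):
    """Keep links whose source and target each have enough inbound links.

    links: list of (source, target) pairs for one year's crawl.
    In-degree is counted once over distinct links in the unfiltered graph.
    """
    in_degree = Counter(target for _, target in set(links))
    keep = {node for node, k in in_degree.items() if k >= min_in_degree}
    return [(s, t) for s, t in links if s in keep and t in keep]


# Hypothetical link lists, one per year, mirroring the one-file-per-year setup.
links_by_year = {
    2005: [
        ("a.example", "hub.example"),
        ("b.example", "hub.example"),
        ("hub.example", "a.example"),
        ("c.example", "a.example"),
        ("c.example", "b.example"),
    ],
    2006: [
        ("a.example", "hub.example"),
        ("hub.example", "a.example"),
    ],
}

filtered_by_year = {
    year: filter_by_in_degree(links, 2) for year, links in links_by_year.items()
}
```

Running the same threshold over each year's links gives one filtered graph per year, which is the shape of data the time-series import expects.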

The results have been illuminating, although analysis is obviously still to come. I wish there was a way to export the 'chart' results that one can generate in the 'ranking' section of the workbench, but apparently this doesn't exist.

Accessing Historical Data En Masse

Click to download the slide deck.

Today I’m giving a workshop at the Massachusetts Institute of Technology on “Accessing Historical Data en Masse.” Slides, links, and more details are available on this standalone page. It’s part of a broader seminar on “Research, Teaching, and Digital Humanities” held as part of their World History Seminar.

Hopefully others find it useful. While the slides aren’t a perfect substitute for having me go through the examples – I hope – it’s being videotaped. Perhaps we can make that available?