Historians Love JSON, or One Quick Example of Why it Rocks

I was looking into the Canadiana Discovery Portal API at the behest of a colleague, and while tweeting my excitement at the results, had another Canadian colleague note that he also loves the JSON format. It made me realize that this probably all seems a bit incomprehensible from the outside. So why should a historian like JSON, and what’s so cool about an API like Canadiana’s?

Note: this post runs on the assumption that you’ve read or are open to such things as the Programming Historian.

In a nutshell, JSON is a format that lets you transmit attributes and values. Say I own three iPhones and one iPad (I don’t); a record describing that might look like this:

{
"iphones" : "3",
"ipads" : "1"
}
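
To a program, those attributes and values are immediately usable. Here’s a minimal illustration in Python (the language the Programming Historian teaches), using nothing but the standard json module:

import json

record = '{ "iphones" : "3", "ipads" : "1" }'   # the record above, as a string
data = json.loads(record)                       # now a Python dictionary
print(data["iphones"])                          # prints: 3

No parsing of HTML, no guessing where the numbers sit on the page: you just ask for the attribute you want.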

What Canadiana has done is basically let you grab the data that would normally come as HTML in JSON format instead, which makes it really easy for a computer program that you’re writing to talk to it. So if you’re looking for search results relating to Waterloo, you’d get the usual web page if you did it the normal way, and machine-readable results if you requested JSON format by appending &fmt=json to the URL.
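
For example, here’s a rough sketch of requesting that Waterloo search as JSON with Python’s requests library (the search URL is just an illustration of the pattern; check Canadiana’s documentation for the exact parameters):

import requests

url = "http://search.canadiana.ca/search?q=waterloo&fmt=json"   # &fmt=json asks for JSON, not HTML
results = requests.get(url).json()                              # parsed into Python lists and dictionaries
print(type(results))                                            # inspect what the API actually returns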

I won’t reproduce the documentation, which you can grab here, but to me the most exciting facet of this API is that you can request full-text documents via it. It’s a scraper’s dream.

An Example of What We Can Do With Such Things
Enough abstract details. Here’s what I did today to grab a ton of material relating to a specific query.

Step One: Finding the Keys for Files to Download

I did a search on a topic that I’m interested in. I’ll change the details so it’s about what I like versus what my colleague was looking for. I like trains, so let’s use ‘railroad*’ so we get both railroad and railroads. Let’s further narrow our search to the 19th century.

So if we want to do a search and get results via JSON, this URL would do:

http://search.canadiana.ca/search?df=1800&dt=1900&q=railroad*&fmt=json

There are 2,369 pages of results. Lots of trains.

If we look at the JSON records that are being generated, we see that each record contains a unique key, in the aptly named ‘key’ field. For example, the first result has a key value of oocihm.8_04743. If we build a list of those, we suddenly have the ability to grab full-text records as samples or even, with a bit of refinement to our search, probably as a whole dataset.

To get all of the keys, I wrote a quick script to extract them from all the entries. Remember, there are 2,369 pages of results (which you could scrape from the last entry of each JSON page).

url = "http://search.canadiana.ca/search/" <> ToString[x] <> "?df=1800&dt=1900&q=railroad*&fmt=json"

I then looped x over every value from 1 to 2369 and, on each of those pages, grabbed all the values in the ‘key’ field. I did this using Mathematica, but you could extract the same information using Python. It’d be a good ‘learning to program’ exercise.
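
As a sketch of that Python route, the loop below walks the result pages and collects the ‘key’ field from each record. Where exactly the records sit inside each page of JSON is an assumption here (I’ve guessed a ‘docs’ list), so inspect one real page first and adjust that line:

import time
import requests

keys = []
for x in range(1, 2370):   # the 2,369 pages of results
    url = ("http://search.canadiana.ca/search/" + str(x)
           + "?df=1800&dt=1900&q=railroad*&fmt=json")
    page = requests.get(url).json()
    # 'docs' is a guess at where the individual records live in the response;
    # look at one real page of JSON and adjust this to match Canadiana's layout.
    for record in page.get("docs", []):
        keys.append(record["key"])
    time.sleep(1)          # a pause between requests, to be kind to their servers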

Step Two: Grabbing Some Full Text Records

Now we’ve got a list of keys. Check out this format:

prefix = "http://eco.canadiana.ca/view/";
suffix = "/?r=0&s=1&fmt=json&api_text=1";

We take a key (let’s use a cool entry, oocihm.16278, although we now have several thousand of them), pop it into the middle of that URL so that it reads prefix + key + suffix, and we suddenly have:

http://eco.canadiana.ca/view/oocihm.16278/?r=0&s=1&fmt=json&api_text=1

The api_text=1 tells the system that we want the full text. And then, Bob’s your uncle, you’re getting full text.
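
In Python, the same assembly might look like this (again a sketch; where the full text actually sits inside the returned JSON is something to confirm against a real response):

import requests

prefix = "http://eco.canadiana.ca/view/"
suffix = "/?r=0&s=1&fmt=json&api_text=1"
key = "oocihm.16278"

document = requests.get(prefix + key + suffix).json()
print(type(document))   # poke around in this structure to find the full-text field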

So I can take these two components, the quick script that grabs all those keys and the quick script that fetches the full text for each of those keys, and suddenly start accumulating a full-text database. Since I’m a responsible web consumer and don’t want to tax their servers, I put a pause between each record request, and I’ll probably accumulate these files over a longer period of time.
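
Strung together, the accumulation loop is only a few lines. Here’s one hedged way it could look, saving each raw JSON response to disk under its key and pausing between requests (the two-second pause and the file naming are just my choices):

import time
import requests

prefix = "http://eco.canadiana.ca/view/"
suffix = "/?r=0&s=1&fmt=json&api_text=1"

for key in keys:   # the list of keys gathered in step one
    response = requests.get(prefix + key + suffix)
    with open(key + ".json", "w") as f:   # the key doubles as a file name
        f.write(response.text)
    time.sleep(2)                         # a pause so we don't tax their servers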
