I was looking into the Canadiana Discovery Portal API at the behest of a colleague, and while tweeting my excitement at the results, another Canadian colleague noted that he also loves the JSON format. It made me realize: to most people, this all probably seems a bit incomprehensible. So why should a historian like JSON, and what’s so cool about an API like Canadiana’s?
Note: this post runs on the assumption that you’ve read or are open to such things as the Programming Historian.
In a nutshell, JSON (JavaScript Object Notation) is a text format that lets you transmit attributes and values. Say I own three iPhones and one iPad (I don’t); you might get results that look like this:
{
  "iphones" : "3",
  "ipads" : "1"
}
What Canadiana has done is basically let you grab the data that would normally come as HTML in JSON format instead, which makes it really easy for a computer program that you’re writing to talk to it. So if you’re looking for search results relating to Waterloo, you’d get these results if you did it the normal way – and this way if you requested JSON format by appending &fmt=json to the URL.
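Appending that parameter can be sketched with the standard library. Note the base URL below is a placeholder for whatever search endpoint you’re hitting, and the parameter names other than fmt are assumptions for illustration:

```python
from urllib.parse import urlencode

def search_url(query, page=1):
    """Build a search URL that asks for JSON results.

    The base URL here is a stand-in; swap in the real
    Canadiana search endpoint from the documentation.
    """
    base = "http://search.canadiana.ca/search"
    params = {"q": query, "page": page, "fmt": "json"}
    return base + "?" + urlencode(params)

print(search_url("railroad*"))
```

The key move is simply including fmt=json in the query string; everything else about the request stays the same as a normal browser search.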
I won’t reproduce the documentation, which you can grab here, but to me the most exciting facet of this API is that you can request full-text documents via it. It’s a scraper’s dream.
An Example of What We Can Do With Such Things
Enough abstract details. Here’s what I did today to grab a ton of material relating to a specific query.
Step One: Finding the Keys for Files to Download
I did a search on a topic that I’m interested in. I’ll change the details so it’s about what I like versus what my colleague was looking for. I like trains, so let’s use ‘railroad*’ so we get both railroad and railroads. Let’s further narrow our search to the 19th century.
So if we want to do a search and get results via JSON, this URL would do:
There are 2,369 pages of results. Lots of trains.
If we look at the JSON records that are being generated, we see that each record contains a unique record key, in the aptly named ‘key’ field. For example, the first result has a key value of: oocihm.8_04743. If we build a list of those, we suddenly have the ability to grab full-text records as samples or even, probably if we refine our search a bit, as a whole dataset.
To get all of the keys, I wrote a quick script to extract them from all the entries. Remember, there are 2,369 pages of results (a count you could scrape from the last entry of each JSON page).
I looped a variable x from 1 to 2369 and, on each of those pages, grabbed every value in the ‘key’ field. I did this using Mathematica, but you could extract the same information using Python. It’d be a good ‘learning to program’ exercise.
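As a starting point for that exercise, here is a sketch of the extraction step in Python. I’m assuming a response shape where results sit in a top-level list (called "docs" here) whose entries carry the ‘key’ field — check the API documentation for the actual layout before relying on it:

```python
import json

def extract_keys(page_json):
    """Pull every 'key' value out of one page of search results.

    The "docs" field name is an assumption about the response
    shape, made for illustration.
    """
    page = json.loads(page_json)
    return [doc["key"] for doc in page["docs"]]

# A toy page standing in for one real JSON response:
sample = '{"docs": [{"key": "oocihm.8_04743"}, {"key": "oocihm.16278"}]}'
print(extract_keys(sample))  # → ['oocihm.8_04743', 'oocihm.16278']
```

Run that over pages 1 through 2369 and you’ve built the full key list.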
Step Two: Grabbing Some Full Text Records
Now we’ve got a list of keys. Check out this format:
prefix = "http://eco.canadiana.ca/view/";
suffix = "/?r=0&s=1&fmt=json&api_text=1";
We take the key – let’s use a cool entry, oocihm.16278, although we now have several thousand of them – and pop it into the middle of that URL so it is prefix+key+suffix, and we suddenly have:
api_text=1 tells the system that we want the full text. And then, Bob’s your uncle, you’re getting full text.
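The prefix+key+suffix assembly translates directly into Python. The prefix and suffix strings here are copied straight from the snippet above; the key is the sample record:

```python
# The two URL fragments from the post, joined around a record key.
PREFIX = "http://eco.canadiana.ca/view/"
SUFFIX = "/?r=0&s=1&fmt=json&api_text=1"

def full_text_url(key):
    """Build the full-text JSON URL for one record key."""
    return PREFIX + key + SUFFIX

print(full_text_url("oocihm.16278"))
# → http://eco.canadiana.ca/view/oocihm.16278/?r=0&s=1&fmt=json&api_text=1
```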
So I can then take these two components – the quick script that grabs all those keys, and the quick script that fetches the full text for each of those keys – and suddenly start accumulating a full-text database. Since I’m a responsible web consumer, and I don’t want to tax their servers, I put a pause in between each record request, and will probably accumulate these files over a longer period of time.
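The polite harvesting loop can be sketched like this. The fetch function is passed in as a parameter so the loop itself stays testable; in real use it would be something like urllib.request.urlopen on the full-text URL (an assumption – adapt it to however you fetch pages):

```python
import time

def harvest(keys, fetch, pause=2.0):
    """Fetch the full text for each key, sleeping between requests
    so as not to tax the server."""
    texts = {}
    for key in keys:
        texts[key] = fetch(key)
        time.sleep(pause)  # the polite pause between requests
    return texts

# Demo with a stand-in fetcher instead of a live request:
fake_fetch = lambda key: "full text of " + key
print(harvest(["oocihm.16278"], fake_fetch, pause=0.0))
# → {'oocihm.16278': 'full text of oocihm.16278'}
```

Stretching the pause out (or running the harvest in batches over several days) keeps the load on Canadiana’s servers negligible while the dataset accumulates.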