Extracting and Visualizing Postal codes in the .ca TLD

The results! To find out what these are and how they were generated, read on.

After a great few days at the Working with Internet Archives for Research (WIRE) workshop at Harvard University in Cambridge, I’ve managed to find a few minutes between applying for grants and talking about research to do some tinkering with the 80TB Wide Web Scrape sample I’ve been playing with.

The Postal Code Idea

One interesting thing that the British Library is doing with their domain-wide crawls is to search for British postal codes – this helps librarians decide if a website falls under the provisions of legal deposit if it’s not part of the .uk top-level domain or registered under a British address. They go to ‘contact us’ pages, search for postal codes, and a librarian makes a judgment call.

So I thought – what a fun idea! What would it look like if we did it on my subset of .ca data? We’d be seeing lots of universities, businesses, etc. (the kinds that put their postal code in footers, for example), but as a rough indication it might give us a sense of some economic activity. More importantly, it could be yet another component of a dynamic finding aid. Some blue-sky ideas: (a) could we imagine a visualization where we click on the postal code nodes and see recommended pages? and (b) what would longitudinal data look like? Could we see economic activity flow from, say, Ontario (the traditional industrial heartland of Canada) to the western provinces (the resource-rich area)?

Over two hours or so of tinkering, I got this up and running. Read on to see what I did.

Implementing It

The first step was finding a good regular expression (obligatory xkcd link here). We’ve got a section in our Historian’s Macroscope book that deals with constructing these yourself, but in this case I know there are already lots of postal code regular expressions around. They are typically used to validate the text you’re inputting. Canadians encounter bad regular expressions all the time: some websites expect our postal codes in the format A1A 1A1 (with a space), others A1A1A1 (without a space), and not enough accept both forms.

Without making calls to Canada Post’s address data or the Google GeoCoder API, there’s no quick and free way to validate postal codes at the scale I’m working at. So the following regular expression works as a starting point:

[ABCEGHJKLMNPRSTVXY]{1}[0-9]{1}[A-Z]{1} *[0-9]{1}[A-Z]{1}[0-9]{1}

Anywhere in a string, with or without a space in between, if it fits the basic geographical criteria of a postal code we’re good to go. I’ll sort things out later when I start plotting them.
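To get a feel for what this pattern does and does not catch, here is a small sketch run against a few made-up strings. It spells digits as [0-9], which every grep -E accepts (\d is not part of POSIX extended regular expressions), and drops the {1} quantifiers, which do not change the meaning. The sample lines and the function name extract_codes are illustrative only, not part of the original workflow.

```shell
#!/bin/sh
# Postal code pattern with portable digit classes and no {1} quantifiers.
PATTERN='[ABCEGHJKLMNPRSTVXY][0-9][A-Z] *[0-9][A-Z][0-9]'

# Print every match found on standard input, one per line.
# LC_ALL=C keeps [A-Z] limited to ASCII capital letters.
extract_codes() {
    LC_ALL=C grep -E -o "$PATTERN"
}

# Made-up sample lines: spaced, unspaced, a hex colour, and a non-match.
printf '%s\n' \
    'Contact us: 123 Main St, Ottawa ON K1A 0B1' \
    'Mail to M5V2T6 please' \
    'body { color: #A1B2C3; }' \
    'D1A 1A1 starts with a letter that is not used' \
    | extract_codes
```

The third sample shows why a few hex colours can slip through: A1B2C3 has exactly the letter-digit shape the pattern is looking for.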

On the command line then, in my directory with the .ca data (spread over tons of XML files), I run:

grep -r -E -o "[ABCEGHJKLMNPRSTVXY]{1}[0-9]{1}[A-Z]{1} *[0-9]{1}[A-Z]{1}[0-9]{1}" ./ > postalcodes.txt

This starts extracting the postal codes and saving them in the postalcodes.txt file. It didn’t take too long. Once generated, I had a decent amount of data: 73,569 postal codes (a few of these might be other things, such as hex colours, which we’ll hopefully catch later on). Because the search is recursive, grep prefixes each match with the name of the file it came from, so the next command gets rid of those:

grep -E -o "[ABCEGHJKLMNPRSTVXY]{1}[0-9]{1}[A-Z]{1} *[0-9]{1}[A-Z]{1}[0-9]{1}" postalcodes.txt > postalcodes-flat.txt

This pulls out just the actual postal codes, and the next command gets rid of the spaces so they’re all in the same format:

cat postalcodes-flat.txt | tr -d " \t" > postalcodes-nospace.txt

(I know, I know, it’s probably misusing cat, but it’s working for me)
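For what it’s worth, the three commands can also be collapsed into a single pipeline. This is only a sketch of an alternative, not what I actually ran: it relies on grep’s -h option (suppress file names, available in both GNU and BSD grep), and the function name extract_nospace and the throwaway demo files are made up for illustration.

```shell
#!/bin/sh
# Extract postal codes from every file under a directory, without file
# names (-h) and with spaces and tabs stripped, in one pass.
extract_nospace() {
    LC_ALL=C grep -r -h -E -o \
        '[ABCEGHJKLMNPRSTVXY][0-9][A-Z] *[0-9][A-Z][0-9]' "$1" \
        | tr -d ' \t'
}

# Tiny demonstration on a throwaway directory (contents are made up).
demo=$(mktemp -d)
printf 'Visit us at K1A 0B1\n' > "$demo/a.xml"
printf 'Postal code: M5V2T6\n' > "$demo/b.xml"
extract_nospace "$demo" | sort
rm -rf "$demo"
```

Against the real data this would be something like extract_nospace ./ > postalcodes-nospace.txt, giving the same kind of file as the three-step version.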

Now we have a file that looks like this:

Lots of postal codes.

For ease of use, I saved it as a CSV file.

The next step is to throw all these into a geocoder. There are two options. The first is the Google API, which allows free usage of 2,500 calls a day (documentation here). My dataset only had 7,765 unique postal codes, but even then I’d either have to pay or have patience. I don’t have much patience, so I went with the second option: I turned to Geocoder.ca and purchased some credits (around $20 to start). I sent the CSV in, and about four minutes later I had results.
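As a rough illustration of the numbers involved, here is a sketch of how the unique count could be obtained and how many days the free quota would take. The function names count_unique and days_needed are hypothetical helpers, and the three-line sample input is made up; only the 7,765 and 2,500 figures come from the text above.

```shell
#!/bin/sh
# Count distinct lines (postal codes) on standard input.
# tr strips the padding some wc implementations add.
count_unique() {
    sort -u | wc -l | tr -d ' '
}

# Days needed to geocode $1 codes at $2 calls per day, rounded up.
days_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Made-up sample input with one repeated code.
printf 'K1A0B1\nM5V2T6\nK1A0B1\n' | count_unique

# Figures quoted above: 7,765 unique codes at 2,500 free calls per day.
days_needed 7765 2500
```

With the real file, count_unique < postalcodes-nospace.txt would give the number of distinct codes, and the quota arithmetic comes out to four days of waiting.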

In went the code at left, out came lat long, cities, provinces, etc.

If we were going to do this a lot, I would probably purchase the postal code database from somebody and run local queries. This’ll do at an exploratory stage though. I then tallied and sorted these results by frequency (you could use bash, I used Mathematica as it’s my more native language), and then we’re ready to go.

A few visualization choices presented themselves, but my friend and colleague Jim Clifford, a historical GIS guru at the University of Saskatchewan, suggested I revisit QGIS. I’d tinkered with it a bit about two years ago, but hadn’t returned to it since. After a quick installation, a few steps remained: (a) to add a delimited text layer from a CSV file that had lat/long + frequency; (b) to add a base layer, using Jim’s recommendations from here; and (c) to tinker with various ways to display scale. Given the wide disparity in frequencies, with some postal codes appearing tens of thousands of times, most appearing a few dozen times, and a few appearing only once or twice, I broke the data into five classes and visualized it both by colour and size.
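The tallying step was done in Mathematica, but a shell version is short. The following is a sketch of one way to do it, assuming one space-free postal code per line on input; the function name tally_codes and the sample codes are made up, and joining the counts to the geocoder’s lat/long output is left out because that depends on the format of the returned file.

```shell
#!/bin/sh
# Read one postal code per line; write "postal_code,count" rows,
# most frequent first (ties broken alphabetically).
tally_codes() {
    sort | uniq -c | awk '{ print $2 "," $1 }' | sort -t, -k2,2nr -k1,1
}

# Made-up sample input.
printf 'K1A0B1\nM5V2T6\nK1A0B1\nN2L3G1\nK1A0B1\nM5V2T6\n' | tally_codes
```

Run as tally_codes < postalcodes-nospace.txt > frequencies.csv (the output file name is arbitrary), this gives a two-column CSV that can be matched up with the geocoded coordinates before loading it into QGIS.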

Results

The results by size:

Postal codes as they appeared in the .ca top-level domain sample. Darker and bigger dots indicate postal codes that appear more often (see legend at left).

Zoomed in for Southern Ontario and with place names included:

We see major hubs: Toronto, of course, and Ottawa (not surprising given the number of government offices within the .ca domain). Kitchener-Waterloo punches above its weight due to the technology sector, compared to nearby London (the same size in population). We see the shape of the lake as well.

Download the file here. Raw data available on my GitHub.

Next Steps

This is pretty exciting. Again, the next steps should be:

  1. think about doing this longitudinally. I really do think if you could get data from 1996 – 2014, you’d see dramatic shifts. This is what really excites me about this.

  2. think about this as an augment to a Canadian-specific finding aid. If you’re looking at say, Sudbury, Ontario businesses, then postal codes are a good starting point.

  3. imagine this as a dynamic visualization, somehow bringing you to specific businesses/institutions that are within these different cities.
