Tracking the rise and fall of ideas throughout fifteen million books would have been impossible. Until now, thanks to the Google Books Ngram Viewer. Much like my previous post on Wordle tried to illustrate, we need to make sense of large quantities of information in order to do ‘big history’ and provide a context into which we can write our smaller studies. Ngrams are also awesome for teaching, or just for playing around and having (shock) fun with history.
On the chart at the above right, we see a Google Ngram for two words: ‘nationalize’ in blue, ‘privatize’ in red. Does it surprise you? The idea of privatizing is almost unheard of until the 1970s, really picks up steam by the late 1980s, and peaks in the 1990s. Conversely, nationalize slowly trends upwards until the 1970s, and then declines. This might not be surprising, but it’s an example. In this post, I’ll tell you what an ngram is, show some cool pictures, and hopefully drive you to have some fun with this.
What is an ngram? A dictionary defines it as “a sequence of variable characters that stands for a word or string of words in a corpus.” Blagh. What it really means is looking for PATTERNS within a huge pile of material. If we’re looking for “nationalize,” we’re looking for a unigram. If there are two words in a pattern, it is a “bigram,” three is a “trigram,” and as they get bigger we just call them n-grams. At left is an example of a bigram: “youth delinquency.” Here, we can see an idea that really isn’t discussed until the early 1920s, skyrockets during the Second World War (soldiers were away, and people feared that youngsters were deprived of role models, etc.) and stays fairly high thereafter. Results are normalized against the amount published in each year, so the chart shows percentages rather than raw counts of appearances.
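For the curious, here is a minimal sketch in Python of what counting an n-gram and turning it into a percentage might look like. It is only an illustration of the idea, not how Google actually does it: the function names (ngrams, relative_frequency) and the sample sentence are made up for this example, and splitting on spaces is a much cruder way of finding words than a real corpus would use.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    # Assumption: words are whatever is separated by whitespace.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def relative_frequency(text, phrase):
    """Share of all n-grams (same length as phrase) that match phrase."""
    target = tuple(phrase.lower().split())
    grams = ngrams(text, len(target))
    if not grams:
        return 0.0
    return Counter(grams)[target] / len(grams)


# Made-up sample text standing in for "all the books from one year".
sample = "youth delinquency was feared and youth delinquency was debated"

print(ngrams("we nationalize industries", 2))
print(relative_frequency(sample, "youth delinquency"))
```

Dividing by the total number of n-grams is what makes one year comparable with another: a year with many more books does not automatically score higher.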
I’ve used this in my own teaching. You can play with the “youth delinquency” n-gram and ask students to postulate why there were certain highs and lows, based on their own reading of articles. It forces them to think a bit outside the box but also to come up with questions of their own.
The goal is to find something that maybe YOU CAN’T EXPLAIN YOURSELF – and then come up with something to look into. Let me know in the comments if you can come up with something cool!
1. Change the “corpora” that you’re searching, from American English and British English to books written in French, Hebrew, German, Spanish, Russian, English fiction, etc. This can be easily done with the drop down box above the graph.
2. Smoothing: the examples that I’ve set use a smoothing of three. Basically, if you don’t smooth, you see massive spikes, especially in the early years (Google claims that before the 19th century only about 500,000 books were published). Smoothing is also set with a drop down box. Here is the technical explanation:
Often trends become more apparent when data is viewed as a moving average. A smoothing of 1 means that the data shown for 1950 will be an average of the raw count for 1950 plus 1 value on either side: (“count for 1949” + “count for 1950” + “count for 1951”), divided by 3. So a smoothing of 10 means that 21 values will be averaged: 10 on either side, plus the target value in the center of them.
3. Play with dates: you can begin in 1500 and go all the way to 2000. If you want to look at a particular period, narrow the date range to zoom in on it.
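The smoothing described in point 2 is just a centered moving average, and a rough sketch in Python makes the arithmetic concrete. The function name smooth and the yearly counts are invented for illustration. How the first and last years are handled is also my own assumption: here the window simply shrinks to whatever neighbours exist, which may not match what the Ngram Viewer does at the edges of a date range.

```python
def smooth(counts, smoothing):
    """Centered moving average over a list of yearly values.

    Each value is averaged with up to `smoothing` neighbours on
    either side, so a smoothing of 1 averages 3 values and a
    smoothing of 10 averages 21.
    """
    smoothed = []
    for i in range(len(counts)):
        # Assumption: near the ends, use only the neighbours that exist.
        lo = max(0, i - smoothing)
        hi = min(len(counts), i + smoothing + 1)
        window = counts[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up spiky counts for five consecutive years.
raw = [3, 0, 9, 0, 3]

print(smooth(raw, 0))  # a smoothing of 0 leaves the values as they are
print(smooth(raw, 1))  # the spike in the middle year is flattened
```

With a smoothing of 1, the middle year becomes (0 + 9 + 0) / 3 = 3, which is exactly the kind of spike-flattening that makes the early, sparse years readable.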
One thing to look out for is the “medial s” (the long s, ſ) if you’re dealing with pre-1800 sources. Because it looks so much like an “f,” the scanning software often reads it as one. As William Turkel explained in a visit to York, a certain four letter expletive therefore turns up to a dramatic degree in the pre-1800 period; from around 1800 onwards the same word is correctly read as “suck.” You can look at this chart (warning, explicit language) for a vivid demonstration of what Google and Turkel are explaining.
I’ll leave you with one biggie: a chart looking at the unigrams of Toronto, Montreal, Vancouver, Calgary, Edmonton, Saskatoon, Halifax, and Winnipeg. Here we can see how Toronto really takes off after the early 1960s, reflecting its growth and current economic position. Whether this is a good thing or bad thing I’ll leave to you.
Pretty neat stuff, eh?
Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using Millions of Digitized Books. Science (Published online ahead of print: 12/16/2010)