By Ziquan Wang, Borui Lin, Ian Milligan, and Jimmy Lin
While Americans are busy enjoying their Fourth of July, we Canadians are digging into data… and indeed, we wanted to share some research recently presented at the Web Archives and Digital Libraries workshop.
Shortly after Donald Trump’s inauguration as President of the United States, eagle-eyed observers noted a crucial difference between his webpage and that of his predecessor, President Obama. Whereas Obama’s information page had listed the three branches of the US government (executive, judicial, and legislative), Trump’s page listed only two.
Examples like this made our research team at the University of Waterloo wonder: could we systematically begin to track the changes in discourses, priorities, topics, and beyond between two US Presidential elections, and more so, could we do so on a budget? As I’ve argued elsewhere, web archives are of crucial importance for historians seeking to understand any period after 1996. Yet the scale requires us to turn to digital methods. We cannot go page by page through websites, but rather we need tools to extract the information that we need. Could we “distantly read” websites to notice shifts like observers did in the early days of the Trump administration?
Luckily for us, students had just finished taking Jimmy Lin’s (awesome) Big Data Infrastructure course and wanted to exercise their skills. The amazing Ziquan Wang and Borui Lin joined us and set out to explore shifts between two American presidential administrations.
But first, we needed the data…
To get it, we turned to the Internet Archive’s “End of Term 2016 Post-Inauguration Crawls” collection, and downloaded the last Obama website capture and the first Trump one. The sizes were initially a bit vexing, as Obama’s site was roughly 320 GB of content whereas Trump’s was 13 GB. While surprising, it became clear that the Trump administration largely started from scratch: the Drupal theme was the same, but the content was all brand new, and fairly minimal!
Our sincerest thanks here to the Internet Archive, and in particular Jefferson Bailey and Lori Donovan, who helped make the data possible.
With the downloaded WARC files, we then used Warcbase to extract the plain text of both White Houses. To do so, we used the Altiscale cluster, which has been generously provided to Jimmy’s Big Data Infrastructure course. With the plain text in hand, the goal was to classify the content of the two websites.
For this project, we felt that the traditional approach to topic classification using supervised machine learning wouldn’t work. This sort of approach breaks down as follows:
- Obtain human-annotated training data, i.e. we might tell the computer that the sentence “Humans are warming the planet” belongs to the “CLIMATE” topic.
- Train a text classifier on the human-annotated training data, so it gets a sense of what “CLIMATE” means.
- Apply that text classifier to unseen text and assign topics based on what it has learned.
- And analyze the results.
The problem for us, and for many other digital humanists, is that obtaining human-annotated training data can be expensive. You could annotate it yourself or hire research assistants, but when working with web data at scale there are often more topics that you want to analyze than there are human-annotated datasets. This is a prevalent problem, and we wanted to think of a creative way around it.
Bootstrapping to the rescue!
We decided to take the traditional topic classification approach (listed in the four bullet points above) and pretend that we had human-annotated training data. Instead of obtaining human data, we wanted to find a way to jumpstart the process. The new process might look like the following:
- Obtain annotated training data from some other source, i.e. we might tell the computer that the phrase “Global warming” belongs to the “CLIMATE” topic. We would use lists of keywords and phrases for this.
- Train a text classifier on this hastily compiled training data, so it gets a sense of what “CLIMATE” means.
- Apply that text classifier to unseen text and assign topics based on what it has learned.
- Re-run the classifier using the new sentences, so that it learns what sort of words appear in phrases that have to do with “CLIMATE” beyond our keywords.
- And analyze the results.
In short, keyword matching would be our entryway into supervised machine learning. If a sentence contained one or more keywords, we would assume that it is about a particular topic.
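The bootstrapping steps above can be sketched in a few lines of Python. This is only a toy illustration, not our actual pipeline: the seed keywords, the example sentences, the "OTHER" catch-all label, and the small Naive Bayes classifier are all assumptions made for the sketch, standing in for whichever classifier and data you actually use.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical seed keywords (a tiny stand-in for real expert keyword lists).
SEEDS = {"CLIMATE": ["climate change", "carbon"],
         "HEALTH": ["obamacare", "health insurance"]}

def tokens(sentence):
    return re.findall(r"[a-z0-9']+", sentence.lower())

def seed_label(sentence):
    """Keyword matching: first topic with a keyword hit, else 'OTHER' (an assumption)."""
    text = sentence.lower()
    hits = [t for t in sorted(SEEDS) if any(k in text for k in SEEDS[t])]
    return hits[0] if hits else "OTHER"

def train(labeled):
    """Collect multinomial Naive Bayes counts from (sentence, label) pairs."""
    word_counts, class_counts = defaultdict(Counter), Counter()
    for sentence, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(tokens(sentence))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(model, sentence):
    """Most probable label; 'OTHER' if the sentence shares no word with the training data."""
    word_counts, class_counts, vocab = model
    known = [w for w in tokens(sentence) if w in vocab]
    if not known:
        return "OTHER"
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in sorted(class_counts):
        n = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for w in known:  # add-one smoothing
            score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

def bootstrap(seed_corpus, unseen):
    """Round 1: train on keyword labels. Round 2: retrain with the classifier's own labels."""
    seeded = [(s, seed_label(s)) for s in seed_corpus]
    first = train(seeded)
    guessed = [(s, classify(first, s)) for s in unseen]
    second = train(seeded + guessed)
    return first, second

# Made-up sentences, for illustration only.
CORPUS = [
    "Obamacare, also known as the ACA, expanded health insurance coverage.",
    "The ACA and Obamacare lowered premiums.",
    "Carbon pollution drives climate change.",
    "Climate change threatens coastal towns.",
    "The President visited a school today.",
    "The First Lady hosted a dinner.",
]
UNSEEN = ["ACA enrollment opens soon."]

first, second = bootstrap(CORPUS, UNSEEN)
print(classify(first, "Enrollment deadlines are approaching."))
print(classify(second, "Enrollment deadlines are approaching."))
```

In this toy run, the second-round classifier has picked up a word ("enrollment") that was never in the seed list, which is the whole point of the re-run step.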
We used the keywords found in Emily Gade, John Wilkerson, and Anne Washington’s 2017 piece “The .GOV Internet Archive: A Big Data Resource for Political Science.” Here you can see a few example keywords for each of the four topics we wanted to use.
- Climate: adaptation, alternative energy, anthropoc*, anthropog*, carbon, cfc, clean energy, climate change, climategate, co2, …
- Finance: adjustable-rate mortgage, bailout, bubble, capital requirement, conservatorship, exposure, fannie mae, financial fraud, foreclosure, freddie mac, …
- Health: health care, healthcare, health insurance, pharmaceutical, drug abuse, alcohol abuse, Obamacare, prescription drug, long term care, rehabilitation, …
- Security: 9/11, al-qa*, alien smuggl*, arms prolifer*, arms smuggl*, arms transfer, assassin, atrocit*, authoritarian, ballistic missile, bin laden, …
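Several of these keywords end in a wildcard (`*`). One way to match them against sentences is to turn each list into a regular expression, as in the rough sketch below. The keyword subsets are abbreviated from the lists above, and the matching rules (case-insensitive, whole-word matching for keywords without a `*`) are our assumptions for illustration rather than a description of the original lists’ intended semantics.

```python
import re

# Abbreviated subsets of the keyword lists, for illustration only.
KEYWORDS = {
    "CLIMATE": ["alternative energy", "anthropog*", "carbon", "climate change", "co2"],
    "SECURITY": ["al-qa*", "arms smuggl*", "assassin", "ballistic missile"],
}

def compile_topic(keywords):
    """Build one case-insensitive regex per topic; a trailing '*' matches any word ending."""
    parts = []
    for kw in keywords:
        if kw.endswith("*"):
            parts.append(r"\b" + re.escape(kw[:-1]) + r"\w*")
        else:
            # Assumption: keywords without '*' must match as whole words.
            parts.append(r"\b" + re.escape(kw) + r"\b")
    return re.compile("|".join(parts), re.IGNORECASE)

PATTERNS = {topic: compile_topic(kws) for topic, kws in KEYWORDS.items()}

def match_topics(sentence):
    """Return the sorted list of topics whose keywords appear in the sentence."""
    return sorted(t for t, p in PATTERNS.items() if p.search(sentence))

print(match_topics("Anthropogenic CO2 emissions keep rising."))
print(match_topics("The weather was nice."))
```

A sentence can match more than one topic this way; how to resolve such overlaps is a design choice the sketch leaves open.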
And then we began to dig into the two archives, as a way to see how topics might have shifted.
The Tale of Two White Houses: Why Bootstrapping Worked
Even over the Internet, I can feel your hesitation around this method: aren’t we taking a sophisticated approach like machine learning and reducing it to keyword matching? It turns out, based on our comparisons between the classifiers we bootstrapped, that it’s a good method. Here’s why:
- We started with reasonable, yet noisy, labels. Constructed by subject-matter experts, they’re a good starting point.
- The machine learning algorithm subsequently discovers good correlate features!
- For example, in our lists, “Obamacare” was a keyword, but “ACA” was not. In training data, “Obamacare” and “ACA” co-occur.
- The classifier gets the two together, and the second time you run the classifier, you get a bit more bang for your buck.
- Soon, you have a pretty robust classifier that’s moved beyond the list of keywords you started with.
Our approach was thus to train two classifiers. The first we trained using the list of keywords from Gade, Wilkerson, and Washington on the Obama White House, which we then used to classify both Obama and Trump. The second we trained with the keywords on the Trump White House, which we then used to also classify the two administrations.
Bootstrapping and Biases
Our bootstrapping approach brought its own biases, which were themselves a useful way of thinking about the two administrations. If you train on Obama, for example, you get a topic classifier biased towards the Obama worldview. Our favourite example had to do with climate. If you bootstrap your climate topic on Trump, the classifier will never learn that “man-made” might be a feature of a climate topic, because the term never co-occurs with the climate keywords.
Yet beyond these differences, we were surprised at how well they lined up. Even with two radically different sets of training data (it doesn’t get too much more politically divergent than comparing Obama and Trump!), the topic classifications were pretty similar between the two.
The Results: What topics did we find in both White Houses?
The two tables below show the results of the two White Houses, first the Trump-trained classifier and secondly the Obama-trained classifier.
Note that the percentages (on the y axis) are relatively low. As we only trained four topics, we would expect most of the content not to belong to any of them.
But crucially, we can see that the topic classification approach generally works well! For the Trump White House, both classifiers found the order of topics to be security, then health, then climate, and then finance, and all within a small margin. For Obama’s page, the Trump-trained classifier tended to pick up more health and security topics than the Obama-trained one, but the results were still relatively close.
More importantly for a digital humanist, the raw data is also helpful. One of the key outputs is a list of documents in the form:
URL, Topic, Ratio of relevant sentences to overall sentences, total number of classified sentences
For example, we now have generated lists of websites pertaining to climate change for Obama:
- https://www.whitehouse.gov/blog/2014/11/12/us-and-china-just-announced-important-new-actions-reduce-carbon-pollution CLIMATE (0.7619048, 21)
- https://www.whitehouse.gov/the-press-office/2011/09/02/white-house-officials-hold-background-conference-call-regarding-clean-ai CLIMATE (0.5, 2)
- https://www.whitehouse.gov/blog/2011/04/15/west-wing-week-open-business CLIMATE (0.33333334, 6)
- https://www.whitehouse.gov/blog/2010/03/15/so-you-want-boost-exports-have-i-got-a-program-you CLIMATE (0.3181818, 22)
- https://www.whitehouse.gov/the-press-office/2016/02/10/statement-press-secretary CLIMATE (0.4, 5)
- https://www.whitehouse.gov/the-press-office/2015/11/16/new-report-provides-authoritative-assessment-national-regional-impacts CLIMATE (0.45945945, 37)
- https://www.whitehouse.gov/blog/2011/10/07/link-between-american-energy-and-prosperity CLIMATE (0.3809524, 21)
Or security for Trump:
- https://www.whitehouse.gov/trump-stands-with-israel SECURITY (0.33333334, 3)
- https://www.whitehouse.gov/the-press-office/2017/02/17/readout-presidents-call-president-beji-caid-essebsi-tunisia SECURITY (0.75, 4)
- https://www.whitehouse.gov/the-press-office/2017/01/27/executive-order-protecting-nation-foreign-terrorist-entry-united-states SECURITY (0.38271606, 81)
- https://www.whitehouse.gov/the-press-office/2017/02/13/readout-vice-presidents-call-president-michel-temer-brazil SECURITY (0.33333334, 3)
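Records like the ones above can be derived from sentence-level labels with a small aggregation step. The sketch below is one plausible reading of the output format, not our actual code: it assumes each document arrives as a list of per-sentence labels, that sentences with no topic are labelled "OTHER", and that the final number is the document’s total sentence count. The URL and labels are made up.

```python
from collections import Counter

def summarize(url, labels):
    """Return (url, topic, ratio, n_sentences) for the document's most frequent topic.

    `labels` holds one label per sentence; "OTHER" marks sentences with no topic
    (an assumed convention). Returns None if no sentence has a topic.
    """
    topical = Counter(label for label in labels if label != "OTHER")
    if not topical:
        return None
    topic, hits = topical.most_common(1)[0]
    return (url, topic, hits / len(labels), len(labels))

# Hypothetical document: 16 of 21 sentences classified as CLIMATE.
record = summarize("https://example.org/some-page", ["CLIMATE"] * 16 + ["OTHER"] * 5)
print(record)
```

Sorting such records by ratio, or filtering on a minimum sentence count, then gives a ranked reading list for each topic.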
And the sentence-level data is fascinating. For example, the top-ranked security sentence for Trump is the following (chilling) one:
And I think some of the announcements today indicate the fruits of that effort , which is that the President has pushed his national security team to determine if additional travel restrictions could be put in place that would make the American public more safe. (SECURITY)
Or this highly ranked one from Obama on climate, which is a reminder of a different approach to climate change:
The President committed to lead international efforts to address climate change : Secured the Paris Agreement , where more than 190 countries agreed to a framework for global action on climate change . (CLIMATE)
Using the arresting example of Obama and Trump, we’re advancing here a new method of topic classification. It holds several advantages, especially for digital humanists:
- Topic modelling is another approach to topic classification. Yet it can be hard to sell to peer reviewers or even colleagues. LDA is a bit of a black box for a humanist or social scientist.
- I can explain this method and understand it: it melds the “bag of words” approach that digital humanists are familiar with and simple machine learning algorithms.
There is no black box, at least to me. That doesn’t mean there aren’t disadvantages too: you need a “bag of words,” either from a subject-matter expert or your own research team, and perhaps more importantly, the biases of the classifier are inscribed. In our case, the Trump-trained classifier was quite different from the Obama-trained one.
Overall, however, we see this as an implementable approach to bootstrapped topic understanding. It offers an easy-to-understand and cheap workflow, so we can begin to understand the digital world around us.
Our sincerest thanks again to the Internet Archive and Altiscale for making this research possible!