Nick Ruest and I have a piece that has just come out in the Code4Lib Journal. The article takes readers through (a) why Twitter matters for event archiving and future historical research; (b) how you can collect the data yourself; and (c) how you can analyze it. You can read the abstract below, and check out the article here!
As always, we hope you enjoy reading it, and if you have any comments, questions, or anything else, we are always happy to hear from you.
Abstract follows after the fold.
This article examines the tools, approaches, collaboration, and findings of the Web Archives for Historical Research Group around the capture and analysis of about 4 million tweets during the 2015 Canadian Federal Election. We hope that national libraries and other heritage institutions will find our model useful as they consider how to capture, preserve, and analyze ongoing events using Twitter.
While Twitter is not a representative sample of broader society – Pew Research Center's study of US users shows that it skews young, college-educated, and affluent (household incomes above $50,000) – Twitter still represents an exponential increase in the amount of information generated, retained, and preserved from ‘everyday’ people. Therefore, when historians study the 2015 federal election, Twitter will be a prime source.
On August 3, 2015, the team initiated both a Search API and Stream API collection with twarc, a tool developed by Ed Summers, using the hashtag #elxn42. The hashtag referred to the election being Canada’s 42nd general federal election (hence ‘election 42’ or elxn42). Data collection ceased on November 5, 2015, the day after Justin Trudeau was sworn in as the 42nd Prime Minister of Canada. We collected for a total of 102 days, 13 hours and 50 minutes.
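twarc handles the authentication and API calls itself; as a rough illustration of what tracking a hashtag like #elxn42 means, here is a minimal sketch that selects matching tweets from a list of tweet JSON objects. The `mentions_hashtag` helper is hypothetical (not part of twarc); field names follow Twitter's v1.1 JSON, where hashtags live under `entities.hashtags`.

```python
# Hypothetical helper sketching the hashtag matching that a tracked
# Stream/Search API query performs; not part of twarc itself.
def mentions_hashtag(tweet, tag="elxn42"):
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return any(h.get("text", "").lower() == tag.lower() for h in hashtags)

stream = [
    {"id_str": "1", "entities": {"hashtags": [{"text": "elxn42"}]}},
    {"id_str": "2", "entities": {"hashtags": [{"text": "cdnpoli"}]}},
]
kept = [t["id_str"] for t in stream if mentions_hashtag(t)]
```

In the real collection, twarc wrote each matching tweet as one JSON object per line, which is what the analysis tools below consume.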
To analyze the data set, we took advantage of a number of command line tools and utilities available within twarc, twarc-report, and jq. In accordance with the Twitter Developer Agreement & Policy, and after ethical deliberations discussed below, we made the tweet IDs and other derivative data available in a data repository. This allows other people to use our dataset, cite our dataset, and enhance their own research projects by drawing on #elxn42 tweets.
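Sharing tweet IDs rather than full tweets is often called "dehydration"; others can rehydrate the IDs back into full tweets (for example, with twarc's hydrate command). A minimal sketch, assuming line-delimited tweet JSON as twarc writes it:

```python
import json

def dehydrate(jsonl_lines):
    # Reduce full tweet JSON to bare tweet IDs -- the derivative that the
    # Twitter Developer Agreement permits sharing in a data repository.
    for line in jsonl_lines:
        yield json.loads(line)["id_str"]

sample = ['{"id_str": "661783967338876928", "text": "#elxn42 ..."}']
ids = list(dehydrate(sample))
```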
Our analytics included:
- breaking tweet text down by day to track change over time;
- client analysis, allowing us to see how the prevalence of mobile devices shaped interactions with the medium;
- URL analysis, comparing both to Archive-It collections and the Wayback Availability API to add to our understanding of crawl completeness;
- and image analysis, using an archive of extracted images.
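As one concrete example of the first analytic, here is a minimal sketch of a per-day tweet count. It assumes line-delimited tweet JSON with a v1.1-style `created_at` field (e.g. "Mon Oct 19 01:00:00 +0000 2015"); the `tweets_per_day` function is illustrative, not taken from twarc-report.

```python
import json
from collections import Counter

def tweets_per_day(jsonl_lines):
    # Group tweets on the date portion of "created_at" to track
    # change in volume over time.
    counts = Counter()
    for line in jsonl_lines:
        parts = json.loads(line)["created_at"].split()
        # parts: [weekday, month, day, time, utc_offset, year]
        counts[f"{parts[5]}-{parts[1]}-{parts[2]}"] += 1
    return counts

sample = [
    '{"created_at": "Mon Oct 19 01:00:00 +0000 2015"}',
    '{"created_at": "Mon Oct 19 02:30:00 +0000 2015"}',
    '{"created_at": "Tue Oct 20 09:00:00 +0000 2015"}',
]
by_day = tweets_per_day(sample)
```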
Our article introduces our collecting work, our ethical considerations, and the analysis we have done, and provides a framework for other collecting institutions to do similar work using off-the-shelf, open-source tools. We conclude by ruminating on how to connect Twitter archiving with a broader web archiving strategy.