With Nick Ruest and William Turkel, I’ve been exploring the tweets of the 42nd Canadian federal election that used the #elxn42 hashtag. Having also been part of the team that launched WebArchives.ca during the federal election, I wondered: what would a web archive created using the tweets of users look like compared to a formal collection, curated by a subject librarian? And secondly, how much of it would be in the Wayback Machine more generally?
The Canadian Political Parties and Political Interest Groups (CPP) collection consists of fifty domains, which you can see here. Our #elxn42 collection contains over a million and a half tweeted URLs, of which 263,708 were unique. How many of those 263,708 URLs could or will end up in the CPP collection?
To find out, we compared the two lists. I created one list of all of the CPP domains (a text file, one entry per line, such as ndp.ca, etc.), and one list of all of the tweeted URLs (such as http://www.votetogether.ca/, etc.). The useful command grep -wFf Webarchives-domains.md elxn42-tweets-urls-uniq.txt > intersections.txt helped compare the two, listing only the links in our Twitter collection that contained a domain from the former list.
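For anyone who would rather do the comparison in a script, here is a minimal Python sketch of the same idea. It is not the method used for the numbers in this post: where grep -F matches the domain string anywhere in the line (so a URL that merely mentions a domain in its path or query string is also counted), this version compares only the host part of each URL. The function names are illustrative, and it assumes the domain file holds bare domains such as ndp.ca, one per line.

```python
from urllib.parse import urlsplit


def load_lines(path):
    """Read a text file with one entry per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def host_of(url):
    """Return the lower-cased host of a URL, or '' if it cannot be parsed."""
    try:
        return (urlsplit(url).hostname or "").lower()
    except ValueError:
        return ""


def in_collection(url, domains):
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in domains)


def intersect(urls, domains):
    """Keep only the URLs whose host falls under one of the domains."""
    domains = {d.lower() for d in domains}
    return [u for u in urls if in_collection(u, domains)]
```

With the two files from the grep command, intersect(load_lines("elxn42-tweets-urls-uniq.txt"), load_lines("Webarchives-domains.md")) would return the matching URLs. Because the matching rule is stricter than grep's, the count it produces may differ somewhat from the figure reported here.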
What did we find?
46,778, or 17.7%, of the URLs tweeted during #elxn42 could have been archived as part of the CPP collection. That means that 82.3% of URLs would not be part of the CPP collection.
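The percentages follow directly from the two counts above. As a quick check, with variable names that are mine rather than anything from the original workflow:

```python
unique_urls = 263_708   # unique URLs tweeted with #elxn42
in_cpp_scope = 46_778   # URLs that matched a CPP domain

share_in = 100 * in_cpp_scope / unique_urls
share_out = 100 - share_in

print(f"{share_in:.1f}% in scope, {share_out:.1f}% out of scope")
# prints "17.7% in scope, 82.3% out of scope"
```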
What does this mean?
It raises some questions:
- Should we be using Twitter more as a seed list for event-based archiving, similar to what Ed Summers proposed around, say, the events of Ferguson, MO, or during natural disasters?
- Should we be publicizing this more around WebArchives.ca, highlighting the gaps you would find in a more formal web archive like it?
- What would similar work find?
What about the Wayback Machine more generally?
One pressing research question is, of course, how many URLs end up in the Wayback Machine more generally. It is still early days – we’ll have to check back in later, because many URLs just wouldn’t have been crawled yet – but of those 263,708 URLs, 221,491 were not found using the Wayback Machine Availability API. That’s 83.99% of all URLs tweeted during #elxn42 that you can’t find in the Wayback Machine.
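To make the check concrete, here is a rough Python sketch of how a single URL could be looked up with the Wayback Machine Availability API. It is an illustration rather than the script behind the figures above. It assumes the publicly documented endpoint at https://archive.org/wayback/available and a JSON response in which a captured URL appears under archived_snapshots.closest; the function names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine Availability API.
API = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the API request URL for one tweeted URL."""
    return API + "?" + urlencode({"url": url})


def is_archived(payload):
    """Interpret a decoded API response: True if a snapshot is reported."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def check(url, timeout=30):
    """Ask the API about one URL (needs network access)."""
    with urlopen(availability_query(url), timeout=timeout) as response:
        return is_archived(json.load(response))
```

Calling check("http://www.votetogether.ca/") returns True or False for that one URL. Running it over a quarter of a million URLs would need pauses between requests and some error handling, both of which are left out here.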