Institutional vs. Twitter Seed Lists for Web Archives

What if this collection was created by #elxn42 Twitter users – how different would it be?

With Nick Ruest and William Turkel, I’ve been exploring the tweets of the 42nd Canadian federal election that used the #elxn42 hashtag. Having also been part of the team that launched WebArchives.ca during the federal election, I wondered: what would a web archive created using the tweets of users look like compared to a formal collection, curated by a subject librarian? And secondly, how much of it would be in the Wayback Machine more generally?

The Canadian Political Parties and Political Interest Groups collection consists of fifty domains, which you can see here. Our #elxn42 collection contains over a million and a half tweeted URLs, of which 263,708 were unique. How many of those 263,708 URLs could or will end up in the Canadian Political Parties collection?

To find out, we compared the two lists. I created one list of all of the CPP domains (a text file, one entry per line, such as liberal.ca, ndp.ca, etc.), and one list of all of the tweeted URLs (such as http://conservative.ca/, http://www.votetogether.ca/, etc.). The useful command grep -wFf Webarchives-domains.md elxn42-tweets-urls-uniq.txt > intersections.txt helped compare the two: -f reads the patterns from the domain file, -F treats each one as a fixed string rather than a regular expression, and -w requires a whole-word match. The output lists only those links in our Twitter collection that contain one of the domains from the former.
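For readers who would rather do the comparison in a script, here is a minimal Python sketch. It is not the method used for the numbers in this post: instead of grep's whole-word string match anywhere in the line, it parses each URL and compares only the host against the seed domains, so a URL such as http://example.com/liberal.ca would not count. The function names (load_lines, host_matches, intersect) are made up for illustration.

```python
from urllib.parse import urlsplit


def load_lines(path):
    """Read a text file with one entry per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def host_matches(url, domains):
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def intersect(urls, domains):
    """Keep only the URLs whose host falls under one of the seed domains."""
    wanted = {d.lower() for d in domains}
    return [u for u in urls if host_matches(u, wanted)]
```

Calling intersect(load_lines("elxn42-tweets-urls-uniq.txt"), load_lines("Webarchives-domains.md")) would give a list comparable to intersections.txt, although the counts may differ slightly from grep's because the matching rule is stricter.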

What did we find?

46,778, or 17.7%, of the URLs tweeted during #elxn42 could have been archived as part of the CPP collection. That means that 82.3% of URLs would not be part of the CPP collection.

What does this mean?

It raises some questions:

  • Should we be using Twitter more as a seed list for event-based archiving, similar to what Ed Summers proposed around, say, the events of Ferguson, MO or during natural disasters?
  • Should we be publicizing this more around WebArchives.ca, highlighting the gaps that you would find in a more formal web archive like that?
  • What would similar work find?

What about the Wayback Machine more generally?

One pressing research question is, of course, how many URLs end up in the Wayback Machine more generally. It is still early days – we’ll have to check back in later, because many URLs just wouldn’t have been crawled yet – but of those 263,708 URLs, 221,491 were not found using the Wayback Machine Availability API. That’s 83.99% of all URLs tweeted during #elxn42 that you can’t find in the Wayback Machine.
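As a rough illustration of that kind of check, the sketch below queries the Wayback Machine Availability API for a single URL and reports whether a snapshot came back. It is an assumption about how such a check could be scripted, not the code used for the figures above. The response shape (an archived_snapshots object holding a closest entry with available, url, timestamp and status fields) is as I understand the public API; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public endpoint of the Wayback Machine Availability API.
API_ENDPOINT = "https://archive.org/wayback/available"


def closest_snapshot(payload):
    """Return the 'closest' snapshot dict from a parsed response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def check_url(url, timeout=30):
    """Ask the Availability API about one URL (makes a network request)."""
    query = urlencode({"url": url})
    with urlopen(f"{API_ENDPOINT}?{query}", timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

A URL counts as "not found" when check_url returns None. Running this over all 263,708 URLs would need pauses between requests and some retry handling, which the sketch leaves out.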

2 thoughts on “Institutional vs. Twitter Seed Lists for Web Archives”

  1. Michael Neubert (@NeubertMichael) says:

    Interesting!

    * When you say, “in the Wayback Machine” I understand that to mean either in Archive.org or in any publicly accessible Archive-It collection. Is that correct?

    * When you ask, “How many of those 263,708 URLs could or will end up in the Canadian Political Parties collection?” can I assume that you would consider the answer to be “yes” if a Tweet was for XYZ.com/someThing/someOtherThing/target.pdf and the CPP included just XYZ.com as a seed?

    * Tweets seem more likely to be at the level of a page or document than at the level of an organization’s web site – just harvesting Tweets would result in bits and pieces of some sites that otherwise you would have in your Archive-It collection as complete archived sites (at least in theory complete).

    I could see for a political campaign collection supplementing the CPP approach with tweet-driven harvesting but not replacing it.

  2. Ian Milligan says:

    Thanks for your comments, Michael! Just in response to your questions:

    – At this point, actually just the main Wayback Machine and the Canadian Political Parties (CPP) Archive-It collection. As far as I know the Wayback Availability API only covers main wayback, not Archive-It.
    – Yes, in the case of the CPP we’re just going off top-level domain availability – so yes, if CPP included XYZ.com we’d assume XYZ.com/someThing/someOtherThing/target.pdf could end up there.
    – Agreed that it’d be a supplemental approach, as it’d be too scattered as mentioned. Could be neat to collect two archives for comparison purposes, just to see how they stood up next to each other.
