Soft Launch of WebArchives.ca

Web archives have a lot of very useful information in them! As websites disappear every second on the Web, we need to save sites now. Luckily, we’ve been saving sites since 2005: even if they don’t exist on the live web today, we may have them saved for historical research.

This is where WebArchives.ca comes in, which we’ve been “softly” launching this week – a public kicking of the tires (tell your friends about us). This is hopefully the first of many portals that we’ll be putting up on this site, using different research tools. In a nutshell, we provide access to the University of Toronto’s Archive-It Collection of Canadian Political Parties and Political Interest Groups, which they have been collecting since late 2005. For information on what is within this collection, please see the University of Toronto’s page. This site uses the UK Web Archive’s shine interface, which they have made available here.

We can use web archives to see what this page used to contain!

For example, did you know that the Green Party of Canada ran a public blog on their website back in 2008, where anybody could write in? Today, if you try to visit it, you’ll receive a “403 Access Denied” error. Look for yourself: on our “advanced search” page, you can search “harper” and “fascist” with a proximity of “25” to see some provocative posts on this Green Party blog (results here). These are just a few random examples: you can certainly find hundreds more as you begin to explore through our portal.
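To make the idea of a proximity search concrete, here is a minimal sketch in Python. It is not how the portal's search index does it; it only illustrates what "two terms within 25 of each other" means, assuming that distance is counted in word tokens. The function name `within_proximity` is made up for this illustration.

```python
import re


def within_proximity(text, term_a, term_b, max_distance=25):
    """Return True if term_a and term_b occur within max_distance tokens.

    Assumption: distance is measured in word tokens, case-insensitively.
    A real search index may tokenize and count positions differently.
    """
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a.lower()]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b.lower()]
    # Compare every occurrence of one term against every occurrence of the other.
    return any(abs(a - b) <= max_distance
               for a in positions_a
               for b in positions_b)
```

A post that mentions both words in the same sentence would match; one that mentions them in the first and last paragraphs of a long page would not.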

Relative trends of “recession” and “depression” in our political collection.

With literally millions of pages – there are 14,490,355 “documents” in the archive found here – you sometimes need to pull your gaze back to see how ideas have risen and fallen. For example, we can discover how terms like “depression” and “recession” waned and rose over time, through our trends view. We’ve tentatively found that left-wing groups tended to use the word “depression” more than centrist or right-wing groups, who used “recession” more during the economic crisis. There is a treasure trove of stories to be found in these collections, limited only by your imagination.
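For readers curious about what sits behind a "relative trends" chart, here is a rough sketch of one way to compute it: for each year, the share of documents that mention a term. This is an illustration under assumptions, not the portal's actual code. The `(year, text)` input shape and the function name `relative_trends` are hypothetical, and matching is a plain case-insensitive substring test.

```python
from collections import Counter, defaultdict


def relative_trends(documents, terms):
    """Compute, per term and year, the fraction of documents mentioning the term.

    documents: iterable of (year, text) pairs (an assumed input shape).
    terms: list of terms; matching is a case-insensitive substring test.
    """
    totals = Counter()            # number of documents seen per year
    hits = defaultdict(Counter)   # term -> year -> documents mentioning it
    for year, text in documents:
        totals[year] += 1
        lowered = text.lower()
        for term in terms:
            if term.lower() in lowered:
                hits[term][year] += 1
    return {
        term: {year: hits[term][year] / totals[year] for year in sorted(totals)}
        for term in terms
    }
```

Dividing by the number of documents per year is what makes the trend "relative": a year with a much bigger crawl does not automatically look like a spike.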

Acknowledgement and Thanks to the Team

This has been a joint production! At Waterloo, I’ve been working with Shawn Dickinson and Danielle McDonald on implementing this portal (I have two other RAs – Dave Hussey and Jeremy Wiebe – who’ve been working on other projects related to digging into the WARCs themselves). Jimmy Lin, newly arriving at Waterloo, has been making it possible for us to index material – using warcbase and the UK Web Archive’s hadoop indexer – in less than a week per collection. At York University, where this server sits, Nick Ruest has been doing the heavy lifting to make this site a (pretty) reality. At Toronto, Nicholas Worby gave us access to these files. At the Internet Archive, Jefferson Bailey got the ball rolling with the Archive-It Research Services and connecting us to the Toronto folks. Finally, at Western, Bill Turkel’s also providing support and soon some cool Mathematica hacks.

And – of course – the UK Web Archive got the ball rolling with Shine!

Setting up the Termite Data Server: A New Walkthrough

A Termite Topic Model Visualization of the Green Party’s Website from September 2007.

We’ve been working on various visualizations for our web archives collections. One bottleneck was topic modeling using MALLET, both because of limits on how fast we can get it running and because of the difficulty of making the results usable for the average user.

Termite was one such option. While it has decent documentation, it can be difficult to munge data into it.

Shawn Dickinson, one of my RAs in the Web Archives for Historical Research Group, wrote up some great code that takes a directory of text files and prepares them for Termite.

As with all our walkthroughs, it is available in our GitHub repository as “Setting up Termite Visualizations on OS X”. Feedback is always appreciated, either here or by submitting a pull request.
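To give a flavour of what "preparing a directory of text files" can involve, here is a minimal Python sketch. It is not Shawn's code. It assumes the target is a tab-delimited corpus file with one document per line (document ID, a tab, then the text), which is my understanding of what Termite expects; please check the walkthrough and the Termite documentation before relying on it. The function name `build_corpus_file` is hypothetical.

```python
from pathlib import Path


def build_corpus_file(text_dir, output_path):
    """Flatten every .txt file in text_dir into one tab-delimited line.

    Assumed output format (verify against the Termite docs):
        <document id> TAB <document text on a single line>
    The file name without its extension is used as the document id.
    Returns the number of documents written.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(text_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            # Collapse newlines, tabs, and repeated spaces so that each
            # document occupies exactly one line with a single tab separator.
            flattened = " ".join(text.split())
            if not flattened:
                continue  # skip empty files
            out.write(f"{path.stem}\t{flattened}\n")
            count += 1
    return count
```

Collapsing whitespace matters here: a stray tab or newline inside a document would otherwise be read as a field or record boundary.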

Web Archive Legal Deposit: A Double-Edged Sword

I couldn't take pictures of the web archive, but here's the lineup to enter the archive at 9:30am!

I’ve heard so much about legal deposit in the context of web archiving, and have been enthralled with what it represents: a recognition that born-digital sources are today’s documentary record, the need to preserve it more, and the institutional and legal commitment to make sure that happens. If we’d had non-print legal deposit in 2008, historians might today be studying AOL Hometown, one of the early mass deletions on the Web.

But I knew that legal deposit came with some restrictions. In return for the legal authority for collecting libraries to collect all of this information, they were bound by many of the restrictions placed on print books: on-site consultation only, limitations on reproduction, and a maximum of one person at a time viewing a website.

I wondered how this would all work out, so on my way back from the Web Archives as Scholarly Sources conference in Denmark, I decided to make a quick two-day stop in London. There, I had the opportunity to stop by the UK Web Archive at the British Library. Helen Hockx-Yu, the Head of Web Archiving there, gave me a guided tour of the Web Archive and an opportunity to see both the user-facing interface as well as their back end. It helped complicate some of my views.

The User Experience: A Mixed Bag

If you want to view the UK’s legal deposit web archive, you need to physically go to one of their six legal deposit libraries: the British Library at King’s Cross in London, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries at Oxford University, the University Library of Cambridge University, or the Library of Trinity College, Dublin. Armed with a reader pass, you can go into one of the reading rooms and sit down at one of their reference terminals.

All Good Things…. (‘Retiring’ from ActiveHistory.ca)

After six years as a co-editor with Active History, a website that I helped build with a group of great colleagues back in 2009, the time has come to take a back seat there. Given the amount of time I’ve put into the site, it seemed fitting to close my time there off with a quick post.

Active History meant different things to each of us, but for me it became the provision of a publishing platform for people who wouldn’t otherwise feel comfortable spreading their thoughts on the Web. I’ve jokingly called it a Medium for historians, before Medium existed: people who wanted to blog, but not frequently enough to have their own webpage.* While it grew out of an initial impetus to reach a general audience during the 2008-09 financial crisis, it’s turned into a diverse and extensive group blog. In a few years, it’ll be neat to look back, and wonder how many sites we spawned out of the site – there’s certainly a trend that I’ve seen where somebody blogs with us, and then begins cross-posting to their own site, until they happily part ways to their own site.

Why leave now? First, it’s best to leave things when they’re on top, which Active History certainly is based on our initial goals: every day hundreds, occasionally thousands, of people come visit the new post. Most of our visits now come to our extensive collection of old posts, papers, podcasts, and reviews. Second, six years is a long time. That’s the length of the European Second World War! I’m a different person, with different interests, than I was six years ago. As I go into my first sabbatical, I’ve begun to seriously consider where my energies are best spent (right now it’s trying to lower the bar for working with web archives, which I think will be my generation’s moon landing)**.

Thinking back to our first meeting at York University, where we set up this website, I couldn’t be happier about the experience. It took me to a THATCamp, introduced me to the digital humanities, let me think more about the profession, informed me about scholarly dissemination, and gave me something to be proud of whenever I was able to explain the project. I certainly didn’t agree with every post we ran, and at times wished we could have attracted a more diverse range of political perspectives, but I think we provided a great representation of mainstream thought within the Canadian historical profession.

Thanks, Active History. It’s been a heck of a run!

* Alternatively, this may suggest that I still haven’t fully figured out what Medium is.
** Not really. But you can’t study the 1990s without ’em. ;)

Have Web Collections? Want Link and Text Analysis? Check out the Warcbase Wiki

The Warcbase wiki in action!

The Web Archives for Historical Research Group has been busy: working on getting the Shine front end running on Archive-It collections (a soft launch is underway here if you want to play with old Canadian websites), setting up Warcbase on our collections, and digging manually through the GeoCities torrent for close readings of various neighbourhoods.

One collaboration has been really fruitful. Working with Jimmy Lin, a computer scientist who has just joined the University of Waterloo’s David Cheriton School of Computer Science, we’ve been working on scripts, workflows, and implementations of his warcbase platform. Visit the warcbase wiki here. Interdisciplinary collaboration is amazing!

I imagine that humanists or social scientists who want to use web archives are often in the same position I was in four years ago: confronted with opaque ARC and WARC files, downloading them onto their computers, and not really knowing what to do with them (apart from maybe unzipping them and exploring them manually). Our goal is to change that: to provide easy-to-follow walkthroughs that let users do the basic things to get started:

  • Link visualizations to explore networks, finding central hubs, communities, and so forth;

    A dynamic visualization generated with warcbase and Gephi

  • Textual analysis to extract specific text, facilitating subsequent topic modelling, entity extraction, keyword search, and close reading;
  • Overall statistics to find over- and under-represented domains, platforms, or content types;
  • And basic n-gram-style navigation to monitor and explore change over time.

All of this is relatively easy for web archive experts to do, but still difficult for end users.

The Warcbase wiki, still under development, aims to fix that. Please visit, comment, fork, and we hope to develop it alongside all of you.

Announcing my Ontario Early Researcher Award: Web Archives for Historical Research

I’ve been sitting on this good news for a few months now, but the official word is out: the Ontario Ministry of Research and Innovation has funded my web archive project with an Ontario Early Researcher Award. These grants are designed to help early career researchers build up research teams by hiring graduate students, postdoctoral fellows, and research associates – all things that I’m hoping to do over the next five years (we also received some complementary funding that will be announced in due course). It gives me $150,000 over the next five years to begin building the Web Archives for Historical Research Group.

Since May 2015, I’ve been able to hire three research assistants with this line: Jeremy Wiebe, a PhD candidate; as well as MA candidates Shawn Dickinson and Danielle McDonald. David Hussey, my MA student who’s been working on a digital history of the Canadian video games industry, has also been working on the project as part of some complementary funding. Their profiles are available here, along with Nick Ruest and Bill Turkel who are joining me as affiliate faculty for a broader, separate grant.

The University of Waterloo announcement is here, and I think it does a good job explaining what the project does. I wanted to really thank the amazing folks at the University of Waterloo’s Office of Research, the Arts Research Office, and the Department of History for helping with the groundwork that made this possible. UW truly has offered amazing resources to get my project off the ground.

Creating Link Graphs with Warcbase

I was at the Columbia Web Archiving Collaboration: New Tools and Models conference this Thursday and Friday, and gave a quick demo. Here’s a bit more detail on it.

If you install Warcbase using this handy guide for OS X and follow the scripts, you will eventually come up with a data file that looks a bit like this.

200510	acq.osd.mil	acq.osd.mil	96
200510	acq.osd.mil	akss.dau.mil	12
200510	agoracosmopolite.com	agorabookcafe.com	325
200510	agoracosmopolite.com	agoracosmopolitan.com	271
200510	agoracosmopolite.com	agoracosmopolite.com	8319
200510	agoracosmopolite.com	genesmedia.com	325
200510	bloc.org	go.microsoft.com	22
200510	blocpot.qc.ca	blocpot.qc.ca	104
200510	blocpot.qc.ca	marijuanaparty.org	16
200510	blocpot.qc.ca	norml.org	16
200510	blocquebecois.org	bernardbigras.qc.ca	16
200510	blocquebecois.org	bloc.org	1069
200510	blocquebecois.org	blocquebecois.org	276682

You can download your own sample file here, which draws on the Canadian Political Party and Political Interest Groups collection.

But how do you turn this into a beautiful Gephi visualization?

Easy!
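As one possible route, and not necessarily the one the full post takes, here is a short Python sketch that turns a link file like the one above into an edge list. It assumes each row is tab-separated and holds a crawl date, a source domain, a target domain, and a link count, and that Gephi's spreadsheet import will pick up columns named Source, Target, and Weight. The function names are made up for this example.

```python
import csv
from collections import defaultdict


def links_to_gephi_edges(lines, include_self_links=False):
    """Sum link counts per (source, target) pair across all crawl dates.

    Assumed row layout: crawl date, source domain, target domain, count,
    separated by tabs. Rows that do not have four fields are skipped.
    """
    weights = defaultdict(int)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue
        _crawl_date, source, target, count = parts
        # Links from a site to itself tend to dominate the counts, so they
        # are dropped by default; pass include_self_links=True to keep them.
        if source == target and not include_self_links:
            continue
        weights[(source, target)] += int(count)
    return sorted((s, t, w) for (s, t), w in weights.items())


def write_gephi_csv(edges, out_file):
    """Write edges as CSV with Source, Target, Weight column headers."""
    writer = csv.writer(out_file)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
```

In use, you would open the downloaded sample file, pass the file object to `links_to_gephi_edges`, write the result with `write_gephi_csv`, and then import that CSV into Gephi as an edge table. Summing across crawl dates gives one static graph; keeping the date column instead would be the starting point for a dynamic one.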