A screenshot of our web archives for historians page, showing the main splash page.

In early March, Peter Webster of the British Library and I launched “Web Archives for Historians,” a crowdsourced bibliography of works written by historians who use or think about how we can use web archives (read his launch post here as well). We also have a second tab that asks people who are interested in web archives to fill out a form, so we can get a sense of who might be out there!

Our criteria are relatively specific, focused particularly on the interplay between historians and web archives:

We want to know about works written by historians covering topics such as: (a) reflections on the need for web preservation, and its current state in different countries and globally as a whole; (b) how historians could, should or should not use web archives; (c) examples of actual uses of web archives as primary sources. Work concerned with online representation of the more distant past is also within our scope.

So far the bibliography is growing, and we just added some content today. Whenever updates are made we’ll also make sure to tweet on our Twitter account, @HistWebArchives. If you’re interested, please visit our site or follow us.

It’s my sincere hope that we can start to build a historian/web-archive specific community. I’ve argued elsewhere that I think historians need to work towards further engagement with web archives, as they’ll become one of the primary sources to understand social, cultural, political, etc. life from the mid-1990s onwards. They present a tremendous resource, but they’re on such a different scale than the normally scarce archival resources that most historians are professionally trained with, so let’s start laying the groundwork now.

I look forward to seeing you over at Web Archives for Historians.

(x-posted with ActiveHistory.ca)
By Ian Milligan

“Thirty goons break into your office and confiscate your computers, your hard drives, your files.. and with them, a big chunk of your institutional memory. Who you gonna call?” These were the words Bob Garfield used in a recent episode of On the Media to address the storming of the Crimean Center for Investigative Journalism. On Saturday, March 1st, 2014, during the Russian occupation of the Crimea, men with guns stormed and occupied the Center’s offices. The staff fled, managing to take only part of their files and equipment. Over the rest of the weekend, the Center reached out to the Internet Archive to preserve their web material. The episode attracted the attention of the global media, web archivists, and historians. Historians deal with source losses all the time – sources destroyed by events (from wars, political malfeasance, and so forth) – but here we see how quickly the process of archiving and preserving has sped up.

The Internet Archive, which I’ve written about before for ActiveHistory, tries to back up much of the publicly-accessible web. It had not, however, captured comprehensive holdings of this particular site. If something happened, if the servers were wiped, there were fears that all of the Center’s past stories, information, and so forth would be lost. These would be critical for the group, but also, of course, for historians. So from their offices in San Francisco, the Internet Archive’s Archive-It service carried out a comprehensive sweep of the Crimean Center for Investigative Journalism’s website, capturing it 14 times between March 1st and 19th. 5,185 videos have been captured. Should they be taken down from YouTube, they are now preserved.


Taking the full text of my sample of Canadian (.ca only) websites (currently being finessed, and amounting to 622,365 URLs out of a scrape total of 8,512,275, or 7.31%), I ran it through Stanford NER and extracted popular locations, organizations, and people. This is a morning’s work, done mainly as I let my desktop crunch away at some other stuff, so I really need to preface the post by saying that the data has not been cleaned up.
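A minimal sketch of the tallying step, assuming the tagger's output has been saved in a slash-tagged form such as `Toronto/LOCATION`. That format, the function name `count_entities`, and the sample sentence are assumptions made for illustration, not a record of the exact pipeline used here.

```python
from collections import Counter
from itertools import groupby


def count_entities(tagged_text, wanted=("LOCATION", "ORGANIZATION", "PERSON")):
    """Count entities in slash-tagged text (assumed format: word/TAG)."""
    tokens = []
    for item in tagged_text.split():
        # Split on the last slash so words containing "/" keep their text.
        word, _, tag = item.rpartition("/")
        tokens.append((word, tag))

    counts = {tag: Counter() for tag in wanted}
    # Consecutive tokens sharing a tag are treated as one entity,
    # so "Ian/PERSON Milligan/PERSON" becomes "Ian Milligan".
    for tag, group in groupby(tokens, key=lambda t: t[1]):
        if tag in counts:
            counts[tag][" ".join(word for word, _ in group)] += 1
    return counts


# Hypothetical sample, not drawn from the real scrape.
sample = (
    "Ian/PERSON Milligan/PERSON visited/O Toronto/LOCATION and/O "
    "Ottawa/LOCATION ,/O then/O Toronto/LOCATION again/O"
)
result = count_entities(sample)
print(result["LOCATION"].most_common(5))
```

One limitation of this approach: two different entities of the same type that sit directly next to each other would be merged into one, which is another reason to treat the counts as rough.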

The results were interesting but fairly dry: “Canada” was the top location, for example, followed by Ontario, Toronto, Ottawa, Alberta, etc. The United States comes out as under-represented mainly because we have so many spellings of the word (US, u s, America, United States, etc.). There will be a similar issue with the United Kingdom. If this turns into ‘real’ research rather than tinkering, again, there’s a lot of cleaning up to do. But overall, we can get a rough sense of different countries and how they appeared in this sample.
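The spelling problem could be handled with a small alias table applied before counting. The sketch below is only an illustration: the alias list is hypothetical and far from complete, and folding "America" into the United States is itself a judgment call that would need checking against the data.

```python
import re
from collections import Counter

# Hypothetical alias table; a real one would be built by inspecting the data.
ALIASES = {
    "us": "United States",
    "u s": "United States",
    "usa": "United States",
    "america": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "u k": "United Kingdom",
    "britain": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def normalise(name):
    """Map a raw place name to a canonical form where an alias is known."""
    key = re.sub(r"[^a-z ]", " ", name.lower())  # drop punctuation: "U.S." -> "u s "
    key = " ".join(key.split())                  # collapse repeated spaces
    return ALIASES.get(key, name.strip())


def merge_counts(raw_counts):
    """Re-total counts after normalising each name."""
    merged = Counter()
    for name, n in raw_counts.items():
        merged[normalise(name)] += n
    return merged


# Made-up numbers, purely to show the mechanics.
raw = {"US": 3, "u s": 2, "America": 1, "United States": 4, "Canada": 10}
merged = merge_counts(raw)
print(merged.most_common())
```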

Thanks to IBM Many Eyes we can throw this stuff at the wall and see what comes out.


In this graphic, we see different countries and how they are mentioned. With the caveat that this is rough data, we can see the big parties emerge. Canada drowns out all others, unsurprising given the sample. But the other big parties emerge: the United States, Russia, China, Brazil, Australia, and Western Europe. There is almost nothing in sub-Saharan Africa, however, which tells you something about how this comes together. The way that this map emerges, I think, shows that it’s working a bit.

Let’s take Canada alone: (more…)

OK, you’re all forgiven: when you hear ‘open data,’ the first thing that springs to mind probably isn’t a historian (to some historians, it’s the first episode of the BBC show ‘Yes, Minister’). In general, you’d be right: most open data releases tend to have to do with scientific, technical, statistical, or other applications (releasing bus route information, for example, or the location of geese at the UW campus). Increasingly, however, we’re beginning to see a trickle of historical open data.

Open government is, in a nutshell, the idea that the people of a country should be able to access, read, and even manipulate the data that a country generates. It is not new to Canada: Statistics Canada has been running the Data Liberation Program since at least late 1996, and there have been predecessors before that, but the current government has been pushing an action plan which has materialized in data.gc.ca.

While I am not a fan of the current government’s approach to knowledge more generally, I am happy with the encouraging moves in this realm. Criticism of the government is often very deserved, but we should celebrate good moves when they do happen, however slowly this may occur. Indeed, if the government is opening up their data, maybe it should inspire publicly-funded scholars to do the same: think of what we could learn from the quantitative findings of the Canadians and their Pasts project, for example!

In this post, I want to show some of the potential that is there for learning about the past through Canadian open data (drawing on some of the provincial datasets too), in the hopes that this will spur interest in maybe getting more released. I even have a little bit for everybody: There’s data here from which political, military and social historians can draw.  Let me show you how. (more…)

In my last post, I walked people through my thoughts as I explored a large number of images from the Wide Web Scrape (using, as noted there, methods from Lev Manovich). In this post, I want to put up three images and think about how this method might help us as historians. Followers of my research might know that I am also playing around with the GeoCities web archive. GeoCities was arranged into neighbourhoods, from the child-focused EnchantedForest to the Heartland of family and faith or the car enthusiasts of MotorCity. Each neighbourhood was, in some ways, remarkably homogeneous.

Let’s take every JPG from the ‘Athens’ (the teaching/philosophy/etc. focused area of GeoCities) and see what we find. (more…)

As followers of this blog know, one of my major research activities involves the exploration of the 80TB Wide Web Scrape, a complete scrape of the World Wide Web conducted in 2011 and subsequently released by the Internet Archive. Much of this to date has involved textual analysis: extracting keywords, running named entity recognition routines, topic modelling, clustering, setting up a search engine on it, etc. One myopia of my approach has been, of course, that I am dealing primarily with text whereas the Web is obviously a multimedia experience.

Inspired by Lev Manovich’s work on visualizing images [click here for the Google Doc that explains how to do what I do below], I wondered if we could learn something by extracting images from WARC files. I took the WARC files connected to the highest overall percentage of .ca domain files, drawing on my CDX work, and quickly used unar to decompress them. The files that I drew on were the ten WARC.GZ files from this collection, totalling 10GB compressed or

I then used Mathematica to go through the decompressed archives, look for JPGs (as a start, I’ll expand the file type list later), and then transform each image into a 250 x 250 pixel square. As there were 50,680 images, this was a bit lower resolution than I normally use but I felt that this was ideal. Using Manovich’s documentation above, I then took these 50,680 images and created a montage of them. Each image was shrunk down even further so that the file size would be manageable and so that I wouldn’t have to worry about copyright when I posted it here. (more…)
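To give a sense of the scale involved, here is a rough Python sketch of the planning side of such a montage: finding the JPGs in a directory of decompressed files and working out a near-square grid for them. It is not the Mathematica workflow described above; the function names are hypothetical, and the actual resizing and pasting would need an image library, which is left out here.

```python
import math
from pathlib import Path

JPG_SUFFIXES = {".jpg", ".jpeg"}


def find_jpgs(root):
    """Return every JPG under root, matching the extension case-insensitively."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in JPG_SUFFIXES
    )


def grid_shape(n_images):
    """Smallest near-square grid (columns, rows) that holds n_images tiles."""
    if n_images <= 0:
        return (0, 0)
    cols = math.isqrt(n_images - 1) + 1  # ceiling of the square root
    rows = -(-n_images // cols)          # ceiling division
    return (cols, rows)


def tile_origin(index, cols, tile_px):
    """Top-left pixel of the index-th tile, filling the grid row by row."""
    row, col = divmod(index, cols)
    return (col * tile_px, row * tile_px)


cols, rows = grid_shape(50680)
print(cols, rows)              # grid dimensions for 50,680 images
print(cols * 250, rows * 250)  # full canvas size at 250 px per tile
```

At 250 pixels per tile the canvas works out to 56,500 by 56,250 pixels, which illustrates why each image had to be shrunk further before the montage could be posted.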

I’m not a political historian, although I did serve as the Secretary-Webmaster of the Political History Group for two years. Today, after having today’s class all prepared but not quite in the right mind frame to look over some manuscript drafts, I decided to play with some of the data that you can find in Canada’s open data repository.

I view this as sort of “putting in the hours” like a pilot would: I now find myself so wrapped up in writing and using off-the-shelf analysis software that it’s good to keep my data analysis and programming skills a bit honed. But I was also thinking: it’s a great example of showing what, with a bit of computational skill, you can learn in about thirty minutes of unstructured data play. And, like my post from yesterday, my dream with all of this is that somebody will stumble across this and decide that there is some potential for their own work.

In any event, this morning I stumbled across History of the Federal Electoral Ridings, 1867-2010 and grabbed the English-language CSV file. It’s a big one, containing the information of 38,778 candidates for federal office in Canada. It’s a thirteen column file, with the following entries: (more…)
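As a sketch of the kind of thirty-minute data play this allows, the snippet below tallies the values in one column of a CSV file. The column names (`Riding`, `Party`, `Candidate`) and the sample rows are invented for illustration and are not the real headers or contents of the Elections dataset; swap in the actual header once the file is open.

```python
import csv
import io
from collections import Counter


def tally_column(csv_file, column):
    """Count how often each value appears in one column of a CSV file."""
    reader = csv.DictReader(csv_file)
    # Skip rows where the column is missing or empty.
    return Counter(row[column].strip() for row in reader if row.get(column))


# Invented sample data; column names are hypothetical.
SAMPLE_CSV = (
    "Riding,Party,Candidate\n"
    "Riding A,Party X,Candidate One\n"
    "Riding A,Party Y,Candidate Two\n"
    "Riding B,Party X,Candidate Three\n"
)

party_counts = tally_column(io.StringIO(SAMPLE_CSV), "Party")
print(party_counts.most_common())
```

With a real file, the same function would be called as `tally_column(open(path, newline="", encoding="utf-8"), "Party")`, with the encoding being another assumption to verify against the download.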