Playing with the Web Archive Analysis Workshop: A Few Tips for Fellow Tinkerers

Yesterday, I spent part of the day tinkering with the Web Archive Analysis Workshop, put together by the Internet Archive’s Vinay Goel. It’s a great workshop that really lets us play with WARC files and extract tons of useful data from them. In this post, I won’t rehash the commands, but I do want to show a few things that I needed to do to get it up and running, and then show some of the potential.

This is mostly just a research notebook for anybody else who stumbles down the path after me.

Getting it Started

The instructions for installation are pretty straightforward. I think I just ran brew install hadoop to get it all working. On my version of OS X, I received an error message when running several of the scripts:

ERROR 2998: Unhandled internal error. org/python/core/PyObject

I was missing a standalone Jython jar in my pig installation. To fix this, I downloaded Jython from here, generated a standalone jar and copied it to my pig install like so:

java -jar jython-installer-2.5.3.jar -s -d /tmp/jython-install -t standalone -j $JAVA_HOME
cp /tmp/jython-install/jython.jar /usr/local/Cellar/pig/0.13.0/lib 

That got everything working. For convenience, so that I could set things up again quickly after a reboot, I created a script with the following commands, which I run from the directory holding my content (in my case, /users/ianmilligan1/internet-research/):

export PROJECT_DIR=`pwd`/archive-analysis/
export DATA_DIR=`pwd`/sample-dataset/
export PIG_HOME=/usr/local/Cellar/pig/0.13.0
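
If the ERROR 2998 message shows up again after an upgrade or reinstall, a small check like the following can confirm whether the standalone Jython jar is still where Pig expects it. This is only a sketch: the function name check_jython_jar is my own invention, and it assumes the jar was copied to the lib directory under PIG_HOME with the file name jython.jar, as in the cp command above.

```shell
# Hypothetical helper (not part of the workshop code): report whether the
# standalone Jython jar is present in the Pig lib directory.
# Usage: check_jython_jar [pig_home]   (defaults to $PIG_HOME)
check_jython_jar() {
  pig_home="${1:-$PIG_HOME}"
  if [ -f "$pig_home/lib/jython.jar" ]; then
    echo "Found $pig_home/lib/jython.jar"
    return 0
  fi
  # Assumption: the jar is named jython.jar, as in the cp command above.
  echo "Missing $pig_home/lib/jython.jar (scripts may fail with ERROR 2998)" >&2
  return 1
}
```

Calling check_jython_jar at the end of the setup script, after PIG_HOME has been exported, would flag the problem before any Pig job is launched.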

Running the Code and Analyzing Results with Bash and Excel: A Few Tips

It should all run quite well, and the script is pretty straightforward. As with other Pig programs that I’ve run in the past, output is written to subfolders of the ~/internet-research/sample-dataset/derived-data directory as partial files, e.g. part-00000 or part-00000.gz.
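
Because the results arrive as several part files, some plain and some gzipped, it can help to stitch them into a single file before opening them in Excel. The function below is a minimal sketch of one way to do that; the name merge_parts and the output file name in the usage note are my own assumptions, not something the workshop provides.

```shell
# Hypothetical helper: concatenate every part-* file in a Pig output
# directory into one file, decompressing any gzipped parts on the way.
# Usage: merge_parts <output_dir> <merged_file>
merge_parts() {
  dir="$1"
  out="$2"
  : > "$out"                      # create or empty the merged file
  for f in "$dir"/part-*; do
    [ -e "$f" ] || continue       # skip if the glob matched nothing
    case "$f" in
      *.gz) gzip -dc "$f" >> "$out" ;;   # gzipped part: decompress to stdout
      *)    cat "$f" >> "$out" ;;        # plain part: append as-is
    esac
  done
}
```

For example, merge_parts "$DATA_DIR/derived-data/some-subfolder" merged.tsv would produce one file, where some-subfolder stands in for whichever derived-data subfolder a given script wrote to.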

Exploding ARC Files with Warcbase

224,977 URLs with 254,702 hyperlinks between them (from a random sample of 2008 CommonCrawl data).

With the help of RA extraordinaire Jeremy Wiebe, I’ve been playing with Jimmy Lin’s warcbase tool. It’s:

… an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Jeremy’s been working on documenting it for OS X, which I’m hoping to share soon, but it’s a neat suite of tools to play with web archived files. While right now everything works with the older, now-deprecated ARC file format (they note the irony given the name), Jeremy has been able to add support for WARC ingestion as well.

What can we do with it?

Accessing Treasure Troves of Data: Empowering your own Research

[x-posted with]

This post is a bit technical. My goal is to explain technical concepts related to digital history so people can save time and not have to rely on experts. The worst thing that could happen to digital history is for knowledge to consolidate among a handful of experts.

From the holdings of Library and Archives Canada, to the Internet Archive, or smaller repositories like digitized presidential diaries, or Roman Empire transcriptions, there are a lot of digitized primary sources out there on the Web. You don’t need to be a “digital historian” to realize that sometimes there is a benefit to having copies of these sources on your own computer. You can add them to your own research database, make them into Word Clouds (I know, they’re not perfect), or find ways to manipulate them with tools such as Voyant-Tools, spreadsheet software, or many other tools that are available. If you can download sources, you may not have to physically travel to an archive, which to me suggests a more democratic access to sources.

Digital historians have been working on teaching users how to access the databases that run online archival collections and how to harness this information for your own research. In this post, I want to give readers a quick overview of some of the resources out there that you can use to build your own repositories of information. If you ever find yourself clicking at your computer, hitting ‘right click’ and then ‘save page as,’ or downloading PDF after PDF after PDF… this post will help you better utilize your computer’s tools, making the digital research process a bit quicker.

Public Lecture at Acadia University on “Big Data and History,” November 28th

I’ll be giving a talk entitled “Big Data and History: How Web Archives Will Challenge, Complement and Enhance the Historical Profession” at the Acadia Institute for Data Analytics on November 28th at 3PM, in BAC 132. I’m going to be making the case that historians need to start thinking about data, drawing on arguments around digital preservation, understanding sources, the rise of web archives, and featuring some examples from my own work with GeoCities and the Wide Web Scrape.

A more eloquent abstract:

“Big Data and History” argues that we need to understand the implications of the arrival of new archives: web collections. These collections of websites aggregated into single files necessitate a rethinking of how historians will approach their professional standards and trainings, with particular implications for historians studying topics involving the 1980s onwards. While historians are normally accustomed to not having enough information about their topic, the problem for many is now shifting towards having far too much data. How can humanities-based researchers begin to grapple with these problems?

If you can make it, the event page is available here. I’m really excited to have the opportunity to go back to beautiful Wolfville, Nova Scotia (my partner did her undergraduate degree there, and my friend and colleague Thomas Peace used to teach there, so I’ve heard so much about the university).

SSHRC’s Research Data Archiving Policy and Historians

Whenever I even think about archival trips, my back pre-emptively aches. It involves sitting or standing near documents, taking digital photographs. And I know that if I looked around the archive, chances are that nine out of ten of my colleagues are doing something similar (and yes, because the plural of an anecdote is not data, Ithaka S+R has reported on this widespread trend in historical research).

When we all travel home to our universities, those historians who are travelling on SSHRC’s dime will surely deposit their research data (photos?) after a reasonable amount of time with their organization’s institutional repository or on some other sharing website, right, to make sure their publicly-funded research is made accessible? SSHRC just wants “qualitative information in digital format,” so maybe our photos, or just our notes, right? </sarcasm>

I wager – unscientifically, based only on anecdotal conversations at the Canadian Historical Association, on Twitter, and in hallways – that the vast majority of historians in Canada would be opposed to the very idea, even if their work was generously funded. The value of our work is too wrapped up in the scarcity of sources themselves, rather than just the narratives that we weave with them.

Short Interview on CBC’s The Current on Historians and Big Data

I did a short interview on CBC Radio’s The Current with Anna Maria Tremonti, which aired this morning. I was responding to some of the utopian arguments made by Christian Rudder’s book Dataclysm, noting that while the historical record is going to be enriched by digital sources, we’ve got to consider issues of access, preservation, and funding. I was nervous, but I think I got my main points across pretty decently.

The talk is available here: “Historians want Canada to give them access to Big Data.”

It was a fantastic experience, and it really did get me thinking about how it would be rewarding to build the capital to get website legal deposit in Canada — or at the very least, to get the preservation of digital resources a little bit more on the table. Maybe I’ll try my hand at some popular writing.

The Future of the Library in the Digital Age? Worrying about Preserving our Knowledge

X-Posted with

By Ian Milligan

Yesterday afternoon, in the atrium of the University of Waterloo’s Stratford Campus, a packed room forewent what was likely the last nice weekend of summer to join Peter Mansbridge and guests for a discussion around “What’s the future of the library in the age of Google?” It was aired on CBC’s Cross Country Checkup on CBC Radio One, available here. It was an interesting discussion, tackling major issues such as what local libraries should do in the digital age, issues of universal accessibility, and whether we should start shifting away from a model of physically acquiring sources (notably books) towards new models for the 21st century. Historians, and those who care about history, have much to contribute to these sorts of conversations. Those who know me or have read my writings over the last three years know that I’m not a luddite. But I came away worried about some of the assumptions made in the conversation, and what they mean for us who write about the past.

A big crowd of folks who care enough about libraries to spend a beautiful Sunday afternoon in a university building lobby.

I don’t want to rehash the conversation, as you could rewatch it, but a brief summary of some of the main themes might help. The broadcast began with Peter Mansbridge asking the major question, “Digital technology is changing the way we store information, and how we learn from it. Does it make sense to stack printed books in costly buildings when virtual libraries are just a mouse-click away?” Mansbridge was joined by Christine McWebb, director of academic programs at the Waterloo Stratford Campus, and Ken Roberts, former chief librarian of the Hamilton Public Library and a member of the Royal Society of Canada’s Expert Panel on the Future of Libraries and Archives in Canada.