Last post (a three-day conference deserves three posts, right?) for my trip to Hannover, Germany for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation, and wanted to link to it here.
You can listen to the podcast here. Hopefully I am cogent enough!
It grew out of my lightning talk and poster, also available on my blog.
My thanks again to the VolkswagenStiftung for the generous travel grant to make my attendance possible. It was a wonderful conference.
Yesterday, I gave my talk on “Finding Community in the Ruins of GeoCities.” Here is my poster:
(You can download a PDF version here)
My poster – click to see a (very) large version.
I was fortunate to receive a travel grant to present my research in a short, three-minute slot plus poster at the Herrenhäuser Konferenz: Big Data in a Transdisciplinary Perspective in Hanover, Germany. Here’s what I’ll be saying (pretty strictly) in my slot this afternoon. Some of it is designed to respond to the time format (if you scroll down you will see that there is an actual bell).
If you want to see the poster, please click here.
Big Data is coming to history. The advent of web archived material from 1996 onwards presents a challenge. In my work, I explore what tools, methods, and approaches historians need to adopt to study web archives.
GeoCities lets us test this. It will be one of the largest records of the lives of non-elite people ever assembled. The Old Bailey Online can rightfully describe its 197,000 trials, spanning 1674 to 1913, as the “largest body of texts detailing the lives of non-elite people ever published.” But GeoCities, drawing on the material we have between 1996 and 2009, has over thirty-eight million pages.
These are the records of everyday people who published on the Web, reaching audiences far bigger than previously imaginable.
While this blog has mostly focused on some of my new work lately, my first book – which grew out of my doctoral dissertation and was published last summer – has been getting some positive early buzz.
Most importantly, I learned on Wednesday that my book Rebel Youth: 1960s Labour Unrest, Young Workers, and New Leftists in English Canada was shortlisted for the Sir John A. Macdonald Prize, which is given to the best book of Canadian historical non-fiction. You can read about it (and the four other books that made the cut) here. I’m shocked and honoured to be in such esteemed company. The final results will be announced in early June at the Canadian Historical Association’s annual meeting. Hope to see some of you there.
Otherwise, the first few reviews have come in… and they’re positive so far (I know that this will change). BC Studies has a positive advance review up on their website. Blacklock’s Reporter was first out of the gate in their review, writing: “Ian Milligan, professor of history at the University of Waterloo, corrects the record through meticulous research and interviews.” Briarpatch Magazine includes it in a review of three books from UBC Press, noting that the books serve as “important reminders that history is not easily bookended by the arbitrary markers of decades and is instead a living process that holds new discoveries for those willing to dig deeper and learn vital lessons for today’s struggles.” Finally, Choice Reviews Online has given it a “highly recommended” rating in their review, arguing that “[t]his clearly and accessibly written book is a wonderful contribution to the history of the turbulence of the 1960s.”
Forgive my horn tooting. We’ll be back to our regularly-scheduled programming soon – and I actually have some pretty cool news to share with everybody in a few weeks (he says cryptically).
This script helps us get longitudinal link analysis working in Gephi
When playing with WAT files, we’ve run into the issue of getting Gephi to work reliably with the shifting node IDs generated by the Web Archive Analysis Workshop. It’s certainly possible – this post demonstrated the outcome when we could get it working – but it’s extremely persnickety. That’s because node IDs change for each time period you’re running the analysis: e.g., liberal.ca could be node ID 187 in 2006, but then be remapped to node ID 117 in 2009. It’s best if we just turn it all into textual data for graphing purposes.
Let me explain. If you’re trying to generate a graph of the host-ids-to-host-ids, you get two files. They come out as Hadoop output files (part-m-00000, etc.), but I’ll rename them to match the commands used to generate them. Let’s say you have:
[note: this functionality was already in the GitHub repository, but not part of the workshop. There is lots of great stuff in there. As Vinay notes below, there’s a script that does this – found here.]
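The substitution itself is simple text processing. Here is a minimal Python sketch of the idea – note that the file names and the tab-separated layout below are my assumptions for illustration, not the workshop’s actual output format:

```python
# Sketch: replace numeric node IDs with hostnames in an edge list, so
# graphs from different crawl years line up in Gephi.
# Assumed (hypothetical) inputs:
#   id-map.tsv: <node_id> TAB <hostname>
#   edges.tsv:  <src_id> TAB <dst_id> TAB <link_count>
import csv

def remap_edges(id_map_path, edges_path, out_path):
    # Build the node_id -> hostname lookup table
    with open(id_map_path, newline="") as f:
        id_to_host = {row[0]: row[1] for row in csv.reader(f, delimiter="\t")}

    # Rewrite the edge list using hostnames instead of numeric IDs
    with open(edges_path, newline="") as f, open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for src, dst, count in csv.reader(f, delimiter="\t"):
            writer.writerow([id_to_host[src], id_to_host[dst], count])
```

Because a hostname like liberal.ca is stable across crawl years while the workshop’s numeric IDs are not, the remapped edge lists can be compared directly no matter which year generated them.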
1,719,167 links between roughly 622,000 .ca websites captured in 2011.
In this short post, I want to take you from a collection of WARC files to a webgraph, which you can see pictured at left.
If you’ve been following me on Twitter, you know that I’ve been playing around with warcbase, part of my overall exploration into web archives. Warcbase is an “open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge.” While I’m not terribly fluent with these platforms – getting better at them is one of the goals of my upcoming junior sabbatical – Jeremy Wiebe, who’s been working with me on this project, is. He wrote a tutorial, “Building and Running Warcbase under OS X” for me.
Using Wiebe’s tutorial, you can ingest data into HBase and do a variety of things with it. For this task, I focused on the last section, labelled ‘Pig Integration.’ If you want to ingest WARC files rather than ARC files, you need to call “WarcLoader” instead of “ArcLoader” on line 3. You then set your directory with WARCs on line 6, and your output directory on line 12.
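Under the hood, the Pig script is doing two things: pulling the links out of each archived page, and aggregating them host-to-host. Warcbase does this at scale on Hadoop; purely as a conceptual illustration (this is not warcbase’s actual code, and the function names are mine), the core step looks something like this in standard-library Python:

```python
# Conceptual sketch of the link-extraction step: given archived pages,
# count host-to-host links to build a webgraph edge list.
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def host_edges(pages):
    """pages: iterable of (source_url, html_text) pairs.
    Returns a Counter mapping (source_host, target_host) -> link count."""
    edges = Counter()
    for url, html in pages:
        parser = LinkExtractor()
        parser.feed(html)
        src = urlparse(url).netloc
        for href in parser.links:
            dst = urlparse(href).netloc
            if src and dst:  # keep only absolute links with a resolvable host
                edges[(src, dst)] += 1
    return edges
```

Each (source host, target host) pair with its count is one line of the textual edge list that eventually lands in Gephi.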
The long-awaited Tri-Agency Open Access Policy on Publications arrived today. Just before I went into a meeting, I decided to tweet a quick announcement about it. For many Canadian scholars, I think this was the first they’d heard of it!
This move will require all grant recipients funded by the Social Sciences and Humanities Research Council (SSHRC) or its sister agencies, the Natural Sciences and Engineering Research Council (NSERC) and the Canadian Institutes of Health Research (CIHR), to make their peer-reviewed journal publications freely accessible within twelve months. I’ve been waiting for this news to break for months: it was apparently supposed to surface back in October 2014 during Open Access Week, but it’s been suspected that the dreadful shootings in Ottawa that month may have delayed it. That’s rumour, though, so don’t put too much stock in that.
Given the response to my little tweet, I thought a blog post might be useful. Bear in mind that this was written in roughly the hour following the announcement, so more details may emerge and I’m sure thoughts will evolve over the next few days and weeks.
I’m personally happy about this move for a number of reasons: