Next week I’ll be giving a three-minute lightning talk at the “(Digital) Humanities Revisited – Challenges and Opportunities in the Digital Age” conference in Hannover, Germany.
This format is very challenging: I’ve chosen to start off provocatively, as you can see in the slide at right, and then move into the specifics of some of my work and what I think we need to start doing to equip historians to play with these new forms of primary sources. Since it’ll be a real distillation of some of the things that I have been doing, I’ll try to record it and at least post the text here as well.
And hopefully, between finishing up some marking for my undergraduate courses, I’ll be able to take in the sights of northern Germany in early December!
A sample of the output, full version posted below.
I’ve had an entire day to mostly play with my research today (which is pretty rare in November). I had a chance to update a script I worked on a few months ago; it required some tinkering to get up and running under OS X 10.9, and at this point it still unfortunately requires a Mathematica license, but as a concept I hope some might find it interesting.
Everything is up on my GitHub here.
Basically, with the prerequisites installed, this is a script that, with one command:
sh ./WARC-to-Analysis-ner.sh ianmilligan.warc "history"
ianmilligan.warc can be replaced by your own WARC file, and
"history" by a keyword that you might be particularly interested in viewing in context.
It does the following:
- Turns your raw WARC file into a full-text searchable repository using WARC Tools;
- Uses MALLET to topic model your full text;
- Generates a PDF, using Mathematica, that can serve as a rudimentary overview of what that web archive contains. By default, you get the following:
- Web URL
- Date of WARC Scrape
- Short Text Preview
- Simple Word Cloud of Frequency
- Keyword in Context of the selected keyword
- Top 50 extracted people names
- Top 50 extracted location names
- Top 50 extracted organization names
- MALLET Topics, arranged with sparklines so that you can see how the topic is distributed throughout the archive. Is it a widely-distributed topic, or is it present in only a few parts of the archive?
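The Mathematica notebook does the heavy lifting for the PDF, but two of the overview components – the word-frequency list behind the simple word cloud, and the keyword-in-context view – can be sketched in a few lines of plain Python. This is illustrative only; the function names are mine, not the script’s:

```python
import re
from collections import Counter

def word_frequencies(text, top_n=50):
    """Count word frequencies -- the raw input for a simple word cloud."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

def keyword_in_context(text, keyword, window=4):
    """Show each occurrence of `keyword` with a few words of context on either side."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append(f"{left} [{w}] {right}")
    return hits
```

Called with the keyword from the command line – e.g. `keyword_in_context(page_text, "history")` – this returns each hit with its surrounding words, which is essentially what the keyword-in-context section of the PDF displays.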
I am planning to test it out on the 80TB Wide Scrape as well as my GeoCities archive, so hopefully we’ll get some ‘in the field’ responses about whether this is useful or not! Edited to quickly note that the NER results are really, really messy – but I think at a glance they give you a sense of what an archive contains. Am looking forward to playing with this some more though.
Example output below (click ‘continue reading’ if on my home page):
Last week, my new article “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010” appeared in the December 2013 issue of the Canadian Historical Review. It is not yet available on Project MUSE (it will be shortly), but if your institution is a subscriber to the journal itself you may be able to get access here.
[Edited to add: Project MUSE has been updated and it is there now as well]
I wanted to highlight it for two reasons:
Firstly, I think the argument itself is important. In a nutshell, I look at how the online availability of newspapers changes how we write our histories. As I put it, historians cite what is online. Yet we have not made this explicit. In this article, I argue that we need to do so.
Secondly, it reflects my new approach to scholarly dissemination: from blog post to peer-reviewed article. In late March 2012, I published “Illusionary Order: Cautionary Notes for Online Newspapers” on ActiveHistory.ca. It had an overwhelmingly positive reception: reprinted by History News Network, discussed on H-Net, linked to by many libraries, the American Historical Association, and extensively spread around on Twitter. It also ended up on DH Now. I had been planning to make it an article, but this cemented the plans. I stepped up my research, mined over a thousand Canadian history dissertations, and after submitting it and going through the review process it was accepted in March 2013. Given that my goal is to make this a part of graduate pedagogy in Canadian history, the Canadian Historical Review was arguably the best place to publish this piece.
If it hadn’t been for blogging, I don’t think it would have been so quick. Anyways, I will update when it is available on Project MUSE as I suspect that will be better for many of you.
Continue reading for an introduction to whet your appetite:
My autumn 2011 BC Studies article “Coming off the Mountain: Forging an Outward-Looking New Left at Simon Fraser University” is now available freely here. In the article, I explore the outward-looking manifestations of the originally campus-based student New Left, and how, out of the ‘troubles’ at SFU, we see a move into the broader Vancouver community.
This also has the distinction of being the last article that I published out of my dissertation, which will soon be appearing in heavily-revised form as Rebel Youth: 1960s Labour Unrest, Young Workers, and New Leftists in English Canada (UBC Press, Spring 2014).
There’ll be more on that over the next few weeks, so stay tuned!
It’s like a real estate report… for the historical Internet.
[a random blog post of some thoughts, mostly just to keep myself thinking about stuff]
With GeoCities, I have been particularly interested in the neighbourhoods and communities that have formed. How cohesive were they? Were they virtual communities (in this, getting into the debates kicked off by folks like Howard Rheingold, Lori Kendall, and Constance Porter amongst many others)? Did they link to each other? How did web rings work in this community? etc. etc. The neighbourhoods are pretty cool indeed. The community metaphor found its best expression when it came to the neighbourhoods that made up GeoCities. As the authors of Creating GeoCities Websites noted in their book: members “aren’t simply customers; they’re Homesteaders. Because GeoCities is more a community than simply a place to store a few Web pages, the goal is to make all members feel at home.” And these homes required neighbourhoods. The system was made up of a series of neighbourhoods, each with their own thematic currents and internal structures.
SEASR running after two quick clicks.
A few years ago, I was really into the SEASR MEANDRE workbench work environment: a visual way to program complicated ‘flows’ of various components hooked together. Its easy and casual integration with OpenNLP, basic visualizations, and a host of textual analysis tools was really appealing (for example, see some playing I did with historical documents and sentiment analysis). I saw the teaching possibilities in it. It was hindered, however, by its steep technical learning curve: installation required a very basic working knowledge of MongoDB, how to execute Scala, as well as a casual fluency on the command line to run programs, kill processes, and to tinker.
The other day, however, I reinstalled MEANDRE, and it is a lot easier to get up and running now. With some testing on other platforms, it’s almost ready to play with in a third-year digital history class!
Checking out an 1857 book from the Internet Archive, no big deal.
[x-posted with ActiveHistory.ca]
By Ian Milligan
For many students, it’s back to school season. For me, that means it is time to think about some of the resources and tools that are out there. If you want to research a topic, it’s worth keeping in mind some great repositories online. The big one is the Internet Archive – which is not just old websites.
I’ve written about the Internet Archive before, and it’s actually the main source base for my current major research project. But today I want to give a brief sense of what else you can find there in terms of digitized primary sources, amongst this massive newfangled Library of Alexandria that should be so central to many of our workflows. If you’re a historian, or are interested in history, I guarantee you’ll find something useful in the Internet Archive. Heck, if you use Mozilla Firefox, install a search plug-in right now for it. We’ll be here when you get back.
The inspiration for this post is the accomplishment of yet another major milestone: two million books, all freely downloadable, generated by a large network of some 33 scanning centres around the world. And that’s just books – there are additionally millions of texts.
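One practical detail worth knowing: every Internet Archive item lives at a predictable URL built from its identifier, so grabbing a scanned book programmatically is just string assembly. A minimal sketch in Python (the identifier and filename here are made up for illustration):

```python
def item_download_url(identifier, filename):
    """Build the direct download URL for a file within an Internet Archive item.

    Items are browsed at https://archive.org/details/<identifier>, and their
    files are served from https://archive.org/download/<identifier>/<filename>.
    """
    return f"https://archive.org/download/{identifier}/{filename}"
```

For anything beyond a one-off download, the Internet Archive also maintains an official `internetarchive` Python package with a command-line client (`ia search`, `ia download`) that handles bulk retrieval for you.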