Great Quotation about the Value of History

From Brian Christian’s The Most Human Human: What Artificial Intelligence Teaches Us About Being Alive – a passage not terribly relevant to the subject of the book itself, but a great one on the value of history:

I was detachedly roaming the Internet, but there was nothing interesting happening in the news, nothing interesting happening on Facebook . . . I grew despondent, depressed – the world used to seem so interesting . . . But all of a sudden it dawned on me, as if the thought had just occurred to me, that much of what is interesting and amazing about the world did not happen in the past twenty-four hours. How had this fact slipped away from me? (Goethe: “He who cannot draw on three thousand years is living hand to mouth.”)
 

Anyways, I’ll probably see what my students in HIST 250: The Art and Craft of History think about this idea in the Fall.

Extracting Links from GeoCities and Throwing Them at the Wall

I have been exploring link structures as part of my play with web archives. My sense is that they’re a useful addition to the other tools that we use to analyze WARC files and other collections. The difficulty is, of course, extracting links from the various formats that we find web archives in.

What Can We Learn?

We can get a sense, at a glance, of what the structure of a website was; and, more importantly, with lots of links we can get a sense of how cohesive a community was. For example, with GeoCities, I am interested in seeing what percentage of links were internal – pointing to other GeoCities pages – versus external, pointing elsewhere on the web. This can help tackle the big question of whether GeoCities neighbourhoods could be understood as ‘communities’ or were just a place to park your website for a while in the late 1990s. As Rob Warren pointed out to me, network visualizations aren’t always the ideal approach here – sometimes a simple grep will do.
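In that spirit, here is a minimal sketch of the kind of counting I mean, done in Python rather than with grep proper. It is not code from this project, and it assumes you have already dumped the extracted links into a hypothetical links.txt, one URL per line.

# Minimal sketch: tally links that point back into GeoCities versus elsewhere.
internal = 0
external = 0
with open("links.txt") as f:          # hypothetical file of extracted links
    for line in f:
        url = line.strip().lower()
        if not url:
            continue
        if "geocities.com" in url:
            internal += 1
        else:
            external += 1

total = internal + external
if total:
    print(f"internal (to other GeoCities pages): {internal} ({100 * internal / total:.1f}%)")
    print(f"external (elsewhere on the web): {external} ({100 * external / total:.1f}%)")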

I create link visualizations in Mathematica by adapting the code found here. It follows links to a specified depth and renders them. I then extend that code using a discussion found on this StackOverflow page from 2011, which adds tooltips to edges and nodes (OK, I asked that question myself about three years ago).
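For anyone who would rather avoid Mathematica, a rough Python stand-in – not the adapted code linked above – looks something like this: it follows links out to a fixed depth and writes the graph to a GEXF file that Gephi can open for interactive inspection. The seed URL is a placeholder, and it needs the third-party networkx library.

import re
from urllib.request import urlopen

import networkx as nx   # third-party: pip install networkx

def crawl(url, depth, graph, seen):
    # Follow href links out from url to the given depth, adding an edge per link.
    if depth == 0 or url in seen:
        return
    seen.add(url)
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        return
    for link in re.findall(r'href="(https?://[^"#]+)"', html):
        graph.add_edge(url, link)
        crawl(link, depth - 1, graph, seen)

graph = nx.DiGraph()
crawl("http://www.example.com/", 2, graph, set())   # placeholder seed URL and depth
nx.write_gexf(graph, "links.gexf")                  # open in Gephi to inspect nodes and edges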

Image File Extensions in the Wide Web Scrape

Percent of the three leading image file formats in various TLDs. In China (.cn), for example, JPGs are almost 90% of image extensions; in .gov, they are 50%.

A bit of an aside post. I’ve been extracting images from WARC files so I can try some computer vision techniques on them. One prerequisite was knowing which file extensions to look for. While I began with the usual suspects (initially JP*Gs), after some crowdsourcing and being pointed towards Andy Jackson’s “Formats Over Time: Exploring UK Web History,” I came up with a good list of extensions to extract.

These are from the 2011 Wide Web Scrape web archive.

The following regular expression, '.+\.TLD/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)', where TLD represents the top-level domain I am searching for (e.g. CA or COM), helped find the extensions I was looking for. A variation of this script extracted the images themselves and numbered them, preserving duplicates and identically-named files.
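For illustration only – this is not the script referred to above – a short Python version of the tally might look like the following. It assumes a hypothetical urls.txt with one URL per line (pulled, say, from a CDX index of the scrape), and the TLD is hard-coded to .ca.

import re
from collections import Counter

# The same extension regex as above, with the TLD fixed to .ca.
pattern = re.compile(
    r'.+\.ca/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k'
    r'|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)',
    re.IGNORECASE,
)

counts = Counter()
with open("urls.txt") as f:           # hypothetical list of URLs, one per line
    for line in f:
        match = pattern.search(line)
        if match:
            counts[match.group(1).lower()] += 1

total = sum(counts.values())
for extension, n in counts.most_common():
    print(f"{extension}: {n} ({100 * n / total:.1f}%)")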

File formats in the .ca top-level domain. JPG dominates, followed by GIF and PNG.

The findings were interesting in terms of which top-level domains contained what. I’ll paste them below in case they’re useful: they can help you know what to look for, and they give a sense of the web c. 2011. The next step, of course, is to analyze these images.

Getting Gephi Running on OS X Mavericks

Network of Texan Correspondence, a Historian’s Macroscope Example.

I’ve been using Gephi as I playtest my way through one of our final drafts of the Historian’s Macroscope – which has brought me face to face with an old nemesis of mine.

Because this has been a constant pain in the you-know-what, I wanted to post how I got it working in case it helps a fellow DHer. I suspect it’s a conflict with my JDK, as I know folks run into trouble if they’ve been playing with MALLET and then switch over to Gephi. Anyways, Gephi has been constantly hanging (on the Data Laboratory screen), but only today did I actually force myself to fix it.

Open up /Applications/Gephi.app/Contents/Resources/gephi/etc/gephi.conf. On the command line, you can open it with vim, or you can navigate to it yourself and open it with your regular text editor.

[edited to add: Jonathan Goodwin pointed out that vim is kinda hard to learn. I would recommend either: (a) use Finder to go to the Gephi.app icon in your Applications folder, right-click, select 'Show Package Contents,' navigate from there, and open gephi.conf in TextEdit; or (b) use the command line to navigate to it – and if you don't use vim, just type open . and you'll get a Finder window there. Good luck!]

Under the line:

# default location of JDK/JRE, can be overridden by using --jdkhome switch

add or uncomment the jdkhome entry so that it reads:

jdkhome="/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home"

Hat-tip to the Gephi forums for this.

DH 2014 Slides and Talk: “Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource”

I gave a short paper at Digital Humanities 2014 last week. Held on the joint campus of the Université de Lausanne and the École polytechnique fédérale de Lausanne, it was a very good time. I might put up a quick post about some of my other observations, notably my visit to CERN, later on.

Here are the slides and text of my short talk. It’s a variation of the paper I also gave at the International Internet Preservation Consortium annual meeting in May. It concludes my conferencing for this summer – I have a workshop on computer vision and history during the Fall semester, and then the American Historical Association’s annual meeting in January. But until then: writing, research, and not flying in a plane.

Note: You should be able to reconstruct much of this approach from various blog posts on this site. See, for example: my WARC-Tools section, my clustering section, my NER posts, regular expression tinkering, and beyond (all my web archiving posts). Happy to chat, too. A downside of blogging so much is that many of you will have seen much of this before.

Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource

DH Abstract available here


Hi everybody. My name is Ian Milligan and I’m an assistant professor of Canadian and digital history at the University of Waterloo. In today’s short talk, I want to walk you through the approach I’ve been using to navigate large amounts of web information: it uses open-source tools, provides me with an ad-hoc finding aid, and helps me make sense of the data deluge that historians are soon going to have to grapple with.

Using a case study of data that the Internet Archive released in October 2012 – an entire scrape of the World Wide Web – I try to figure out what historians need to know to deal with this material.

Three Tools for the Web-Savvy Historian: Memento, Zotero, and WebCite

Over 200,000 citations or references to GeoCities sites exist in Google Books, and this is basically what you’ll get when you follow them. There’s no excuse for this anymore.

By Ian Milligan

“Sorry, the page you were looking for is no longer available.” In everyday web browsing, a frustration. In recreating or retracing the steps of a scholarly paper, it’s a potential nightmare. Luckily, three tools exist that users should be using to properly cite, store, and retrieve web information – before it’s too late and the material is gone!

Historians, writers, and users of the Web cite and draw on web-based material every day. Journal articles are replete with cited (and almost certainly uncited) digital material: websites, blogs, online newspapers, all pointing towards URLs. Many of these links will die. I don’t write this to be morbid, but to point out a fact. For example, if we search “http://geocities.com/” in Google Books we receive 247,000 results. Most of those are references to sites hosted on GeoCities that are now dead. If you follow those links, you’ll be told that “the GeoCities web site you were trying to reach is no longer available.”

What can we do? We can use three tools: Memento, to retrieve archived web pages from multiple sources; WebCite, to properly cite and store archived material; and Zotero, to create your own personal database of archived snapshots. Let’s look at each in turn.
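To give a flavour of how Memento works under the hood – this sketch is mine, not part of the original post – the snippet below asks a public Memento TimeGate for the archived copy of a page closest to a chosen date. The GeoCities path is made up, and it assumes the aggregator at timetravel.mementoweb.org is reachable.

from urllib.request import Request, urlopen

target = "http://geocities.com/Athens/Delphi/1234/"   # hypothetical dead page

# Memento datetime negotiation: ask the TimeGate for the snapshot closest to this date.
request = Request(
    "http://timetravel.mementoweb.org/timegate/" + target,
    headers={"Accept-Datetime": "Thu, 01 Jan 2009 00:00:00 GMT"},
)

response = urlopen(request, timeout=30)
print(response.geturl())   # the TimeGate redirects to the closest archived snapshot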

ACA 2014 Presentation: “The Great WARC Adventure: WARCs from Creation to Use”

Click through for our slide deck.

I’ve been having a great time here in Victoria, BC at the Association of Canadian Archivists’ annual meeting. As a historian, it’s been great learning from archivists: I’ve got a growing file full of journal articles to read, debates to brush up on, and lots of specific tidbits about how born-digital records fit within the archival context.

My contribution to the conference was a paper that I co-presented with Nick Ruest, the Digital Assets Librarian at York University. Our paper, “The Great WARC Adventure: WARCs from creation to use,” focused on the specific case study of the #freedaleaskey collection, a web archive made up of daily crawls of blogs, websites, and discussion boards pertaining to the Dale Askey libel case.

Our slides are available here, via the York University institutional repository. We’re planning to do some more work on this project, and the slides may not be completely clear without our commentary. The key take-home was that these kinds of web archives are pretty cheap and easy to make (this one was created using wget), and the daily frequency of the crawls can allow historians to do some fun longitudinal distant reading. Stay tuned for more on this over the next few months.
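For anyone curious what “cheap and easy” looks like in practice, here is a minimal sketch of a daily crawl script – not our actual invocation – that simply shells out to GNU wget, which has been able to write WARC files since version 1.14. The seed URL and output name are placeholders.

import datetime
import subprocess

seed = "http://example.org/blog/"             # placeholder seed URL
stamp = datetime.date.today().isoformat()     # e.g. 2014-07-18

subprocess.run([
    "wget",
    "--mirror",                               # recursive crawl with timestamping
    "--page-requisites",                      # also grab images, CSS, and so on
    "--warc-file=crawl-" + stamp,             # writes crawl-YYYY-MM-DD.warc.gz
    "--warc-cdx",                             # emit a CDX index alongside the WARC
    seed,
], check=True)

Run something like this once a day (via cron, say) and you end up with the kind of dated series of WARCs that makes longitudinal distant reading possible.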