Using Images to Gain Insight into Web Archives?

Do you like animated GIFs? Curious about what this is? Then read on… :)


Full confession: I rely too heavily on text. It was a common refrain in the talks I gave this summer on my work, which focused on the workflow I used for taking WARC files, implementing full-text search, and extracting meaning from it all. What could we do if we decided to extract images from WARC files (or other forms of web archives) and began to distantly read them? I think we could learn a few things.
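For the curious, the extraction step itself can be sketched in a few lines of Python. The version below uses the warcio library, which is one way to walk through the records of a WARC file and save out anything served as an image; it is a rough sketch rather than the exact tooling behind my own workflow, and the file names are placeholders.

# A rough sketch: pull image payloads out of a WARC file with warcio.
# 'example.warc.gz' and the 'images' output folder are placeholders.
import os
from warcio.archiveiterator import ArchiveIterator

os.makedirs('images', exist_ok=True)
with open('example.warc.gz', 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        if record.rec_type != 'response':
            continue
        content_type = record.http_headers.get_header('Content-Type')
        if content_type and content_type.startswith('image/'):
            extension = content_type.split('/')[1].split(';')[0].strip()
            path = os.path.join('images', '{}.{}'.format(i, extension))
            with open(path, 'wb') as out:
                out.write(record.content_stream().read())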

A montage of images from the GeoCities EnchantedForest neighbourhood. Click for a higher-resolution version.


This continues from my work discussed in my “Image File Extensions in the Wide Web Scrape” post which provides the basics of what I did to begin to play with images. It also touches on work I’ve done with creating montages of images in both GeoCities and the Wide Web Scrape.
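If you want to try building a montage like the one above, the basic idea is simply to thumbnail a folder of extracted images and paste them into a grid. Here is a rough Python sketch using the Pillow library; the directory name and tile sizes are placeholders rather than the settings I actually used.

# A rough sketch: tile a folder of extracted images into one montage.
# The 'images' folder, thumbnail size, and grid width are placeholders.
import glob
from PIL import Image

paths = sorted(glob.glob('images/*'))
thumb, cols = 64, 40                     # thumbnail edge length and grid width
rows = (len(paths) + cols - 1) // cols
montage = Image.new('RGB', (cols * thumb, rows * thumb), 'white')

for i, path in enumerate(paths):
    try:
        img = Image.open(path).convert('RGB')
    except OSError:                      # skip anything Pillow cannot read
        continue
    img.thumbnail((thumb, thumb))
    montage.paste(img, ((i % cols) * thumb, (i // cols) * thumb))

montage.save('montage.png')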

While creating montages was fun, it didn’t necessarily scale: beyond a certain point, you find yourself clicking and searching around. I like creating them, and I think they make wonderful images that are useful on several levels, but it’s hardly harnessing the power of the computer. So I’ve been increasingly playing with various image analysis tools to distantly read these images. Continue reading

Great Quotation about the Value of History

From Brian Christian’s The Most Human Human: What Artificial Intelligence Teaches Us About Being Alive – but actually not terribly relevant to the subject of the book at hand. On the value of history:

I was detachedly roaming the Internet, but there was nothing interesting happening in the news, nothing interesting happening on Facebook . . . I grew despondent, depressed – the world used to seem so interesting . . . But all of a sudden it dawned on me, as if the thought had just occurred to me, that much of what is interesting and amazing about the world did not happen in the past twenty-four hours. How had this fact slipped away from me? (Goethe: “He who cannot draw on three thousand years is living hand to mouth.”)
 

Anyways, I’ll probably see what my students in HIST 250: The Art and Craft of History think about this idea in the Fall.

Extracting Links from GeoCities and Throwing Them at the Wall

I have been exploring link structures as part of my play with web archives. My sense is that they’re a useful addition to the other tools that we use to analyze WARC files and other collections. The difficulty is, of course, extracting links from the various formats that we find web archives in.

What Can We Learn?

We can get a sense, at a glance, of what the structure of a website was; and, more importantly, with lots of links we can get a sense of how cohesive a community was. For example, with GeoCities, I am interested in seeing what percentage of links were inbound – to other GeoCities pages – and what percentage were external, in that they went elsewhere. This can help tackle the big question of whether GeoCities neighbourhoods could be understood as ‘communities’ or were just a place to park your website for a while in the late 1990s. As Rob Warren pointed out to me, network visualizations aren’t always the ideal approach here – sometimes a simple grep can do.
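To give a sense of how simple that check can be once the links are pulled out, here is a small Python sketch in the spirit of that simple grep. It assumes a tab-separated file of source and target URL pairs (the file name and format are placeholders for illustration) and reports what share of the targets stay within GeoCities.

# A small sketch: given extracted source/target link pairs, how many targets
# stay inside GeoCities? The tab-separated input file is a placeholder.
from urllib.parse import urlparse

internal = external = 0
with open('geocities-links.tsv') as links:
    for line in links:
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 2:
            continue
        source, target = parts
        if 'geocities.com' in urlparse(target).netloc.lower():
            internal += 1
        else:
            external += 1

total = internal + external
if total:
    print('{} internal links ({:.1f}%), {} external links'.format(
        internal, 100 * internal / total, external))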

I create link visualizations in Mathematica by adapting the code found here, which follows links to a specified depth and renders them. I then extend that code using a discussion found on this StackOverflow page from 2011, which adds tooltips to edges and nodes (OK, I asked that question like three years ago). Continue reading
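I won’t reproduce the Mathematica code here, but the underlying idea of following links out from a seed page to a set depth and building a graph from what you find can be sketched in other languages too. Here is an illustrative Python version using the requests and networkx libraries; it is a sketch of the approach rather than the code I actually use, and the seed URL is a placeholder.

# An illustrative sketch of a depth-limited link crawl, not the Mathematica code.
# Assumes the requests and networkx libraries; the seed URL is a placeholder.
import re
from urllib.parse import urljoin

import networkx as nx
import requests

def crawl(seed, depth=2):
    graph, frontier, seen = nx.DiGraph(), [seed], {seed}
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            for href in re.findall(r'href="([^"#]+)"', html):
                target = urljoin(url, href)
                graph.add_edge(url, target)
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return graph

graph = crawl('http://example.com/', depth=2)
print(graph.number_of_nodes(), 'nodes and', graph.number_of_edges(), 'edges')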

Image File Extensions in the Wide Web Scrape

Percentage of the three leading image file formats in various TLDs. For example, in China JPGs are almost 90% of image extensions, while in .gov they are 50%.


A bit of an aside post. I’ve been playing with extracting images from WARC files so that I can try out some computer vision techniques on them. One part of extracting images was knowing which file extensions to look for. While I began with the usual suspects (initially JP*Gs), after some crowdsourcing and being pointed towards Andy Jackson’s “Formats Over Time: Exploring UK Web History,” I came up with a good list of extensions to extract.

These are from the 2011 Wide Web Scrape web archive.

The following regular expression, '.+\.TLD/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)', where TLD represents the top-level domain I am searching for (e.g. CA or COM), helped find the extensions I was looking for. A variation of this script extracted images and numbered them, preserving duplicates and identically-named files.
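To make that concrete, here is roughly what the counting looks like in Python when run over a plain list of URLs pulled from the archive; the input file name and the choice of the .ca TLD are placeholders for illustration.

# A sketch of tallying image extensions for one TLD from a list of URLs.
# 'urls.txt' (one URL per line) and the 'ca' TLD are placeholders.
import re
from collections import Counter

tld = 'ca'
extensions = (r'\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k'
              r'|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm')
pattern = re.compile(r'.+\.' + tld + r'/.+(' + extensions + r')', re.IGNORECASE)

counts = Counter()
with open('urls.txt') as urls:
    for line in urls:
        match = pattern.match(line.strip())
        if match:
            counts[match.group(1).lower()] += 1

for extension, total in counts.most_common():
    print(extension, total)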

File formats in the .ca top-level domain. JPG dominates, followed by GIF and PNG.


The findings were interesting in terms of what each top-level domain contained. I’ll paste them below in case they’re useful: they can help you know what to look for, and they give a sense of the web c. 2011. The next step, of course, is to analyze these images. Continue reading

Getting Gephi Running on OS X Mavericks


Network of Texan Correspondence, a Historian’s Macroscope Example.

I’ve been using Gephi as I playtest my way through one of our final drafts of the Historian’s Macroscope – which has brought me face to face with an old nemesis.

Just because this has been a constant pain in the you-know-what, I wanted to post how I got it working in case it helps a fellow DHer. I suspect it’s a conflict with my JDK, as I know folks have trouble if they’ve been playing with MALLET and then switching over to Gephi. Anyways, it’s been constantly hanging (on the Data Laboratory screen), but only today did I actually force myself to fix it.

Open up /Applications/Gephi.app/Contents/Resources/gephi/etc/gephi.conf. On the command line, you can just open that with vim, or you can navigate to it yourself and open it up with your regular text editor.

[edited to add: Jonathan Goodwin pointed out that vim is kinda hard to learn - what I would recommend is either: (a) use Finder to go to the Gephi.app icon in your Applications folder, right click and select 'Show Package Contents,' navigate from there, and open up gephi.conf in TextEdit; or (b) use the command line to navigate to it, and again, if you don't use vim just type open . and you'll get a Finder window there. Good luck!]

Under the line:

# default location of JDK/JRE, can be overridden by using --jdkhome switch

replace the existing value with:

jdkhome="/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home"

Hat-tip to the Gephi forums for this.

DH 2014 Slides and Talk: “Clustering Search to Navigate: A Case Study of the Canadian World Wide Web as a Historical Resource”

I gave a short paper at Digital Humanities 2014 last week. Held on the joint campus of the Université de Lausanne and the École polytechnique fédérale de Lausanne, it was a very good time. I might put up a quick post about some of my other observations, notably my visit to CERN, later on.

Here are the slides and text of my short talk. It’s a variation of the paper I also gave at the International Internet Preservation Consortium annual meeting in May. It concludes my conferencing for this summer – I have a workshop on computer vision and history during the Fall semester, and then the American Historical Association’s annual meeting in January. But until then: writing, research, and not flying in a plane.

Note: You should be able to reconstruct much of this approach from various blog posts on this site. See, for example: my WARC-Tools section, my clustering section, my NER posts, regular expression tinkering, and beyond (all my web archiving posts). Happy to chat, too. A downside of blogging so much is that many of you will have seen much of this before.

Clustering Search to Navigate: A Case Study of the Canadian World Wide Web as a Historical Resource

DH Abstract available here


Hi everybody. My name is Ian Milligan and I’m an assistant professor of Canadian and digital history at the University of Waterloo. In today’s short talk, I want to walk you through the approach I’ve been using to navigate large amounts of web information: it uses open-source tools, provides me with an ad-hoc finding aid, and helps me make sense of the data deluge that historians are soon going to have to grapple with.

Using a case study of data that the Internet Archive released in October 2012, which is an entire scrape of the World Wide Web, I try to figure out what historians need to know to deal with this material. Continue reading

Three Tools for the Web-Savvy Historian: Memento, Zotero, and WebCite


Over 200,000 citations or references to these websites exist in Google Books, and this is basically what you’ll get when you follow them. There’s no excuse for this anymore.

By Ian Milligan

“Sorry, the page you were looking for is no longer available.” In everyday web browsing, a frustration. In recreating or retracing the steps of a scholarly paper, it’s a potential nightmare. Luckily, three tools exist that users should be using to properly cite, store, and retrieve web information – before it’s too late and the material is gone!

Historians, writers, and users of the Web cite and draw on web-based material every day. Journal articles are replete with cited (and almost certainly uncited) digital material: websites, blogs, online newspapers, all pointing towards URLs. Many of these links will die. I don’t write this to be morbid, but to point out a fact. For example, if we search “http://geocities.com/” in Google Books we receive 247,000 results. Most of those are references to sites hosted on GeoCities that are now dead. If you follow those links, you’ll get the error that the “GeoCities web site you were trying to reach is no longer available.”

What can we do? We can use three tools: Memento to retrieve archived web pages from multiple sources, WebCite to properly cite and store archived material, and Zotero to create your own personal database of archived snapshots. Let’s look at each of them in turn. Continue reading
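Under the hood, Memento is essentially HTTP content negotiation in time: you ask a TimeGate for a URL with an Accept-Datetime header and it points you to an archived snapshot close to that date. Here is a rough Python sketch of that exchange against the Internet Archive’s Wayback Machine, which speaks Memento; the endpoint and the date are only illustrative.

# A rough sketch of Memento-style datetime negotiation (RFC 7089).
# The Wayback Machine TimeGate URL and the requested date are illustrative.
import requests

target = 'http://geocities.com/'
timegate = 'http://web.archive.org/web/' + target
headers = {'Accept-Datetime': 'Thu, 01 Jan 2009 00:00:00 GMT'}

response = requests.get(timegate, headers=headers)
print('Snapshot served from:', response.url)
print('Memento-Datetime:', response.headers.get('Memento-Datetime'))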