Herrenhausen Big Data Podcast: Coding History on GeoCities

Last post (a three-day conference deserves three posts, right?) from my trip to Hannover, Germany, for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation and wanted to link to it here.

You can listen to the podcast here. Hopefully I am cogent enough!

It grew out of my lightning talk and poster, also available on my blog.

My thanks again to the VolkswagenStiftung for the generous travel grant to make my attendance possible. It was a wonderful conference.

Herrenhausen Big Data Lightning Talk: Finding Community in the Ruins of GeoCities

I was fortunate to receive a travel grant to present my research in a short, three-minute slot plus poster at the Herrenhäuser Konferenz: Big Data in a Transdisciplinary Perspective in Hannover, Germany. Here’s what I’ll be saying (pretty strictly) in my slot this afternoon. Some of it is designed to respond to the time format (if you scroll down you will see that there is an actual bell).

If you want to see the poster, please click here.

Big Data is coming to history. The advent of web archived material from 1996 onwards presents a challenge. In my work, I explore what tools, methods, and approaches historians need to adopt to study web archives.

GeoCities lets us test this. It will be one of the largest records of the lives of non-elite people ever assembled. The Old Bailey Online can rightfully describe its 197,000 trials, spanning 1674 to 1913, as the “largest body of texts detailing the lives of non-elite people ever published.” But GeoCities, drawing on the material we have from between 1996 and 2009, contains over thirty-eight million pages.

These are the records of everyday people who published on the Web, reaching audiences far bigger than previously imaginable. Continue reading

Using ImagePlot to Explore Web Archived Images

A low-resolution version of the EnchantedForest visualization. Read on for higher-resolution downloads.

ImagePlot, developed by Lev Manovich’s Software Studies Initiative, promises to help you “explore patterns in large image collections.” It doesn’t disappoint. In this short post, I want to demonstrate what we can learn by visualizing the 243,520 images of all formats that make up the child-focused EnchantedForest neighbourhood of the GeoCities web archive.

Setting it Up

Loading web archived images into ImagePlot (a set of macros for the open-source program ImageJ) requires an extra step, which applies to both the Wide Web Scrape and the GeoCities data. Images need to be 24-bit RGB to work. In my experience, unusual file formats broke the macros (e.g. an .ico file, or other junk that you inevitably get in a web archive), so I used ImageMagick to convert the files. Continue reading
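If you want to script that conversion step, here is a rough sketch of how it might look in Python with Pillow rather than the ImageMagick command line I actually used; the folder names are placeholders, not the paths from my workflow.

```python
# Minimal conversion sketch: Pillow stand-in for ImageMagick, assumed paths.
from pathlib import Path
from PIL import Image

src = Path("enchantedforest_images")  # assumed folder of harvested images
dst = Path("converted")               # assumed output folder for ImagePlot
dst.mkdir(exist_ok=True)

for path in src.iterdir():
    try:
        with Image.open(path) as im:
            # Force 24-bit RGB so the ImagePlot/ImageJ macros will accept the file.
            im.convert("RGB").save(dst / (path.stem + ".jpg"), "JPEG")
    except Exception:
        # .ico files and other junk common in web archives fail here; skip them.
        print(f"skipping unreadable file: {path.name}")
```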

Colour Analysis of Web Archives

These popular colours from the children-focused area of GeoCities would look great on a paint swath, right?

I have been tinkering with images in web archives – one idea was to pull out the colours and see how frequent they were in a given web archive. With some longitudinal data, we could see how colour use evolves. And by running enough analysis on various test corpora and web archives, we could begin to get a sense of which colours characterize which kinds of image collections: another method to ‘distantly’ read images.

This is easier said than done, of course! I tried a few things. First, I got average RGB values from each image using ImageMagick, as discussed in my previous post: this was potentially interesting, but I had difficulty extracting meaningful information from it (if you average enough RGB values, you end up comparing different shades of grey). It is, however, the least computationally intensive route, and I might return to it by binning the various colour values and then tallying them.
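To make the averaging-and-binning idea concrete, here is a rough Python sketch using Pillow and NumPy rather than ImageMagick; the bin width of 32 per channel and the input folder are illustrative assumptions.

```python
# Sketch of the "average colour per image, then bin and tally" approach.
from pathlib import Path
from collections import Counter
import numpy as np
from PIL import Image

def average_rgb(path):
    """Mean R, G, B of an image, as integers."""
    with Image.open(path) as im:
        pixels = np.asarray(im.convert("RGB"), dtype=np.float64)
    return tuple(int(round(c)) for c in pixels.reshape(-1, 3).mean(axis=0))

def bin_colour(rgb, size=32):
    """Snap an RGB triple to a coarse bin so near-identical averages tally together."""
    return tuple((c // size) * size for c in rgb)

tallies = Counter()
for path in Path("converted").glob("*.jpg"):  # assumed folder of converted images
    try:
        tallies[bin_colour(average_rgb(path))] += 1
    except Exception:
        continue  # skip anything unreadable

for colour, count in tallies.most_common(10):
    print(colour, count)
```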

The second was to turn to an interesting blog post from 2009, “Flag Analysis with Mathematica.” In it, Jon McLoone took all the flags of the world, calculated their most frequent colours, and used the results to play with fun questions like what the fashionable flag colours would be if he ever set up his own country. Some of this functionality has since been incorporated into Mathematica 10, but the blog post includes additional code that isn’t in the new, fancier integrated functions.
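As a rough Python-only analogue of that flag-analysis approach (I worked in Mathematica, so this is a sketch rather than my actual code), one could quantize each image to a small palette and tally the palette colours across the whole collection; the palette size of 8 and the input folder are arbitrary choices.

```python
# Sketch: tally the most frequent (quantized) colours across an image collection.
from pathlib import Path
from collections import Counter
from PIL import Image

def dominant_colours(path, palette_size=8):
    """Return the quantized palette colours of one image, weighted by pixel count."""
    with Image.open(path) as im:
        small = im.convert("RGB").resize((64, 64))        # shrink for speed
        quantized = small.quantize(colors=palette_size)   # median-cut palette
    palette = quantized.getpalette()
    counts = Counter()
    for count, index in quantized.getcolors():
        rgb = tuple(palette[index * 3: index * 3 + 3])
        counts[rgb] += count
    return counts

collection = Counter()
for path in Path("converted").glob("*.jpg"):  # assumed image folder
    try:
        collection.update(dominant_colours(path))
    except Exception:
        continue

print(collection.most_common(10))
```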

As I’ll note below, I’m not thrilled with what I’ve come up with. But we can’t always just blog about the incredible discoveries we make, right?  Continue reading

Extracting Links from GeoCities and Throwing Them at the Wall

I have been exploring link structures as part of my play with web archives. My sense is that they’re a useful addition to the other tools we use to analyze WARC files and other collections. The difficulty, of course, is extracting links from the various formats in which we find web archives.
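As one possible approach (not necessarily the pipeline I used), links can be pulled out of a WARC file in Python with the warcio and BeautifulSoup libraries; the file name below is just a placeholder.

```python
# Sketch: yield (source_url, target_url) pairs from HTML responses in a WARC file.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_links(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            source = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            for anchor in soup.find_all("a", href=True):
                yield source, anchor["href"]

# Example: print the first links from an (assumed) GeoCities WARC.
for src, dst in extract_links("geocities-sample.warc.gz"):
    print(src, "->", dst)
```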

What Can We Learn?

At a glance, we can get a sense of what the structure of a website was; and, more importantly, with lots of links we can get a sense of how cohesive a community was. For example, with GeoCities, I am interested in seeing what percentage of links were internal – pointing to other GeoCities pages – versus external, pointing elsewhere. This can help tackle the big question of whether GeoCities neighbourhoods can be understood as ‘communities,’ or were just a place to park your website for a while in the late 1990s. As Rob Warren pointed out to me, network visualizations aren’t always the ideal approach here – sometimes a simple grep can do.
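Here is a small sketch of that internal-versus-external tally, in the spirit of the “simple grep” point. It assumes links have already been extracted into (source, target) pairs, as in the WARC sketch above, and that anything resolving to a geocities.com host counts as internal; relative links resolve against their GeoCities source page and so count as internal too.

```python
# Sketch: what share of extracted links stays within GeoCities?
from urllib.parse import urljoin, urlparse

def internal_share(links):
    """Return the fraction of (source, target) links that stay within GeoCities."""
    internal = external = 0
    for source, target in links:
        absolute = urljoin(source, target)        # resolve relative hrefs
        host = urlparse(absolute).netloc.lower()
        if "geocities.com" in host:
            internal += 1
        else:
            external += 1
    total = internal + external
    return internal / total if total else 0.0

# e.g. print(f"{internal_share(links):.1%} of links stay inside GeoCities")
```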

I create link visualizations in Mathematica by adapting the code found here. It follows links to a specified depth and renders them. I then extend that code using a discussion on this StackOverflow page from 2011, which adds tooltips to edges and nodes (OK, I asked that question about three years ago). Continue reading
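For readers without Mathematica, the same extracted (source, target) pairs could be rendered with networkx and matplotlib as a rough stand-in; this is not the code I adapted, and the layout and styling choices are arbitrary.

```python
# Sketch: draw a directed graph of page-to-page links and save it to disk.
import networkx as nx
import matplotlib.pyplot as plt

def draw_link_graph(links, out_path="link_graph.png"):
    graph = nx.DiGraph()
    graph.add_edges_from(links)                  # links = iterable of (source, target)
    layout = nx.spring_layout(graph, seed=42)    # deterministic force-directed layout
    nx.draw(graph, layout, node_size=20, arrowsize=5, with_labels=False)
    plt.savefig(out_path, dpi=300)
    plt.close()
```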

Three Tools for the Web-Savvy Historian: Memento, Zotero, and WebCite

Over 200,000 citations or references to these websites exist in Google Books, and this is basically what you’ll get when you follow them. There’s no excuse for this anymore.

By Ian Milligan

“Sorry, the page you were looking for is no longer available.” In everyday web browsing, a frustration. In recreating or retracing the steps of a scholarly paper, a potential nightmare. Luckily, three tools exist that researchers should be using to properly cite, store, and retrieve web information – before it’s too late and the material is gone!

Historians, writers, and users of the Web cite and draw on web-based material every day. Journal articles are replete with cited (and almost certainly uncited) digital material: websites, blogs, online newspapers, all pointing towards URLs. Many of these links will die. I don’t write this to be morbid, but to point out a fact. For example, if we search “http://geocities.com/” in Google Books we receive 247,000 results. Most of those are references to sites hosted on GeoCities that are now dead. If you follow those links, you’ll get the error that the “GeoCities web site you were trying to reach is no longer available.”

What can we do? We can use three tools: Memento, to retrieve archived web pages from multiple sources; WebCite, to properly cite and store archived material; and Zotero, to build your own personal database of archived snapshots. Let’s look at each in turn. Continue reading
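To give a sense of what Memento does under the hood, here is a small Python sketch that asks a TimeGate for the archived copy of a URL closest to a given date, using the Accept-Datetime negotiation defined in RFC 7089. The aggregator endpoint shown is the public Time Travel service; treat it as an assumption and check its current documentation before relying on it.

```python
# Sketch: find the archived snapshot of a URL closest to a given date via Memento.
import requests

def closest_memento(url, when="Thu, 01 Jan 2009 00:00:00 GMT"):
    """Return the URL of the archived snapshot closest to `when`, if any."""
    timegate = "http://timetravel.mementoweb.org/timegate/" + url  # assumed endpoint
    response = requests.get(timegate, headers={"Accept-Datetime": when},
                            allow_redirects=True, timeout=30)
    return response.url if response.ok else None

print(closest_memento("http://geocities.com/EnchantedForest/"))
```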