I had the great pleasure to be a speaker at the Ethics and Archiving the Web conference at the New Museum in New York City. My own contribution to the conference was a piece on the “Ethics of Studying GeoCities.”

The livestream of the whole conference is available here.

Hi everybody, and thanks so much for coming to my talk today. What I want to do is discuss the "ethics of studying GeoCities," which to me gets at both the potential and the risks of doing a lot of this web archival research.

New Chapter: “Welcome to the Web: The Online Community of GeoCities during the early years of the World Wide Web”

First page of the article.

Well, I certainly won’t win any awards for “most concise chapter title,” but my latest publication, “Welcome to the Web: The Online Community of GeoCities during the early years of the World Wide Web,” is now available in the open-access book The Web as History. This book, edited by Niels Brügger and Ralph Schroeder, has been published by UCL Press. They’re an innovative, fully open-access university press. You can download the entire book as a PDF, or purchase paperback or hardback copies if you so desire.

Anyways, please do feel free to read the chapter if it strikes your fancy. Here's an excerpt from the introduction below the fold:

Finding Popular Images within a Web Archive: Exploring GeoCities

I’m back from the great annual Digital Humanities conference in Krakow (and a nice, two-week follow-up vacation), and have returned to the always growing warcbase platform. One of our research assistants, Youngbin Kim, has been working on some image extraction commands and I was looking forward to putting it to the test.

Finding popular images can be difficult. In the past, we have used filename-based frequency, or have used the actual images themselves (if hotlinked to each other), but that hasn’t been sufficient. Hotlinking was generally frowned upon in GeoCities, given bandwidth limitations (a hotlinking user was stealing bandwidth from other users). Filenames are also not always descriptive (e.g. 00015b.gif).

One way around this is to play with the unique hash of each image. In the past, I've used hashes when calculating the frequency of popular images in the GeoCities Archive Team torrent. The problem with my method was that it didn't really scale: we want to make sure everything works within a cluster. And now that we have a set of WARCs from the Internet Archive, let's try to see what we can do with them…
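To make the hashing idea concrete, here is a minimal single-machine sketch in Python. It is not the warcbase implementation: the function names are hypothetical, the choice of MD5 is an assumption (any content hash would do), and it assumes the image bytes have already been pulled out of the archive as (url, bytes) pairs.

```python
import hashlib
from collections import Counter


def image_digest(data: bytes) -> str:
    """Return the MD5 hex digest of an image's raw bytes (MD5 is an assumption)."""
    return hashlib.md5(data).hexdigest()


def most_common_images(images, top_n=10):
    """Count byte-identical images by content hash.

    `images` is an iterable of (url, raw_bytes) pairs (a hypothetical input
    shape). Returns (digest, count, first_url_seen) tuples, most frequent first.
    """
    counts = Counter()
    example_url = {}
    for url, data in images:
        digest = image_digest(data)
        counts[digest] += 1
        # Remember one URL per digest so the image can be looked at later.
        example_url.setdefault(digest, url)
    return [(d, n, example_url[d]) for d, n in counts.most_common(top_n)]
```

Because the count is keyed on content rather than filename, the same GIF saved as 00015b.gif on one page and under another name elsewhere is counted as one image.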

Exploring the GeoCities Web Archive with Warcbase & Spark: Links (or how we can use warcbase to find amazing sites to ask historical questions!)

Not just spaghetti and meatballs, but the starting point for research.

In my last post, we left off with scripts running to extract all URLs and a link diagram. They finished decently quickly – about three days on our rho server at York University, or about 30 minutes on our roaringly-fast cluster. Given that hopefully we will be running these only once or twice at first, even three days isn’t too bad.

This is a relatively technology-heavy post, but I think even relative newcomers to digital history might find this interesting as an entree into several tools that we might use to extract meaningful historical information from big datasets.

Exploring the GeoCities Web Archive with Warcbase & Spark: Getting Started

Nick Ruest and I had some great news a few weeks ago: a collection of GeoCities WARCs was on its way on a few hard drives. I’ve previously done quite a bit of work on the GeoCities torrent, but as we’ve been doing parallel development on warcbase while working with the torrent, it’s been difficult to have one set of tools talk to our earlier dataset. Once we have all the files in WARC format, as we do now, we can use warcbase to generate derivative datasets.

Everybody should win, in theory, as this helps research into GeoCities, research into warcbase, and research into web archival use more generally.

Step One: Ingesting the Data

Once the hard drives arrived, it was fun to watch the data populate our server as Nick supervised the time-consuming job of moving over 4TB of data from two hard drives onto our server at York University.

Herrenhausen Big Data Podcast: Coding History on GeoCities

Last post (a three-day conference deserves three posts, right?) from my trip to Hannover, Germany for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation, and wanted to link to it here.

You can listen to the podcast here. Hopefully I am cogent enough!

It grew out of my lightning talk and poster, also available on my blog.

My thanks again to the VolkswagenStiftung for the generous travel grant to make my attendance possible. It was a wonderful conference.

Herrenhausen Big Data Lightning Talk: Finding Community in the Ruins of GeoCities

I was fortunate to receive a travel grant to present my research in a short, three-minute slot plus poster at the Herrenhäuser Konferenz: Big Data in a Transdisciplinary Perspective in Hanover, Germany. Here’s what I’ll be saying (pretty strictly) in my slot this afternoon. Some of it is designed to respond to the time format (if you scroll down you will see that there is an actual bell). 

If you want to see the poster, please click here.


Big Data is coming to history. The advent of web archived material from 1996 onwards presents a challenge. In my work, I explore what tools, methods, and approaches historians need to adopt to study web archives.

GeoCities lets us test this. It will be one of the largest records of the lives of non-elite people ever. The Old Bailey Online can rightfully describe their 197,000 trials as the “largest body of texts detailing the lives of non-elite people ever published” between 1674 and 1913. But GeoCities, drawing on the material we have between 1996 and 2009, has over thirty-eight million pages.

These are the records of everyday people who published on the Web, reaching audiences far bigger than previously imaginable.

Extracting Links from GeoCities and Throwing Them at the Wall

I have been exploring link structures as part of my play with web archives. My sense is that they’re a useful addition to the other tools that we use to analyze WARC files and other collections. The difficulty is, of course, extracting links from the various formats that we find web archives in.

What Can We Learn?

We can get a sense, at a glance, of what the structure of the website was; and, more importantly, with lots of links we can get a sense of how cohesive a community was. For example, with GeoCities, I am interested in seeing what percentage of links were inbound – that is, to other GeoCities pages – or external, in that they went elsewhere. This can help tackle the big question of whether GeoCities neighbourhoods could be understood as ‘communities’ or were just a place to park your website for a while in the late 1990s. As Rob Warren pointed out to me, network visualizations aren’t always the ideal tool for this – sometimes a simple grep can do.
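As a rough illustration of that percentage, here is a small Python sketch. It assumes the links have already been extracted as absolute (source, target) URL pairs, and it treats any host at or under geocities.com as internal; the function names and that host rule are my assumptions for the example, not part of any existing tool.

```python
from urllib.parse import urlparse


def is_geocities(url: str) -> bool:
    """True if the URL's host is geocities.com or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == "geocities.com" or host.endswith(".geocities.com")


def internal_share(links) -> float:
    """Fraction of links on GeoCities pages that point to other GeoCities pages.

    `links` is an iterable of (source_url, target_url) pairs with absolute URLs.
    Relative URLs have no host and would be counted as external here.
    """
    internal = external = 0
    for source, target in links:
        if not is_geocities(source):
            continue  # only count links that start on a GeoCities page
        if is_geocities(target):
            internal += 1
        else:
            external += 1
    total = internal + external
    return internal / total if total else 0.0
```

A high share would suggest pages mostly pointing at each other; a low one would fit the "place to park your website" reading better.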

I create link visualizations in Mathematica by adapting the code found here. It follows links to a specified depth and renders them. I then adapt the code using a discussion found on this StackOverflow page from 2011 which adds tooltips to edges and nodes (OK, I asked that question like three years ago).
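For readers without Mathematica, the "follow links to a specified depth" step can be sketched in Python as a breadth-first walk. This is not the Mathematica code referred to above: it works on an in-memory dictionary of page to outgoing links (a hypothetical stand-in for links already extracted from an archive), and it only collects the edges, leaving the rendering to whatever graph tool you prefer.

```python
from collections import deque


def edges_to_depth(graph, start, max_depth):
    """Collect (source, target) edges reachable within max_depth hops of start.

    `graph` maps each page to a list of pages it links to.
    """
    seen = {start}
    queue = deque([(start, 0)])
    edges = []
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links out of pages at the depth limit
        for target in graph.get(page, []):
            edges.append((page, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges
```

The resulting edge list can be fed straight into a network library or exported for a tool such as Gephi.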

Testing Cohesiveness in GeoCities Neighbourhoods by Extracting and Plotting Locations

The ‘heartland’ of GeoCities?

This weekend, I went back to my old GeoCities archive to play around with the methods I experimented with in my last post on the Wide Web Scrape. One question that I’ve been curious about was whether GeoCities was a community (drawing on an old debate that raged in the 1990s and beyond about virtual communities), and how we could see that: in volunteer networks, neighbourhood volunteers, web rings, guestbooks, links to each other, etc. As with all of my blog posts, this is just me jotting a few things down (the lab notebook model), not a fully thought-out, peer-reviewed piece. But keep reading for the pictures and discussion. 🙂

GeoCities and Neighbourhoods: An Extremely Short Introduction

Welcome to the Neighbourhood! (December 20th, 1996) – click through for the Wayback Machine version of this page.

Before its 1999 acquisition by Yahoo!, GeoCities was arranged into a set of neighbourhoods. It was presented as a “cityscape,” with streets, street numbers, recognizable urban and geographical landmarks (one might live on a virtual Fifth Avenue or on a festive Bourbon Street), and this was strongly emphasized in press releases and user communications. The central metaphor that governed the admission of new users into GeoCities was that of homesteading. This was a conscious choice, in keeping with the spirit of the frontier so common during the early days of the Web (harkening to its communalist roots), and it captured the then-common heady expansionary rhetoric. Users would need new homes, and these new homes would be located in neighbourhoods. This sat well with the visions of GeoCities founders David Bohnett and John Rezner (who joined in August 1995), who saw “[n]eighbourhoods, and the people that live in them, [as providing] the foundation of community.” When selecting sites, users were presented with a list of various places where their site might belong. Those writing about “[e]ducation, literature, poetry, philosophy” would be encouraged to incorporate their site into Athens; political wonks into CapitolHill; small businesspeople or those working from home into Eureka, and beyond. Some neighbourhoods came with restrictions and guidance, such as the more protective and censored EnchantedForest for children. Others were much wider in scope, such as “Heartland,” focusing on “families, pets, hometown values.”

I could talk your ear off (that paragraph is a compression of several pages I've written) but you should get the broad picture by now.