I had the great pleasure to be a speaker at the Ethics and Archiving the Web conference at the New Museum in New York City. My own contribution to the conference was a piece on the “Ethics of Studying GeoCities.”
Hi everybody and thanks so much for coming to my talk today. What I want to do is discuss the “ethics of studying GeoCities,” which to me gets at both the potential and the risks of doing a lot of this web archival research.
I’m back from the great annual Digital Humanities conference in Krakow (and a nice, two-week follow-up vacation), and have returned to the ever-growing warcbase platform. One of our research assistants, Youngbin Kim, has been working on some image extraction commands, and I was looking forward to putting them to the test.
Finding popular images can be difficult. In the past, we have used filename-based frequency, or have used the actual images themselves (if hotlinked to each other), but that hasn’t been sufficient. Hotlinking was generally frowned upon in GeoCities, given bandwidth limitations (a hotlinking user was stealing bandwidth from other users). Filenames are also not always descriptive (e.g. 00015b.gif).
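One way around unhelpful filenames is to count images by their content rather than their name. As a minimal sketch (not the warcbase implementation; the filenames and bytes below are hypothetical), we can hash each image’s bytes so that identical files group together even when users saved them under different names:

```python
import hashlib
from collections import Counter

def popular_images(images):
    """Count images by an MD5 hash of their bytes, so identical files
    with different names collapse into one entry."""
    counts = Counter()
    examples = {}  # remember one filename per hash, for display
    for name, data in images:
        digest = hashlib.md5(data).hexdigest()
        counts[digest] += 1
        examples.setdefault(digest, name)
    return [(examples[d], n) for d, n in counts.most_common()]

# Hypothetical files: two copies of the same GIF under different names.
files = [
    ("00015b.gif", b"GIF89a...construction"),
    ("under_construction.gif", b"GIF89a...construction"),
    ("welcome.gif", b"GIF89a...welcome"),
]
print(popular_images(files)[0])  # the duplicated image ranks first: ('00015b.gif', 2)
```

The same idea scales to millions of images: hash once, count, and only pull out the top results for visual inspection.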
In my last post, we left off with scripts running to extract all URLs and a link diagram. They finished decently quickly – about three days on our rho server at York University, or about 30 minutes on our roaringly-fast cluster. Given that hopefully we will be running these only once or twice at first, even three days isn’t too bad.
This is a relatively technology-heavy post, but I think even relative newcomers to digital history might find this interesting as an entree into several tools that we might use to extract meaningful historical information from big datasets.
Nick Ruest and I had some great news a few weeks ago: a collection of GeoCities WARCs was on its way on a few hard drives. I’ve previously done quite a bit of work on the GeoCities torrent, but as we’ve been doing parallel development on warcbase while working with the torrent, it’s been difficult to have one set of tools talk to our earlier dataset. Once we have all the files in WARC format, as we do now, we can use warcbase to generate derivative datasets.
Everybody should win, in theory, as it helps research into GeoCities, into warcbase, and into web archival use more generally.
Step One: Ingesting the Data
Once the hard drives arrived, it was fun to watch the data populate our server as Nick supervised the time-consuming job of moving over 4TB of data from two hard drives onto our server at York University.
Last post (three day conference deserves three posts, right?) for my trip to Hannover, Germany for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation, and wanted to link to it here.
I was fortunate to receive a travel grant to present my research in a short, three-minute slot plus poster at the Herrenhäuser Konferenz: Big Data in a Transdisciplinary Perspective in Hanover, Germany. Here’s what I’ll be saying (pretty strictly) in my slot this afternoon. Some of it is designed to respond to the time format (if you scroll down you will see that there is an actual bell).
Big Data is coming to history. The advent of web archived material from 1996 onwards presents a challenge. In my work, I explore what tools, methods, and approaches historians need to adopt to study web archives.
GeoCities lets us test this. It is one of the largest records of the lives of non-elite people ever assembled. The Old Bailey Online can rightfully describe their 197,000 trials between 1674 and 1913 as the “largest body of texts detailing the lives of non-elite people ever published.” But GeoCities, drawing on the material we have between 1996 and 2009, has over thirty-eight million pages.
These are the records of everyday people who published on the Web, reaching audiences far bigger than previously imaginable.
I have been exploring link structures as part of my play with web archives. My sense is that they’re a useful addition to the other tools that we use to analyze WARC files and other collections. The difficulty is, of course, extracting links from the various formats that we find web archives in.
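For the extraction step itself, warcbase does the heavy lifting in our workflow, but the core operation can be sketched in a few lines of stand-alone Python: pull the `href` attributes out of an HTML payload (such as one read from a WARC record). The page content here is a made-up example:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in an HTML payload."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page body, as it might appear inside a WARC response record.
page = ('<html><body>'
        '<a href="http://www.geocities.com/Athens/1234/">a friend</a>'
        '<a href="http://example.com/">elsewhere</a>'
        '</body></html>')

parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```

Run over every HTML record in a collection, a list like this becomes the edge list for a link diagram.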
What Can We Learn?
We can get a sense, at a glance, of what the structure of the website was; and, more importantly, with lots of links we can get a sense of how cohesive a community was. For example, with GeoCities, I am interested in seeing what percentage of links were internal – pointing to other GeoCities pages – or external, in that they went elsewhere. This can help tackle the big question of whether GeoCities neighbourhoods could be understood as ‘communities’ or were just a place to park your website for a while in the late 1990s. As Rob Warren pointed out to me, network visualizations aren’t always the ideal approach here – sometimes a simple grep can do.
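In that grep-over-visualization spirit, the internal/external split reduces to a one-function calculation. A minimal sketch (the link list below is hypothetical, and `geocities.com` is taken as the marker of an internal link):

```python
from urllib.parse import urlparse

def internal_share(links, domain="geocities.com"):
    """Fraction of links whose host ends with the given domain,
    i.e. links that stay inside GeoCities."""
    if not links:
        return 0.0
    internal = sum(
        1 for link in links
        if urlparse(link).netloc.endswith(domain)
    )
    return internal / len(links)

# Hypothetical links harvested from one neighbourhood page.
links = [
    "http://www.geocities.com/Heartland/5678/",
    "http://www.geocities.com/Athens/1234/",
    "http://example.com/",
    "http://www.geocities.com/EnchantedForest/90/",
]
print(internal_share(links))  # 0.75
```

Computed per neighbourhood, a consistently high internal share would be one piece of evidence for reading the neighbourhoods as communities rather than parking lots.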
This weekend, I went back to my old GeoCities archive to play around with the methods I experimented with in my last post on the Wide Web Scrape. One question that I’ve been curious about was whether GeoCities was a community (drawing on an old debate that raged in the 1990s and beyond about virtual communities), and how we could see that: in volunteer networks, neighbourhood volunteers, web rings, guestbooks, links to each other, etc. As with all of my blog posts, this is just me jotting a few things down (the lab notebook model), not a fully thought-out peer reviewed piece. But keep reading for the pictures and discussion. 🙂
GeoCities and Neighbourhoods: An Extremely Short Introduction
Before its 1999 acquisition by Yahoo!, GeoCities was arranged into a set of neighbourhoods. It was presented as a “cityscape,” with streets, street numbers, and recognizable urban and geographical landmarks (one might live on a virtual Fifth Avenue or on a festive Bourbon Street), and this was strongly emphasized in press releases and user communications. The central metaphor that governed the admission of new users into GeoCities was that of homesteading. This was a conscious choice, in keeping with the spirit of the frontier so common during the early days of the Web (harkening to its communalist roots), as it captured the then-common heady expansionary rhetoric. Users would need new homes, and these new homes would be located in neighbourhoods. This fit with the visions of GeoCities founders David Bohnett and John Rezner (who joined in August 1995), who saw “[n]eighbourhoods, and the people that live in them, [as providing] the foundation of community.” When selecting sites, users were presented with a list of various places where their site might belong. Those writing about “[e]ducation, literature, poetry, philosophy” would be encouraged to incorporate their site into Athens; political wonks to CapitolHill; small businesspeople or those working from home to Eureka; and beyond. Some neighbourhoods came with restrictions and guidance, such as the more protective and censored EnchantedForest for children. Others were much wider in scope, such as Heartland, focusing on “families, pets, hometown values.”
I could talk your ear off (that paragraph is a compression of several pages I’ve written) but you should get the broad picture by now.