Finding Popular Images within a Web Archive: Exploring GeoCities

I’m back from the great annual Digital Humanities conference in Krakow (and a nice, two-week follow-up vacation), and have returned to the ever-growing warcbase platform. One of our research assistants, Youngbin Kim, has been working on some image extraction commands, and I was looking forward to putting them to the test.

Finding popular images can be difficult. In the past, we have used filename-based frequency, or have used the actual images themselves (if sites hotlinked to each other), but that hasn’t been sufficient. Hotlinking was generally frowned upon in GeoCities, given bandwidth limitations (a hotlinking user was stealing bandwidth from other users). Filenames are also not always descriptive (e.g., 00015b.gif).

One way around this is to work with the unique hash of each image: identical files produce identical hashes, no matter what they are named or where they are hosted. In the past, I’ve used hashes when calculating the frequency of popular images in the GeoCities Archive Team torrent. The problem with my method was that it didn’t really scale: we want to make sure everything works within a cluster. And now that we have a set of WARCs from the Internet Archive, let’s try to see what we can do with them…
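To make the hashing idea concrete, here is a minimal sketch in plain Scala (no Spark, no warcbase). It is not the warcbase implementation: the names ImageHashSketch, md5Hex and topImages are hypothetical, and the input is assumed to be an in-memory list of (URL, image bytes) pairs.

```scala
import java.security.MessageDigest

object ImageHashSketch {
  // Hex-encoded MD5 digest of a byte array (the raw bytes of one image).
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map(b => f"${b & 0xff}%02x").mkString

  // Group images by hash, count each group, and keep the n most frequent.
  // Input: (url, imageBytes) pairs. Output: (hash, count, first URL seen for that hash).
  def topImages(images: Seq[(String, Array[Byte])], n: Int): Seq[(String, Int, String)] =
    images
      .groupBy { case (_, bytes) => md5Hex(bytes) }
      .map { case (hash, group) => (hash, group.size, group.head._1) }
      .toSeq
      .sortBy { case (hash, count, _) => (-count, hash) } // most frequent first, ties by hash
      .take(n)
}
```

Because the count is keyed on the bytes rather than the filename, two copies of the same GIF saved under different names on different pages are counted together.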

Jimmy Lin suggested adding in a user-defined function to warcbase that would help us generate MD5 hashes to find the most popular images.

The result: the ExtractPopularImages UDF. Here’s an example of it in use:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader

// Load every GeoCities WARC and cache the records (sc is the SparkContext)
val r = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/*.warc.gz", sc).persist()
// Find the 2000 most popular images and write the results out as text
ExtractPopularImages(r, 2000, sc).saveAsTextFile("2000-Popular-Images-Geocities14")

This goes into the GeoCities web archive, calculates the hash of each image, and then finds the 2000 most popular images. The results look like:


To grab the images, we used some simple regexes to prepend an Internet Archive Wayback Machine URL, which we then slowly wgeted down (with wget --content-on-error -w 1 --limit-rate=50k -i 2000-Popular-Images-GeoCities-just-URLs.txt).
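As a rough sketch of that URL-rewriting step, the snippet below turns one result line into a Wayback Machine URL. Several things here are assumptions rather than what we actually ran: the "(count,url)" line format, the partial timestamp "2009", and the names WaybackUrlSketch and waybackUrl are all hypothetical.

```scala
object WaybackUrlSketch {
  // Assumed line format: "(count,url)", e.g. "(1234,http://www.geocities.com/x/y.gif)".
  private val ResultLine = """^\((\d+),(.+)\)$""".r

  // Prepend a Wayback Machine prefix to the URL found in one result line.
  // The default timestamp is an assumption; lines that do not match yield None.
  def waybackUrl(resultLine: String, timestamp: String = "2009"): Option[String] =
    resultLine.trim match {
      case ResultLine(_, url) => Some(s"https://web.archive.org/web/$timestamp/$url")
      case _                  => None
    }
}
```

Mapping this over every line of the results and writing out the URLs that come back gives a file that can be handed to wget -i.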

I then montaged – inspired by Nick Ruest’s other amazing montage work with web archives – the results with:

montage ./* ~/geocities-top-2000.png

The montage (you can download the 70MB PNG here):

[Image: Screen Shot 2016-08-03 at 12.46.33 PM]

At first glance, not too useful. But when we begin to zoom in, we can see a few things. The elements that sites most commonly borrowed from each other were navigational elements and common animated GIFs (spinning globes, books with pages flipping, smiling faces, Internet Explorer advertisements).

But when we move past those and into the content-specific ones, we can begin to see some themes emerge.


[Image: Screen Shot 2016-08-03 at 12.48.38 PM]

Reflections on the potential of GeoCities and the Web more generally:

[Image: Screen Shot 2016-08-03 at 12.49.29 PM]

Opposition to pornography:

[Image: Screen Shot 2016-08-03 at 12.47.04 PM]

Or even just frivolity:

[Image: Screen Shot 2016-08-03 at 12.48.48 PM]

What’s next?

I think we should keep building on this. It might be interesting to be able to generate a list of websites that contain a given image: for example, all the websites that contained a given anti-pornography badge, or that implored readers to support the troops. That might be another interesting way to explore web-archived content!
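One way this could work, sketched in plain Scala rather than as a warcbase UDF: build an index from image hash to the sites on which that image appears, then look up a given badge by its hash. Everything here is hypothetical (ImageSitesSketch, sitesByImage, sitesContaining), and it assumes we already have (site identifier, image bytes) pairs in memory.

```scala
import java.security.MessageDigest

object ImageSitesSketch {
  // Hex-encoded MD5 digest of the raw image bytes.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map(b => f"${b & 0xff}%02x").mkString

  // Input: (site identifier, image bytes) pairs.
  // Output: image hash -> sorted, de-duplicated list of sites containing that image.
  def sitesByImage(images: Seq[(String, Array[Byte])]): Map[String, Seq[String]] =
    images
      .groupBy { case (_, bytes) => md5Hex(bytes) }
      .map { case (hash, group) => hash -> group.map(_._1).distinct.sorted }

  // Look up every site that contains an exact copy of the given image.
  def sitesContaining(index: Map[String, Seq[String]], image: Array[Byte]): Seq[String] =
    index.getOrElse(md5Hex(image), Seq.empty)
}
```

Note that this only matches byte-identical copies; a resized or re-saved version of the same badge would hash differently and be missed.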
