I’m back from the great annual Digital Humanities conference in Krakow (and a nice, two-week follow-up vacation), and have returned to the ever-growing warcbase platform. One of our research assistants, Youngbin Kim, has been working on some image extraction commands and I was looking forward to putting them to the test.
Finding popular images can be difficult. In the past, we have used filename-based frequency, or have used the actual images themselves (if hotlinked to each other), but that hasn’t been sufficient. Hotlinking was generally frowned upon in GeoCities, given bandwidth limitations (a hotlinking user was stealing bandwidth from other users). Filenames are also not always descriptive (i.e.
An idea around this is to play with the unique hash of each image. In the past, I’ve used hashes when calculating the frequency of popular images in the GeoCities Archive Team torrent. The problem with my method was that it didn’t really scale: we want to make sure everything works within a cluster. And now that we have a set of WARCs from the Internet Archive, let’s try to see what we can do with them…
Jimmy Lin suggested adding in a user-defined function to warcbase that would help us generate MD5 hashes to find the most popular images.
The result: the ExtractPopularImages UDF. Here’s an example of it in use:
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader

val r = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/*.warc.gz", sc).persist()
ExtractPopularImages(r, 2000, sc).saveAsTextFile("2000-Popular-Images-Geocities14")
This goes into the GeoCities web archive, calculates the hash of each image, and then finds the 2000 most popular images. The results look like:
94493 http://i43.photobucket.com/albums/e378/fraydknot/Fraydknot%20Gallery/WebPics/Frayedknot_Gallery_m33i.jpg
22126 http://counter38.bravenet.com/counter.php?id=338391&usernum=3251843748
17853 http://www.fortunecity.com/millennium/rainbow/598/goodluck.jpg
17740 http://www.geocities.com/Heartland/Village/6232/gc_icon.gif
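To make the hashing idea concrete, here is a minimal sketch in plain Scala (no Spark, and not the actual warcbase implementation). It assumes the images are already in memory as (URL, bytes) pairs; the names PopularImagesSketch, md5Hex and topImages are hypothetical and only for illustration. Identical bytes produce the same MD5 digest, so byte-for-byte copies served from different URLs are counted together.

```scala
import java.security.MessageDigest

object PopularImagesSketch {
  // Hex-encoded MD5 digest of the raw image bytes.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5")
      .digest(bytes)
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // images: (url, bytes) pairs (an assumed input shape, not warcbase's).
  // Returns (count, first URL seen) for the n most frequent hashes.
  def topImages(images: Seq[(String, Array[Byte])], n: Int): Seq[(Int, String)] =
    images
      .groupBy { case (_, bytes) => md5Hex(bytes) }   // same bytes => same group
      .values
      .map(group => (group.size, group.head._1))      // keep one example URL per hash
      .toSeq
      .sortBy { case (count, url) => (-count, url) }  // most frequent first
      .take(n)
}
```

Calling PopularImagesSketch.topImages(images, 2000) on such a sequence would give count/URL pairs in the same shape as the results shown above. A cluster version would do the grouping and counting over an RDD rather than an in-memory Seq.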
To grab the images, we used some simple regexes to prepend an Internet Archive Wayback Machine URL, which we then slowly wgeted down (with wget --content-on-error -w 1 --limit-rate=50k -i 2000-Popular-Images-GeoCities-just-URLs.txt).
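The regex step could look something like the sketch below. It is an illustration, not the exact expressions we ran: it assumes each result line has the "count URL" shape shown above, and the timestamp is a parameter you would choose yourself (the Wayback Machine serves the capture closest to whatever timestamp appears in the path). The names WaybackUrlSketch and toWaybackUrl are hypothetical.

```scala
object WaybackUrlSketch {
  // Matches "<count> <url>", the shape of the result lines shown in this post.
  private val ResultLine = """^\s*(\d+)\s+(\S+)\s*$""".r

  // Drops the count and prepends a Wayback Machine prefix to the URL.
  // Returns None for lines that do not look like a result line.
  def toWaybackUrl(line: String, timestamp: String): Option[String] =
    line match {
      case ResultLine(_, url) => Some(s"http://web.archive.org/web/$timestamp/$url")
      case _                  => None
    }
}
```

Mapping this over the result lines and writing out the Some values, one per line, would give a URL list that wget can read with -i.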
I then montaged – inspired by Nick Ruest’s other amazing montage work with web archives – the results with:
montage ./* ~/geocities-top-2000.png
The montage (you can download the 70MB PNG here):
At first glance, not too useful. But when we begin to zoom in, we can see a few things. The most common elements borrowed from each other were navigational elements and common animated GIFs (spinning globes, books with pages flipping, smiling faces, Internet Explorer advertisements).
But when we move past those and into the content-specific ones, we can begin to see some themes emerge.
Reflections on the potential of GeoCities and the Web more generally:
Opposition to pornography:
Or even just frivolity:
I think we should keep building on this. It might be interesting to be able to generate a list of websites that contain a given image – e.g., all the websites that contained a given anti-pornography badge, or that implored readers to support the troops. That might be another interesting way to explore web-archived content!
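As a rough sketch of what that could look like (nothing like this exists in warcbase yet, as far as this post is concerned), suppose we already had (page URL, image hash) pairs for every image embedded on every page. That input shape, and the names ImageToPagesSketch and pagesByImage, are assumptions for illustration only. Grouping by hash gives a lookup from an image to the pages that carry it:

```scala
object ImageToPagesSketch {
  // embeds: (pageUrl, imageHash) pairs, one per image embedded on a page
  // (an assumed input shape). Returns hash -> sorted, de-duplicated page URLs.
  def pagesByImage(embeds: Seq[(String, String)]): Map[String, Seq[String]] =
    embeds
      .groupBy { case (_, hash) => hash }
      .map { case (hash, pairs) => hash -> pairs.map(_._1).distinct.sorted }
}
```

Looking up the hash of a particular badge in the resulting map would then return every page that embedded it.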