As followers of this blog know, one of my major research activities involves the exploration of the 80TB Wide Web Scrape, a complete scrape of the World Wide Web conducted in 2011 and subsequently released by the Internet Archive. Much of this to date has involved textual analysis: extracting keywords, running named entity recognition routines, topic modelling, clustering, setting up a search engine on it, etc. One blind spot of my approach, of course, has been that I am dealing primarily with text, whereas the Web is obviously a multimedia experience.
Inspired by Lev Manovich’s work on visualizing images [click here for the Google Doc that explains how to do what I do below], I wondered if we could learn something by extracting images from WARC files. Drawing on my CDX work, I took the WARC files connected to the highest overall percentage of .ca domain files and used unar to decompress them. The files I drew on were the ten WARC.GZ files from this collection, totalling 10GB compressed.
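For readers who don’t have unar handy, the decompression step can also be done with Python’s standard library; a minimal sketch (file paths are hypothetical, not the ones from this project):

```python
import gzip
import shutil

def decompress_warc_gz(src: str, dest: str) -> None:
    """Stream-decompress a .warc.gz file into a plain .warc file.

    Python's gzip module transparently handles the concatenated gzip
    members that record-compressed WARC files are typically made of.
    """
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

Streaming with `copyfileobj` keeps memory use flat even on multi-gigabyte archives.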
I then used Mathematica to go through the decompressed archives, look for JPGs (as a start; I’ll expand the file-type list later), and transform each image into a 250 x 250 pixel square. As there were 50,680 images, this was a bit lower resolution than I normally use, but it seemed a reasonable trade-off. Using Manovich’s documentation above, I then took these 50,680 images and created a montage of them. Each image was shrunk down even further so the file size would be manageable and so I wouldn’t have to worry about copyright when posting it here.
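The actual extraction was done in Mathematica; as a rough illustration of the idea, here is a Python sketch that pulls JPEG payloads out of a decompressed WARC’s raw bytes by scanning for the JPEG start/end markers. This is a heuristic (a proper implementation would parse WARC records and check their Content-Type headers, and the end-of-image marker can in principle appear inside image data), but it conveys the approach:

```python
def extract_jpegs(blob: bytes) -> list[bytes]:
    """Heuristically pull out byte ranges that look like JPEG files.

    JPEGs begin with the SOI marker FF D8 FF and end with the EOI
    marker FF D9; we scan for non-overlapping (start, end) pairs.
    """
    jpegs = []
    pos = 0
    while True:
        start = blob.find(b"\xff\xd8\xff", pos)
        if start == -1:
            break
        end = blob.find(b"\xff\xd9", start + 3)
        if end == -1:
            break
        jpegs.append(blob[start:end + 2])  # include the EOI marker
        pos = end + 2
    return jpegs
```

Each returned byte string can then be written to disk or handed to an image library for the 250 x 250 resizing step.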
Here’s what happened (low-resolution version):
You can zoom in, and when you get down to the level of the individual image you get this. Bear in mind that this is a decent-sized file, even at low resolution (if I develop this further I will probably double the resolution, which is doable). A side benefit is that the low resolution makes this kind of safe for work. It will not surprise people that there is a lot of pornography when you’re working with a random scrape of the Web, although as this collection disproportionately drew on the .ca domain, there was not that much. Still, it’s a firehose of images from the Web, and you have to be prepared for what you’ll find.
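A quick back-of-envelope calculation shows why the tiles had to be shrunk. Assuming a roughly square grid (my assumption; the post does not specify the layout), the montage dimensions work out as:

```python
import math

def montage_dimensions(n_images: int, tile_px: int) -> tuple[int, int, int]:
    """Side length in tiles, plus width and height in pixels, for a
    roughly square montage of n_images square tiles of tile_px each."""
    side = math.ceil(math.sqrt(n_images))
    return side, side * tile_px, side * tile_px

# At the full 250 px per tile, the 50,680-image montage would be a
# 226 x 226 grid, i.e. 56,500 x 56,500 pixels -- far too large to post.
```

Shrinking each tile to a few dozen pixels is what brings the full montage down to something like the 26MB JPG linked below.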
If you want the whole file, I’ve uploaded it here (26MB JPG). It will be very pixelated, but I think it strikes a balance between size and access. I have a higher resolution available on request (just leave a comment below).
Some initial thoughts. I came into this not knowing what I might find:
- Way, way, way more faces than I thought I would find. I have done similar montages before for the GeoCities.com archive, where many of the images are little graphics, icons, etc. By 2011, when this scrape was done, most of the images are of people. I would like to write a script to get a rough count, but that is my first impression. Also, a lot of white people.
- There is remarkable coherence in certain batches of images. Some of this comes from catalogues, of course, like hamburgers or other kinds of widgets, but you also see these stretches of white space, or yellow images, and so on.
- Photographs rather than other forms of images seem to dominate? Again, this is in contrast to GeoCities. Digital photography has had a real impact.
- It is sobering. In a way, this is just a blast of everyday activity on the World Wide Web, snapped by the Internet Archive. People seem happy: lots of smiling, activities, moments from the past, etc. It’s just a blast of humanity, and maybe it’s because it’s Friday, but I got pretty sentimental about it.
Anyways, I went into this not knowing what I would find, and these are just my initial, initial thoughts. More digging to come.