ImagePlot, developed by Lev Manovich’s Software Studies Initiative, promises to help you “explore patterns in large image collections.” It doesn’t disappoint. In this short post, I want to demonstrate what we can learn by visualizing the 243,520 images of all formats that make up the child-focused EnchantedForest neighbourhood of the GeoCities web archive.
Setting it Up
Loading web-archived images into ImagePlot (a set of macros that work with the open-source program ImageJ) requires an extra step, which works for both Wide Web Scrape and GeoCities data. Images need to be 24-bit RGB to work. In my experience, unusual file formats broke the macros (e.g. an ico file, or other junk that you do get in a web archive), so I used ImageMagick to convert the files. (more…)
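For readers who want a starting point for that conversion step, here is a minimal sketch in Python that builds (and optionally runs) an ImageMagick command. It is an assumption rather than the exact command I ran: the `convert` executable name (it is `magick` in ImageMagick 7), the `[0]` frame selector for multi-image files such as ico, and the choice of 24-bit PNG output via the `PNG24:` prefix are all illustrative, as are the function names.

```python
import subprocess


def rgb24_convert_command(src, dst_png, executable="convert"):
    """Build an ImageMagick command that writes `src` as a 24-bit RGB PNG.

    Assumptions (not taken from the original post):
    - "[0]" selects only the first frame, since ico and gif files
      can hold several images;
    - the "PNG24:" prefix asks ImageMagick for 8-bit-per-channel RGB output.
    """
    return [executable, f"{src}[0]", f"PNG24:{dst_png}"]


def convert_file(src, dst_png, executable="convert"):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(rgb24_convert_command(src, dst_png, executable), check=True)


# Hypothetical file names, shown only to illustrate the command that is built.
print(rgb24_convert_command("favicon.ico", "favicon.png"))
```

Keeping the command builder separate from the function that runs it makes it easy to log or skip the files that ImageMagick cannot read, which is most of the "junk" in a web archive.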
I have been tinkering with images in web archives – an idea that I had was to pull out the colours and see how frequent they were in a given web archive. With some longitudinal data, we could see how colour evolves. In any event, by running enough analysis on various test corpora and web archives, we could begin to get a sense of what colours characterize what kinds of image collections: another method to ‘distantly’ read images.
This is easier said than done, of course! I did a few things. First, I began by getting average RGB values from each image using ImageMagick, as discussed in my previous post. This is the least computationally intensive route, and it was potentially interesting, but I had difficulty extracting meaningful information from it (if you average enough RGB values, you end up comparing different shades of grey). I might return to this by binning the various colour values and then tallying them.
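To make the averaging problem and the binning idea concrete, here is a small sketch in plain Python. It is not the ImageMagick workflow from the previous post: the function names, the choice of four bins per channel, and the sample colour values are all assumptions made up for illustration.

```python
from collections import Counter


def average_rgb(pixels):
    """Mean of each channel over a sequence of (r, g, b) tuples."""
    pixels = list(pixels)
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def bin_colour(rgb, levels=4):
    """Snap each 0-255 channel value to one of `levels` coarse bins."""
    step = 256 // levels
    return tuple(min(int(c) // step, levels - 1) for c in rgb)


def tally_bins(colours, levels=4):
    """Count how many colours fall into each coarse bin."""
    return Counter(bin_colour(c, levels) for c in colours)


# Pure red, green and blue average out to a mid grey.
print(average_rgb([(255, 0, 0), (0, 255, 0), (0, 0, 255)]))

# Hypothetical per-image averages: binning keeps the two reds together
# instead of blending them with the green.
per_image = [(250, 10, 10), (240, 20, 5), (10, 250, 10)]
print(tally_bins(per_image).most_common(1))
```

The first call shows why averaging washes everything out, while the tally keeps distinct colour families apart at the cost of choosing a bin size.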
The second was to turn to an interesting blog post from 2009, “Flag Analysis with Mathematica.” In the post, Jon McLoone took all the flags of the world, calculated the most frequent colours, and used them to play with fun questions like ‘if I ever set up my own country, I know what are fashionable choices for flag colors.’ Some of this functionality has actually been incorporated, to a limited degree, into Mathematica 10, but the blog post has some additional functionality that’s not in the new fancy integrated code.
As I’ll note below, I’m not thrilled with what I’ve come up with. But we can’t always just blog about the incredible discoveries we make, right? (more…)
Full confession: I rely too heavily on text. It was a common proviso of the talks I gave this summer on my work, which focused on the workflow that I used for taking WARC files, implementing full-text search, and extracting meaning from it all. What could we do if we decided to extract images from WARC files (or other forms of web archives) and began to distantly read them? I think we could learn a few things.
While creating montages was fun, it didn’t necessarily scale: past a certain point, you find yourself clicking and searching around. I like creating them, think they make wonderful images and are useful on several levels, but it’s hardly harnessing the power of the computer. So I’ve been increasingly playing with various image analysis tools to distantly read. (more…)
A bit of an aside post. I’ve been playing with extracting images from WARC files so I can try to play with some computer vision techniques on them. One part of extracting images was that I needed to know which file extensions to look for. While I began with the usual suspects (initially JP*Gs), after some crowdsourcing and being pointed towards Andy Jackson’s “Formats Over Time: Exploring UK Web History” I came up with a good list of extensions to extract.
These are from the 2011 Wide Web Scrape web archive.
The following regular expression, '.+\.TLD/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)', where TLD represents the top-level domain I am searching for (e.g. CA or COM), helped me find the extensions that I was looking for. A variation of this script extracted images and numbered them, preserving duplicates and identically-named files.
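Here is a minimal sketch of how that expression might be applied in Python. The case-insensitive flag, the use of a full-string match, the helper name `image_url_pattern`, and the example URLs are assumptions for illustration, not the script I actually ran.

```python
import re

# The extensions from the expression above, in the same order.
IMAGE_EXTENSIONS = [
    "gif", "jpg", "tif", "jpeg", "tiff", "png", "jp2",
    "j2k", "bmp", "pict", "wmf", "emf", "ico", "xbm",
]


def image_url_pattern(tld):
    """Compile the expression for one top-level domain, e.g. "ca" or "com"."""
    extensions = "|".join(r"\." + ext for ext in IMAGE_EXTENSIONS)
    # Assumption: ignore case so that .GIF and .gif are both caught.
    return re.compile(r".+\." + re.escape(tld) + "/.+(" + extensions + ")",
                      re.IGNORECASE)


pattern = image_url_pattern("ca")

# Hypothetical URLs, used only to show which ones are kept.
urls = [
    "http://example.ca/images/logo.GIF",
    "http://example.ca/index.html",
    "http://example.com/images/logo.gif",
]
# fullmatch requires the URL to end with one of the extensions.
print([u for u in urls if pattern.fullmatch(u)])
```

Using a full-string match is a design choice: it avoids picking up URLs where an image extension only appears in the middle of the path.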
The findings were interesting in terms of which top-level domain contained what. I’ll paste my findings below in case they’re interesting. They can help you know what to look for, and give a sense of the web c. 2011. The next step is to analyze these images, of course. (more…)
In my last post, I walked people through my thoughts as I explored a large number of images from the Wide Web Scrape (using, as noted there, methods from Lev Manovich). In this post, I want to put up three images and think about how this method might help us as historians. Followers of my research might know that I am also playing around with the GeoCities web archive. GeoCities was arranged into neighbourhoods, from the child-focused EnchantedForest to the Heartland of family and faith or the car enthusiasts of MotorCity. Each neighbourhood was, in some ways, remarkably homogeneous.
Let’s take every JPG from ‘Athens’ (the area of GeoCities focused on teaching, philosophy, etc.) and see what we find. (more…)
I then used Mathematica to go through the decompressed archives, look for JPGs (as a start, I’ll expand the file type list later), and then transform each image into a 250 x 250 pixel square. As there were 50,680 images, this was a bit lower resolution than I normally use, but I felt that it was ideal. Using Manovich’s documentation above, I then took these 50,680 images and created a montage of them. Each image was shrunk down even further so that the file size would be manageable, and so that I wouldn’t have to worry about copyright when I posted it here. (more…)
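To give a sense of why the tiles had to shrink, here is a back-of-the-envelope sketch in Python of the grid a montage like this needs. It assumes a roughly square grid, which is my assumption here and not necessarily the layout ImageJ or Mathematica would choose, and the function name and the 20 pixel tile size are hypothetical.

```python
import math


def montage_grid(n_images, tile_px):
    """Return (columns, rows, width_px, height_px) for a near-square montage."""
    columns = math.ceil(math.sqrt(n_images))
    rows = math.ceil(n_images / columns)
    return columns, rows, columns * tile_px, rows * tile_px


# The 50,680 JPGs at the 250 x 250 size mentioned above.
print(montage_grid(50680, 250))

# The same images at a hypothetical 20 x 20 thumbnail size.
print(montage_grid(50680, 20))
```

At 250 pixels per tile the montage would be tens of thousands of pixels on each side, which is why each image gets shrunk again before the montage is built.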