Colour Analysis of Web Archives

These popular colours from the children-focused area of GeoCities would look great on a paint swatch, right?
I have been tinkering with images in web archives – one idea I had was to pull out the colours and see how frequent they were in a given web archive. With some longitudinal data, we could see how colour evolves. In any event, by running enough analysis on various test corpora and web archives, we could begin to get a sense of what colours characterize what kinds of image collections: another method to ‘distantly’ read images.

This is easier said than done, of course! I did a few things. First, I got average RGB values from each image using ImageMagick, as discussed in my previous post. This was potentially interesting, but I had difficulty extracting meaningful information from it (if you average enough RGB values, you end up comparing different shades of grey). This is the least computationally intensive route. I might return to it by binning the various colour values and then tallying them.
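Here is a minimal sketch of what that binning-and-tallying idea might look like in Python. It is not code from the original workflow: the function names, the choice of four levels per channel, and the sample values are all assumptions made for illustration. It also shows why plain averaging tends to wash everything out.

```python
from collections import Counter

def bin_colour(rgb, levels=4):
    """Snap an (r, g, b) tuple of 0-255 ints to the centre of its bin.

    `levels` is the number of bins per channel (an illustrative choice).
    """
    return tuple(((c * levels // 256) * 256 + 128) // levels for c in rgb)

def tally_bins(colours, levels=4):
    """Count how many colours fall into each bin."""
    return Counter(bin_colour(c, levels) for c in colours)

def mean_colour(colours):
    """Channel-wise integer mean, for comparison with the binned tally."""
    n = len(colours)
    return tuple(sum(c[i] for c in colours) // n for i in range(3))

# Hypothetical per-image average colours, invented for this example.
averages = [(250, 10, 10), (10, 250, 10), (10, 10, 250), (240, 20, 5)]

print(mean_colour(averages))                # one muddy in-between colour
print(tally_bins(averages).most_common(2))  # the two reds share a bin
```

The mean collapses four saturated colours into a single dull one, while the tally keeps the fact that red-ish images are the most common.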

The second was to turn to an interesting blog post from 2009, “Flag Analysis with Mathematica.” In the post, Jon McLoone took all the flags of the world, calculated the most frequent colours, and used them to play with fun questions like ‘if I ever set up my own country, I know what are fashionable choices for flag colors.’ It’s actually functionality that’s been incorporated to a limited degree into Mathematica 10, but the blog post has some additional functionality that’s not in the new fancy integrated code.

As I’ll note below, I’m not thrilled with what I’ve come up with. But we can’t always just blog about the incredible discoveries we make, right? 

So how did this work?

To cut down on the number of things that could go wrong, I created large montage images of all the images within a web archive, compressed them so that they were manageable, and then ran this program on the resulting montages. The only real adaptation to the code required a tweak to the colorArea function, which is now:


The other three functions, colorLabeledChart, combineColorAreas, and colorName, are the same as in the tutorial. I kept the tolerance value at 0.4 after tinkering with it quite a bit. It strikes me as a reasonable value to get sufficient granularity.
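To give a feel for what a tolerance value does, here is a rough Python sketch of merging similar colours. It is not the tutorial's Mathematica code: the greedy strategy, the use of Euclidean distance in a 0-1 RGB cube, and the name merge_similar are all assumptions made for illustration.

```python
import math

def merge_similar(tallies, tolerance=0.4):
    """Fold each colour into the first more frequent colour within `tolerance`.

    tallies: list of ((r, g, b), count) pairs with channels scaled to 0..1.
    Distance is plain Euclidean distance in RGB (an assumption, see above).
    """
    merged = []  # [colour, count] entries, most frequent colours first
    for colour, count in sorted(tallies, key=lambda t: -t[1]):
        for entry in merged:
            if math.dist(colour, entry[0]) <= tolerance:
                entry[1] += count  # close enough: absorb into that colour
                break
        else:
            merged.append([colour, count])  # far from everything kept so far
    return [(c, n) for c, n in merged]

# Hypothetical pixel tallies, invented for this example.
tallies = [
    ((1.0, 1.0, 1.0), 50),     # white
    ((0.95, 0.98, 0.96), 20),  # an off-white
    ((0.0, 0.0, 0.0), 30),     # black
    ((0.7, 0.13, 0.13), 10),   # a dark red
]

for colour, count in merge_similar(tallies, tolerance=0.4):
    print(colour, count)
```

With a tolerance of 0.4, the off-white is absorbed into white while the dark red stays separate. Raising the tolerance to 0.8 in this sketch would fold the dark red into black, which is the kind of lost granularity a lower value avoids.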

It was then as simple as running


Where file is the name of a montage. Results took roughly 30-45 minutes for the big top-level domains. Colour names come from Mathematica’s ColorData function, which leads to some fun colour names. If you’ve ever painted a house, though, just imagine – you could simply pick the ‘Web’ colour scheme!
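The naming step can be thought of as a nearest-neighbour lookup against a list of named colours. The sketch below is a Python stand-in, not Mathematica's ColorData: the small palette uses the standard CSS/HTML values for a few of the names that turn up in the results, and the function name nearest_name and the squared-RGB-distance measure are assumptions made for illustration.

```python
# A handful of standard CSS/HTML named colours as 0-255 RGB triples.
# Whether ColorData uses exactly these values is an assumption here.
NAMED = {
    "White": (255, 255, 255),
    "Snow": (255, 250, 250),
    "MintCream": (245, 255, 250),
    "Black": (0, 0, 0),
    "Gray": (128, 128, 128),
    "FireBrick": (178, 34, 34),
    "Indigo": (75, 0, 130),
    "Goldenrod": (218, 165, 32),
}

def nearest_name(rgb, palette=NAMED):
    """Return the palette name whose RGB value is closest to `rgb`."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, palette[name]))
    return min(palette, key=squared_distance)

# Hypothetical dominant colours, invented for this example.
for colour in [(246, 254, 249), (180, 30, 40), (120, 120, 125)]:
    print(colour, "->", nearest_name(colour))
```

This also hints at why an almost-white result can come back labelled MintCream or Snow rather than plain White: a tiny tint is enough to tip the nearest match.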

What did I find?

Some Findings

Let’s start with two top-level domains from the Wide Web Scrape. The first is .ca:

Frequent colours from .ca Top-Level Domain images.

And .com:

Frequent colours from .com Top-Level Domain images.

There are notable differences between the two, but the colours are relatively demure and, to some degree, earth-toned (says the guy with colour blindness, but bear with me). In .ca we have white (or MintCream), two shades of gray, black, firebrick red, blue, and so forth; in .com we have again two shades of gray, white (or Snow), Indigo (a sort of blue, I guess?), a brown, a light blue, and an orange.

Compare that to a sample of images from GeoCities (in this case the EnchantedForest neighbourhood), which are far more vibrant:

Image colours from the EnchantedForest.

Gone is the domination of greys; instead, we see lots of blues, some green, and far more colours evenly represented. We see similarly vibrant colours in other GeoCities neighbourhoods too. In this, they’re more similar – maybe – to another top-level domain, .cn:

Frequent colours from .cn Top-Level Domain images.

There, we see white, a bit of gray, black, but then goldenrod, blue, burnt sienna, and so forth. A bit more vibrant, perhaps, than the first two top-level domains I looked at.

So what?

On its own, none of this is blowing my mind. The differences are minute. But I think that, coupled with eyeball analysis of montaged images (an imperfect visualization, yes, but it lets me get a good bird’s-eye view of what’s going on), recurrent images, and a sense of ImageMagick metadata, we’re beginning to get some good data to think about the graphical content of a web archive. With some test corpora – e.g. a collection of randomized digital images (?) versus a collection of clipart (?) – perhaps we can begin to learn what to expect from colour profiles?

As usual, the blog is just me thinking out loud. The next step will be to begin to actually analyze all this data that’s now sitting on my system.
