Back to School (Teaching for Fall 2014)

I think even my colleagues are surprised when they’re reminded that I’m beginning my third year at the University of Waterloo. Last year was a fantastic one: great students, fun colleagues, and getting more involved in the life of the university (from sitting on doctoral and Master’s committees to attending fun events).

I’m teaching four courses this year, two in the Fall and two in the Winter. This term’s two are the second-year historical methodology course for our majors, minors, and lovers-of-history, and a fourth-year honours seminar on Canadian social movements. If you’re curious, feel free to click through on the syllabus thumbnails below to read them in full. As with every class, there are things that have been left out, but that’s the way of things!

My digital history course, which a few folks on Twitter have asked about, will be offered again in Winter 2015.

History 250 - The Art and Craft of History

History 403A - Canadian Honours Seminar

Using ImagePlot to Explore Web Archived Images

A low-resolution version of the EnchantedForest visualization. Read on for higher-resolution downloads.

ImagePlot, developed by Lev Manovich’s Software Studies Initiative, promises to help you “explore patterns in large image collections.” It doesn’t disappoint. In this short post, I want to demonstrate what we can learn by visualizing the 243,520 images of all formats that make up the child-focused EnchantedForest neighbourhood of the GeoCities web archive.

Setting it Up

Loading web archived images into ImagePlot (a set of macros that work with the open-source program ImageJ) requires an extra step, which works for both the Wide Web Scrape and GeoCities data. Images need to be 24-bit RGB to work. In my experience, odd file formats broke the macros (e.g. an ico file, or other junk that you do get in a web archive), so I used ImageMagick to convert the files.
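The conversion step could look something like the sketch below. It only builds an ImageMagick command line rather than running it, and the details are assumptions rather than the exact command used for this post: the helper name `build_convert_command`, the choice of PNG as the output format, and the example paths are all hypothetical. The `-type TrueColor` option and the `PNG24:` output prefix are standard ImageMagick ways of asking for 24-bit RGB output.

```python
from pathlib import Path


def build_convert_command(source, out_dir):
    """Build an ImageMagick command that rewrites one image as a 24-bit RGB PNG.

    Hypothetical helper: output naming and format are assumptions.
    """
    source = Path(source)
    # Keep the original file stem, but always write a .png into out_dir.
    target = Path(out_dir) / (source.stem + ".png")
    # "-type TrueColor" requests RGB; the "PNG24:" prefix forces 8 bits per channel.
    return ["convert", str(source), "-type", "TrueColor", "PNG24:" + str(target)]


# Example paths are made up for illustration.
command = build_convert_command("archive/images/logo.gif", "converted")
print(" ".join(command))
```

With ImageMagick installed, each command list can be handed to `subprocess.run(command, check=True)`. Files that ImageMagick cannot read will then raise an error, which is a handy way to spot the junk files.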

Colour Analysis of Web Archives

These popular colours from the children-focused area of GeoCities would look great on a paint swatch, right?

I have been tinkering with images in web archives. One idea I had was to pull out the colours and see how frequent they were in a given web archive. With some longitudinal data, we could see how colour evolves. In any event, by running enough analysis on various test corpora and web archives, we could begin to get a sense of what colours characterize what kinds of image collections: another method to ‘distantly’ read images.

This is easier said than done, of course! I did a few things. First, I began by getting average RGB values from each image using ImageMagick, as discussed in my previous post. This was potentially interesting, but I had difficulty extracting meaningful information from it (if you average enough RGB values, you end up comparing different shades of grey). This is the least computationally intensive route. I might return to it by binning the various colour values and then tallying them.
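The binning-and-tallying idea could be sketched roughly as follows. This is only an illustration under my own assumptions: the function names, the bin width of 32, and the sample values are hypothetical, and the per-image averages are assumed to already be available as (R, G, B) tuples.

```python
from collections import Counter


def bin_colour(rgb, bin_size=32):
    """Snap each 0-255 channel down to the start of its bin."""
    return tuple((channel // bin_size) * bin_size for channel in rgb)


def tally_colours(average_colours, bin_size=32):
    """Count how many images fall into each coarse colour bin."""
    return Counter(bin_colour(rgb, bin_size) for rgb in average_colours)


# Made-up average RGB values standing in for four images.
sample_averages = [(250, 10, 12), (245, 20, 5), (12, 200, 30), (130, 128, 127)]
tally = tally_colours(sample_averages)

for colour, count in tally.most_common():
    print(colour, count)
```

A wider bin groups more near-identical shades together, at the cost of blurring distinctions between colours that a viewer would see as different.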

The second was to turn to an interesting blog post from 2009, “Flag Analysis with Mathematica.” In the post, Jon McLoone took all the flags of the world, calculated the most frequent colours, and used the results to play with fun questions like ‘if I ever set up my own country, I know what are fashionable choices for flag colors.’ This functionality has actually been incorporated to a limited degree in Mathematica 10, but the blog post has some additional functionality that’s not in the new fancy integrated code.

As I’ll note below, I’m not thrilled with what I’ve come up with. But we can’t always just blog about the incredible discoveries we make, right?

Using Images to Gain Insight into Web Archives?

Do you like animated GIFs? Curious about what this is? Then read on. :)

Full confession: I rely too heavily on text. It was a common proviso of the talks I gave this summer on my work, which focused on the workflow that I used for taking WARC files, implementing full-text search, and extracting meaning from it all. What could we do if we decided to extract images from WARC files (or other forms of web archives) and began to distantly read them? I think we could learn a few things.

A montage of images from the GeoCities EnchantedForest neighbourhood. Click for a higher-resolution version.

This continues from the work discussed in my “Image File Extensions in the Wide Web Scrape” post, which provides the basics of what I did to begin to play with images. It also touches on work I’ve done with creating montages of images in both GeoCities and the Wide Web Scrape.

While creating montages was fun, it didn’t necessarily scale: past a certain point, you find yourself clicking and searching around. I like creating them, think they make wonderful images and are useful on several levels, but it’s hardly harnessing the power of the computer. So I’ve been increasingly playing with various image analysis tools to distantly read them.

Great Quotation about the Value of History

From Brian Christian’s The Most Human Human: What Artificial Intelligence Teaches Us About Being Alive – but actually not terribly relevant to the subject of the book at hand. On the value of history:

I was detachedly roaming the Internet, but there was nothing interesting happening in the news, nothing interesting happening on Facebook . . . I grew despondent, depressed – the world used to seem so interesting . . . But all of a sudden it dawned on me, as if the thought had just occurred to me, that much of what is interesting and amazing about the world did not happen in the past twenty-four hours. How had this fact slipped away from me? (Goethe: “He who cannot draw on three thousand years is living hand to mouth.”)
 

Anyways, I’ll probably see what my students in HIST 250: The Art and Craft of History think about this idea in the Fall.

Extracting Links from GeoCities and Throwing Them at the Wall

I have been exploring link structures as part of my play with web archives. My sense is that they’re a useful addition to the other tools that we use to analyze WARC files and other collections. The difficulty is, of course, extracting links from the various formats that we find web archives in.

What Can We Learn?

We can get a sense, at a glance, of what the structure of a website was; and, more importantly, with lots of links we can get a sense of how cohesive a community was. For example, with GeoCities, I am interested in seeing what percentage of links were inbound (pointing to other GeoCities pages) versus external (going elsewhere). This can help tackle the big question of whether GeoCities neighbourhoods could be understood as ‘communities’ or were just a place to park your website for a while in the late 1990s. As Rob Warren pointed out to me, network visualizations aren’t always the ideal tool for this: sometimes a simple grep can do.
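In the spirit of the simple-grep approach, the inbound-versus-external percentage could be sketched like this. The function names, the `geocities.com` default, and the sample links are my own assumptions for illustration, and the sketch expects absolute URLs: relative links would need to be resolved against their page first, or they will be counted as external.

```python
from urllib.parse import urlparse


def is_internal(url, home_domain="geocities.com"):
    """True when the link's host is the home domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == home_domain or host.endswith("." + home_domain)


def internal_share(links, home_domain="geocities.com"):
    """Percentage of links that stay inside the home domain."""
    if not links:
        return 0.0
    internal = sum(1 for link in links if is_internal(link, home_domain))
    return 100.0 * internal / len(links)


# Made-up links standing in for those extracted from one neighbourhood.
sample_links = [
    "http://www.geocities.com/EnchantedForest/1234/",
    "http://geocities.com/Heartland/5678/index.html",
    "http://www.yahoo.com/",
    "http://www.example.org/page.html",
]
print(internal_share(sample_links))
```

Comparing this percentage across neighbourhoods would be one rough way to put a number on how inward-looking each one was.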

I create link visualizations in Mathematica by adapting the code found here. It follows links to a specified depth and renders them. I then adapt the code using a discussion found on this StackOverflow page from 2011, which adds tooltips to edges and nodes (OK, I asked that question like three years ago).

Image File Extensions in the Wide Web Scrape

Percent of the three leading file formats in various TLDs, e.g. in China, JPGs are almost 90% of image extensions; in .gov they are 50%.

A bit of an aside post. I’ve been playing with extracting images from WARC files so I can try to play with some computer vision techniques on them. One part of extracting images was that I needed to know which file extensions to look for. While I began with the usual suspects (initially JP*Gs), after some crowdsourcing and being pointed towards Andy Jackson’s “Formats Over Time: Exploring UK Web History” I came up with a good list of extensions to extract.

These are from the 2011 Wide Web Scrape web archive.

The following regular expression, '.+\.TLD/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)', where TLD represents the top-level domain I am searching for (e.g. CA or COM), helped find the extensions that I am looking for. A variation of this script extracted images and numbered them, preserving duplicates and identically-named files.
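For anyone wanting to try the same pattern, here is a minimal sketch of how it could be applied to a list of URLs. It is not the script I used: the function names and sample URLs are hypothetical, and matching case-insensitively (so that CA and ca, or .GIF and .gif, are treated alike) is an assumption on my part.

```python
import re
from collections import Counter

# The extension alternation from the regular expression above.
IMAGE_EXTENSIONS = (
    r"(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k"
    r"|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)"
)


def image_pattern(tld):
    """Substitute a top-level domain into the pattern (case-insensitive: an assumption)."""
    return re.compile(r".+\." + re.escape(tld) + r"/.+" + IMAGE_EXTENSIONS, re.IGNORECASE)


def tally_extensions(urls, tld):
    """Count image extensions among URLs that belong to the given TLD."""
    pattern = image_pattern(tld)
    counts = Counter()
    for url in urls:
        match = pattern.search(url)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Made-up URLs for illustration.
sample_urls = [
    "http://www.example.ca/images/logo.GIF",
    "http://www.example.ca/photos/cat.jpg",
    "http://www.example.ca/photos/dog.jpeg",
    "http://www.example.ca/index.html",
    "http://www.example.com/banner.png",
]
ca_counts = tally_extensions(sample_urls, "ca")
print(ca_counts)
```

Because the pattern is searched for rather than anchored to the end of the URL, an address like `photo.jpg.html` would also count as a `.jpg`; adding a `$` to the end of the pattern would tighten that up.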

File formats in the .ca top-level domain. JPG dominates, followed by GIF and PNG.

The findings were interesting in terms of which top-level domains contained which formats. I’ll paste my findings below in case they’re interesting. They can help you know what to look for, and give a sense of the web c. 2011. The next step is to analyze these images, of course.