224,977 URLs with 254,702 hyperlinks between them (from a random sample of 2008 CommonCrawl data).

With the help of RA extraordinaire Jeremy Wiebe, I’ve been playing with Jimmy Lin’s warcbase tool. It’s:

… an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Jeremy’s been working on documenting it for OS X, which I’m hoping to share soon, but it’s a neat suite of tools for playing with web-archived files. While right now everything works with the older, now-deprecated ARC file format (they note the irony given the name), Jeremy has been able to add support for WARC ingestion as well.
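To give a rough sense of what WARC ingestion involves, here is a minimal Python sketch using the warcio library rather than warcbase itself; the filename is hypothetical:

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over a (hypothetical) gzipped WARC file and pull out the
# URL and body of each captured HTTP response.
with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body))
```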

What can we do with it?

A low-resolution version of the EnchantedForest visualization. Read on for higher-resolution downloads.

ImagePlot, developed by Lev Manovich’s Software Studies Initiative, promises to help you “explore patterns in large image collections.” It doesn’t disappoint. In this short post, I want to demonstrate what we can learn by visualizing the 243,520 images of all formats that make up the child-focused EnchantedForest neighbourhood of the GeoCities web archive.

Setting it Up

Loading web-archived images into ImagePlot (a set of macros that work with the open-source program ImageJ) requires an extra step, and the same process works for both the Wide Web Scrape and the GeoCities data. Images need to be 24-bit RGB to work. My experience was that weird file formats broke the macros (e.g. an ico file, or other junk that you do get in a web archive), so I used ImageMagick to convert the files.
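I used ImageMagick, but as a sketch of the same conversion step, here is the equivalent in Python with Pillow (directory names are hypothetical):

```python
import os
from PIL import Image

SRC = "images_raw"   # hypothetical input directory
DST = "images_rgb"   # hypothetical output directory

os.makedirs(DST, exist_ok=True)
for name in os.listdir(SRC):
    try:
        with Image.open(os.path.join(SRC, name)) as im:
            # Normalize everything (palettized GIFs, CMYK JPEGs, .ico
            # files) to 24-bit RGB, which is what the macros expect.
            out = os.path.splitext(name)[0] + ".jpg"
            im.convert("RGB").save(os.path.join(DST, out), "JPEG")
    except (OSError, ValueError):
        # Truncated or junk files are common in web archives; skip them.
        continue
```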

These popular colours from the children-focused area of GeoCities would look great on a paint swatch, right?
I have been tinkering with images in web archives. One idea I had was to pull out the colours and see how frequent they were in a given web archive. With some longitudinal data, we could see how colour evolves. And by running enough analysis on various test corpora and web archives, we could begin to get a sense of which colours characterize which kinds of image collections: another method to ‘distantly’ read images.

This is easier said than done, of course! I did a few things. First, I began by getting average RGB values from each image using ImageMagick, as discussed in my previous post. This was potentially interesting, but I had difficulty extracting meaningful information from it (average enough RGB values and you end up comparing different shades of grey). It is, however, the least computationally intensive route, and I might return to it by binning the various colour values and then tallying them, as sketched below.
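Here is a minimal sketch of that average-then-bin idea, using Pillow instead of ImageMagick (the bin size and helper names are my own):

```python
from collections import Counter
from PIL import Image, ImageStat

def average_rgb(path):
    """Mean R, G, B values across all pixels of one image."""
    with Image.open(path) as im:
        r, g, b = ImageStat.Stat(im.convert("RGB")).mean
    return (r, g, b)

def bin_colour(rgb, bin_size=32):
    """Snap a colour onto a coarse grid so near-identical shades
    tally together rather than washing out toward grey."""
    return tuple(int(c) // bin_size * bin_size for c in rgb)

# Hypothetical usage over a list of image paths:
# tally = Counter(bin_colour(average_rgb(p)) for p in paths)
# print(tally.most_common(10))
```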

The second was to turn to an interesting blog post from 2009, “Flag Analysis with Mathematica.” In the post, Jon McLoone took all the flags of the world, calculated the most frequent colours, and used them to play with fun questions like ‘if I ever set up my own country, I know what are fashionable choices for flag colors.’ It’s actually functionality that’s been incorporated, to a limited degree, into Mathematica 10, but the blog post has some additional functionality that isn’t in the new fancy integrated code.
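As a rough Python analogue of that most-frequent-colours step (not McLoone’s Mathematica code), one could tally exact pixel colours per image:

```python
from PIL import Image

def dominant_colours(path, top=3):
    """Most frequent pixel colours in one image, flag-analysis style."""
    with Image.open(path) as im:
        rgb = im.convert("RGB")
        # maxcolors must cover every pixel, or getcolors() returns None.
        counts = rgb.getcolors(maxcolors=rgb.width * rgb.height)
    return sorted(counts, reverse=True)[:top]  # [(count, (r, g, b)), ...]
```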

As I’ll note below, I’m not thrilled with what I’ve come up with. But we can’t always just blog about the incredible discoveries we make, right?

In my last post, I took all of the CDX files for the 80TB Wide Crawl and counted how often the .ca domain appeared in each file. Once I had all that data, the next step was to sort it (code) so I could pick out the most relevant files.

Each entry in this database was:

{FILE NUMBER, CDX, FREQUENCY}

Sorting by frequency generated a list of crawls that had the most .ca data. There was a discrepancy with the data provided by the Internet Archive: I identified 8,885,682 .ca domains, while the Internet Archive identified 8,512,275. After double-checking for duplicates that may have crept in, I haven’t found any. Still, those numbers are close enough for comfort at this stage.
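A minimal sketch of that count-then-sort step, assuming the standard CDX layout in which the first whitespace-separated field is the SURT-ordered URL key (so a .ca page looks like “ca,example)/index.html”); paths are hypothetical:

```python
import glob
from collections import Counter

tallies = Counter()
for path in glob.glob("cdx/*.cdx"):      # hypothetical location
    with open(path, errors="replace") as f:
        for line in f:
            if line.startswith(" CDX"):  # skip the per-file format header
                continue
            key = line.split(" ", 1)[0]
            if key.startswith("ca,"):    # SURT keys for .ca hosts
                tallies[path] += 1

# The {FILE, FREQUENCY} sort, descending, to surface the richest files:
for path, n in tallies.most_common(10):
    print(f"{n}\t{path}")
```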

It was an interesting window onto how the .ca domains were distributed throughout the Wide Crawl. The mean repository had 1,012.5 .ca sites, and the median was 732. There isn’t a magic bullet for extracting the “.ca Internet” from these files, of course, but we can find some case-study files that we can hope contain the largest amount of Canadian content.