Exploring 50,000 Images from the Wide Web Scrape, Initial Thoughts

As followers of this blog know, one of my major research activities involves the exploration of the 80TB Wide Web Scrape, a complete scrape of the World Wide Web conducted in 2011 and subsequently released by the Internet Archive. Much of this work to date has involved textual analysis: extracting keywords, running named entity recognition routines, topic modelling, clustering, setting up a search engine on it, etc. One blind spot in my approach has been, of course, that I have been dealing primarily with text, whereas the Web is obviously a multimedia experience.

Inspired by Lev Manovich’s work on visualizing images [click here for the Google Doc that explains how to do what I do below], I wondered whether we could learn something by extracting images from WARC files. Drawing on my CDX work, I took the WARC files with the highest overall percentage of .ca domain content and quickly used unar to decompress them. The files that I drew on were the ten WARC.GZ files from this collection, totalling 10GB compressed.

I then used Mathematica to go through the decompressed archives, look for JPGs (as a start; I’ll expand the file type list later), and transform each image into a 250 x 250 pixel square. As there were 50,680 images, this is a bit lower resolution than I normally use, but it felt like a reasonable compromise. Using Manovich’s documentation above, I then took these 50,680 images and created a montage of them. Each image was shrunk down even further so the file size would be manageable and so I wouldn’t have to worry about copyright when I posted it here.
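
For the curious, here is a minimal Wolfram Language sketch of that pipeline. It assumes the JPGs have already been pulled out of the decompressed WARC records into an ordinary directory; the path, the 225-image row width, and the white filler tile are my placeholders rather than anything prescribed by Manovich’s documentation:

    (* Gather the JPGs and shrink each one down to a 250 x 250 pixel square,
       forcing RGB so that grayscale images don't trip up the assembly step. *)
    files = FileNames["*.jpg", "path/to/extracted/images", Infinity];
    thumbs = ColorConvert[ImageResize[Import[#], {250, 250}], "RGB"] & /@ files;

    (* Pad the final row with a white tile so the grid is rectangular, then
       assemble the montage (roughly 225 images per row) and export it. *)
    filler = ConstantImage[White, {250, 250}];
    montage = ImageAssemble[Partition[thumbs, 225, 225, 1, filler]];
    Export["montage.jpg", montage]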

Here’s what happened (low-resolution version):

A low resolution version of the 50k file montage.

You can zoom in, and when you get down to the level of the individual image you get this. Bear in mind that this is a decent-sized file, even with the low resolution (if I develop this further, I will probably double the resolution, which is doable). A side benefit is that the low resolution makes this kind of safe for work. It will not surprise people that there is a lot of pornography when you’re working with a random scrape of the web, although as this collection disproportionately drew on the .ca domain, there was not that much. Still, it’s a firehose of images from the Web, and one has to be mindful of what they might find.

Zoomed in down to the level of the image

If you want the whole file, I’ve uploaded it here (26MB JPG). It will be very pixelated, but I think it strikes a balance between size and access. I have a higher-resolution version available on request (just leave a comment below).

Some initial thoughts. I came into this not knowing what I might find:

  • Way, way, way more faces than I thought I would find. I have done similar montages before for the GeoCities.com archive, and there many of the images were small graphics, icons, and the like. By 2011, when this scrape was done, most of the images are of people. I would like to get a script going to put a rough number on this (see the sketch after this list), but that is my first-glance impression. Also, a lot of white people.
  • There is remarkable coherence in certain batches of images. Some of this comes from catalogues, of course, like hamburgers or other sorts of widgets, but you also see these stretches of white space or yellow images, and so on.
  • Photographs rather than other forms of images seem to dominate? Again, this is in contrast to GeoCities. Digital photography has had a real impact.
  • It is sobering. In a way, this is just a blast of everyday activity on the World Wide Web, snapped by the Internet Archive. People seem happy: lots of smiling, activities, things that happened in the past, etc. It’s just a blast of humanity, and maybe it’s because it’s Friday, but I got pretty sentimental about this.
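
If you want to play along at home, a sketch along these lines would give a rough count of how many images contain at least one detected face. The path is again a placeholder, and this leans on the built-in FindFaces detector, so the numbers would be fuzzy at best:

    (* Count how many of the extracted images contain at least one detected face. *)
    files = FileNames["*.jpg", "path/to/extracted/images", Infinity];
    faceCounts = Length[FindFaces[Import[#]]] & /@ files;
    withFaces = Count[faceCounts, n_ /; n > 0];
    N[withFaces/Length[files]]  (* rough fraction of images showing people *)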

Anyways, I went into this not knowing what I would find, and these are just my initial, initial thoughts. More digging is to be had.

5 thoughts on “Exploring 50,000 Images from the Wide Web Scrape, Initial Thoughts”

  1. Thanks Justin – my big fear, after being scared by some folks, is actually possible copyright concerns, as I’m sure there are images owned by the big stock photo companies, etc. up there. I need to sit down and get some real advice. But if it’s a go, I appreciate this suggestion!

  2. Hello Ian, my name is Edson Barbosa and I am currently enrolled at the University of Franca, Brazil, in an IT graduate program (high-performance distributed data storage systems), and I would like to submit one of my research projects to the Emerging Leaders in the Americas Program (ELAP 2014) so I can apply to an Information Retrieval graduate program at UWaterloo.
    I would like your help as a Canadian supervisor for this study, since my work presents an analysis and a project for developing software to meet the needs of data mining in the online retail sector, i.e., collecting information on other companies’ products and assembling it into a solid database based on a real national retail scenario.
    Data are collected directly from the online retail stores (web scraping) and then transported to a local high-performance distributed data storage system designed to support applications requiring maximum performance, scalability, and reliability (and can then be used by governments, such as Brazilian city halls’ purchasing services, interested in data mining). Today I am migrating this project from Java to Python and Hypertable (big data) databases, and I need help understanding some factors and analyzing this data to create a solution that makes a difference.
    I created the first prototype in May 2010 while doing my undergraduate degree in Software Engineering at the Federal University of Lavras, Brazil. Now, as I am doing my second degree (a Master’s program) here in Brazil, I would like to continue this study in Canada and take this initiative to more countries, improving the software design and making it accessible to other governments.
    Is it possible to do an online meeting with you (Skype/Google Plus)? (I would be glad to explain all of my research and to have you as my supervisor.) Fifteen minutes of your time will be more than enough to explain. I have the support of my university here in Brazil, and also of my course coordinator, who is going to provide me with all the documents needed to go to Canada on the ELAP program.
    Links:
    ELAP 2014 program:
    http://www.scholarships-bourses.gc.ca/scholarships-bourses/can/institutions/elap-pfla.aspx?lang=eng
    My first prototype and study:
    https://code.google.com/p/botejo/
    UFLA – Federal University of Lavras:
    https://www.ufla.br/
    Unifran – University of Franca:
    http://www.unifran.br/
    Canada Application Form:
    https://w01.scholarships-bourses.gc.ca/form-formulaire/leadership.aspx?lang=eng&pid=6
    My Brazilian Government Curriculum:
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4469946T3
    Skype: edsonlb
    Google Plus/Email: edsonlb [@gmail.com]
    Thank you so much for your attention!
    Edson Lopes Barbosa
