As followers of this blog know, one of my major research activities involves the exploration of the 80TB Wide Web Scrape, a complete scrape of the World Wide Web conducted in 2011 and subsequently released by the Internet Archive. Much of this to date has involved textual analysis: extracting keywords, running named entity recognition routines, topic modelling, clustering, setting up a search engine on it, etc. One myopia of my approach has been, of course, that I am dealing primarily with text whereas the Web is obviously a multimedia experience.
Inspired by Lev Manovich’s work on visualizing images [click here for the Google Doc that explains how to do what I do below], I wondered if we could learn something by extracting images from WARC files. I took the WARC files connected to the highest overall percentage of .ca domain files, drawing on my CDX work, and quickly used unar to decompress them. The files that I drew on were the ten WARC.GZ files from this collection, totalling 10GB compressed or
I then used Mathematica to go through the decompressed archives, look for JPGs (as a start, I’ll expand the file type list later), and then transform each image into a 250 x 250 pixel square. As there were 50,680 images, this was a bit lower resolution than I normally use, but I felt that this was ideal. Using Manovich’s documentation above, I then took these 50,680 images and created a montage of them. Each image was shrunk down even further so that the file size would be manageable, and so that I wouldn’t have to worry about copyright when I posted it here.
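For anyone who wants to try the bookkeeping side of this without Mathematica, here is a rough Python sketch of two of the steps above: collecting the JPGs from a directory of decompressed files, and working out how big a roughly square montage grid has to be. It is not the code I actually ran. The function names (`find_jpegs`, `montage_grid`, `tile_origin`, `canvas_size`) are made up for illustration, the square-ish grid is my assumption about layout, and the actual resizing and pasting of pixels is left to whatever image tool you prefer.

```python
import math
from pathlib import Path

# Assumption: only these suffixes count as JPGs (matched case-insensitively).
JPEG_SUFFIXES = {".jpg", ".jpeg"}


def find_jpegs(root):
    """Return a sorted list of JPEG file paths anywhere under root."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in JPEG_SUFFIXES
    )


def montage_grid(n_images):
    """Return (columns, rows) for a roughly square grid holding n_images tiles."""
    if n_images <= 0:
        return (0, 0)
    columns = math.isqrt(n_images - 1) + 1  # integer ceil(sqrt(n_images))
    rows = -(-n_images // columns)          # integer ceil(n_images / columns)
    return (columns, rows)


def tile_origin(index, columns, tile_px):
    """Return the (x, y) top-left pixel of the tile at position index."""
    row, col = divmod(index, columns)
    return (col * tile_px, row * tile_px)


def canvas_size(n_images, tile_px):
    """Return the (width, height) in pixels of the full montage canvas."""
    columns, rows = montage_grid(n_images)
    return (columns * tile_px, rows * tile_px)
```

With 50,680 images, `montage_grid` gives 226 columns by 225 rows. At the full 250 pixels per tile, `canvas_size(50680, 250)` works out to 56,500 x 56,250 pixels, which is why each tile has to be shrunk a good deal further before the montage becomes a file anyone can open.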
Here’s what happened (low-resolution version):

You can zoom in, and when you get down to the level of the individual image you get this. Bear in mind that this is a decent-sized file, even with the low resolution (if I develop this further I will probably double the resolution, which is doable). A side benefit is that the low resolution makes this kind of safe for work. It will not surprise people that there is a lot of pornography when you’re working with a random scrape of the web, although as this collection disproportionately drew on the .ca domain, there was not that much. Still, it’s a firehose of images from the Web and one has to worry about what one will find.

If you want the whole file, I’ve uploaded it here (26MB JPG). It will be very pixelated, but I think it strikes a balance between size and access. I have a higher resolution available on request (just leave a comment below).
Some initial thoughts. I came into this not knowing what I might find:
- Way, way, way more faces than I thought I would find. I have done similar montages before for the GeoCities.com archive, and there many of the images were little graphics, images, etc. By 2011, when this scrape was done, most of the images are of people. I would like to get a script going to get a fuzzy number for this, but that is my impression at first glance. Also, a lot of white people.
- There is remarkable coherence in certain batches of images. Some of this is from catalogues of course, like hamburgers or other forms of widgets, but you see these stretches of white space or yellow images, or whatever.
- Photographs rather than other forms of images seem to dominate? Again, this is in contrast to GeoCities. Digital photography has had a real impact.
- It is sobering. In a way, this is just a blast of everyday activity on the World Wide Web, snapped by the Internet Archive. People seem happy: lots of smiling, activities, things that happened in the past, etc. It’s just a blast of humanity, and maybe because it’s Friday, but I got pretty sentimental about this.
Anyways, I went into this not knowing what I would find, and these are just my initial, initial thoughts. More digging is to be had.
If file size is the obstacle to hosting the higher-resolution version here, you can just make an archive.org account and upload it to the Community Media area.
Thanks Justin – my big fear is actually, after being scared by some folks, possible copyright concerns as I’m sure there are images owned by the big stock photo companies, etc. up there. I need to sit down and get some real advice. But if it’s a go, I appreciate this suggestion!
Hello Ian, my name is Edson Barbosa and I am currently
enrolled at the University of Franca – Brazil in an IT graduate program
(High performance distributed data storage system), and would like to
submit one of my research projects for the Emerging Leaders in the
Americas Program (ELAP 2014) so I can apply to an Information Retrieval graduate program at (UWaterloo).
I would like your help as Canadian Supervisor in
this study since my work presents an analysis and project for the
development of a software that will meet the needs of data mining in
the online retail sector, i.e., collecting information on products of
other companies and putting it together in a solid database from a real
national retailer scenario.
Data are collected directly from the online retail stores (web scraping), and then
transported to a local high performance distributed data storage
system designed to support applications requiring maximum performance,
scalability, and reliability. (and then can be used by governments
like Brazilian City Hall's Purchasing Services interested in data
mining). Today I'm migrating this project from Java to Python and
Hypertable (Big Data) databases and need help understanding some
factors & analyzing this data to create a solution that makes the
difference.
I created the first prototype in May 2010 when I was doing my
graduation in Software Engineering at Federal University of Lavras,
Brazil. And now, as I am doing my second graduation (Master) program
here in Brazil, I would like to continue this study in Canada, and be
able to take this initiative to more countries, improving software
design and making it accessible to other governments.
Is it possible to do an online meeting with you (Skype/Google Plus)?
(I will be glad to explain all of my research and have you as my
supervisor.) 15 min of your time will be more than enough to explain.
I have the support of my University here in Brazil, and also my course
coordinator, who is going to provide me with all the documents needed
to go to Canada on the ELAP program.
Links:
ELAP 2014 program:
http://www.scholarships-bourses.gc.ca/scholarships-bourses/can/institutions/elap-pfla.aspx?lang=eng
My first prototype and study:
https://code.google.com/p/botejo/
UFLA – Federal University of Lavras:
https://www.ufla.br/
Unifran – University of Franca:
http://www.unifran.br/
Canada Application Form:
https://w01.scholarships-bourses.gc.ca/form-formulaire/leadership.aspx?lang=eng&pid=6
My Brazilian Government Curriculum:
http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4469946T3
Skype: edsonlb
Google Plus/Email: edsonlb [@gmail.com]
Thank you so much for your attention!
Edson Lopes Barbosa
Hi Edson,
Thanks for your comment. I’ll send an e-mail to your gmail account shortly,
Ian