Finding .ca domains in the 80TB Wide Crawl

[Image: command-line output showing successful completion of a wget download: 2d 1h 43m 44s; 111690 files, 349GB]

I’m working in public again, as I begin to play with the 80TB World Wide Web Crawl that the Internet Archive made available to celebrate its attainment of the 10 Petabyte mark. I’m hoping to make this one of my primary research projects over the next two years and have some pretty decent support to make this happen. If you’re reading this and you think ‘hey, I have an idea,’ please share via e-mail or in the comments below.

What I wanted to do was find clusters of Canadian websites. As the Internet Archive notes in its analytics, there are 8,512,275 distinct URLs with the top-level .ca domain. Sounds like a lot, and it is, but that’s a tiny subset (about 0.37%) of the overall collection of 2,273,840,159. So the trick will be to pick out a few WARCs to focus on that are more manageable for our resources here.

Luckily, we can find those domains via the index (CDX) files provided for the crawl. Right now, I’m crunching away (code on GitHub).

What I’ve done so far:

(1) Downloaded all the CDX files for the entire crawl: This was a big step, but luckily the Internet Archive supports downloading in bulk with wget. And wget even has a Programming Historian 2 lesson!

This was a big download: a little over two days, 111,690 files, and 349GB (run over the weekend in my office, pretty much on autopilot although I popped in a few times to make sure ITS hadn’t pulled the plug). All plain text, records of all the domains.
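After a multi-day download like this, it can be worth confirming that the file count and total size on disk match what wget reported. Here is a minimal Python sketch of that check. The function name `summarize_downloads` is made up for illustration, and the file pattern is an assumption borrowed from the Mathematica `FileNames` pattern further down.

```python
from pathlib import Path


def summarize_downloads(root, pattern="*crawl*cdx.gz"):
    """Return (file_count, total_bytes) for index files found anywhere under root.

    The default pattern is an assumption: it mirrors the "*crawl*cdx.gz"
    pattern used in the Mathematica code in this post.
    """
    count = 0
    total_bytes = 0
    for path in Path(root).rglob(pattern):
        if path.is_file():
            count += 1
            total_bytes += path.stat().st_size
    return count, total_bytes
```

Calling `summarize_downloads("/path/to/download")` (a placeholder path) returns a pair that can be compared against the 111,690 files and 349GB in the wget summary.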

Note to the authorities: this is why I ask for equipment and storage money. 🙂

(2) Started finding the .ca domains: The CDX files are well arranged. Mathematica handles the unzipping and exploration of the files well, and each URL gets an entry like this:

{za,co,ireneguesthouse)/robots.txt 20110815151754 text/html 404 AL3MTEPBMLJDH2EWR6LJ5XKO3HSY6XSQ - - 952 47026507 WIDE-20110815150840-crawl415/WIDE-20110815150840-00471.warc.gz}

Note the first field there: it is the domain broken down in reverse order (here za,co,ireneguesthouse, for ireneguesthouse.co.za). So if it’s a .ca the first entry will be ca, if it’s a .com the first entry will be com, and so forth.
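To make the field layout concrete, here is a small Python sketch that pulls the top-level domain, the host, and the WARC file name out of the sample entry above. It assumes the fields are separated by whitespace and that the WARC file name is the last field, as in that sample. `parse_cdx_line` is a made-up helper name, not part of any CDX library.

```python
def parse_cdx_line(line):
    """Split one CDX line into its TLD, host, path, and WARC file name."""
    fields = line.split()
    url_key = fields[0]                    # "za,co,ireneguesthouse)/robots.txt"
    host_part, _, path = url_key.partition(")")
    labels = host_part.split(",")          # ["za", "co", "ireneguesthouse"]
    return {
        "tld": labels[0],                  # the first label is the top-level domain
        "host": ".".join(reversed(labels)),
        "path": path,
        "warc": fields[-1],                # assumed: the last field names the WARC file
    }


sample = (
    "za,co,ireneguesthouse)/robots.txt 20110815151754 text/html 404 "
    "AL3MTEPBMLJDH2EWR6LJ5XKO3HSY6XSQ - - 952 47026507 "
    "WIDE-20110815150840-crawl415/WIDE-20110815150840-00471.warc.gz"
)
record = parse_cdx_line(sample)
print(record["tld"], record["host"], record["path"])
```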

So right now I have a program crunching away to tell me how many entries in these 111,690 CDX files begin with a “ca” entry. It’s not going to be perfect: it’ll miss a lot of Canadian sites (anything hosted on a .com or .org, say), but it’s a start.

This is the code:

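(* Collect every gzipped CDX index file in the subdirectories of the current directory *)
files = FileNames["*crawl*cdx.gz", {"*"}, Infinity];
output = "/users/ianmilligan1/desktop/cdx-results.txt";
str = OpenWrite[output];
x = 1; (* progress counter: number of the file being processed *)

Do[
 vals = Import[file, "Data"];
 (* Drop the header row, tally the first field, and keep only the "ca" tally *)
 res = Cases[Tally[Drop[vals, 1][[All, 1]]], {"ca", _}];
 (* Write "file,count"; record 0 when a file has no "ca" entries *)
 finding =
  ToString[file] <> "," <> ToString[If[res === {}, 0, res[[1, 2]]]] <> "\n";
 WriteString[str, finding];
 x = x + 1;
 , {file, files}];
Close[str];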
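
For readers without Mathematica, the same per-file count can be sketched in Python. This is an alternative under stated assumptions, not the code running here: it streams each gzipped file line by line rather than importing it whole, and it treats any line starting with `ca,` as a .ca entry. The function names are made up for illustration.

```python
import gzip


def count_tld_entries(cdx_path, tld="ca"):
    """Count lines in a gzipped CDX file whose URL key starts with the given TLD."""
    prefix = tld + ","                     # "ca," so that "cat,..." does not match
    count = 0
    with gzip.open(cdx_path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(prefix):
                count += 1
    return count


def write_counts(cdx_paths, output_path, tld="ca"):
    """Write one "file,count" row per CDX file, like the results file above."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in cdx_paths:
            out.write(f"{path},{count_tld_entries(path, tld)}\n")
```

Streaming keeps memory use flat regardless of how large an individual index file is.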


and this is the output so far, for example:


So we’ll find some frequent appearances and then know what WARC files we want to play with in particular.
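
One way to turn those counts into a shortlist of WARC files is to tally the last field of every .ca entry, since in the sample entry above that field names the WARC file. The Python sketch below does that. It is a hypothetical follow-up step rather than something already run, and `tally_warcs` and `top_warcs` are illustrative names.

```python
import gzip
from collections import Counter


def tally_warcs(cdx_paths, tld="ca"):
    """Count, per WARC file, how many index entries start with the given TLD."""
    counts = Counter()
    prefix = tld + ","
    for cdx_path in cdx_paths:
        with gzip.open(cdx_path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if line.startswith(prefix):
                    fields = line.split()
                    counts[fields[-1]] += 1   # assumed: last field is the WARC name
    return counts


def top_warcs(counts, n=10):
    """Return the n WARC files with the most matching entries."""
    return counts.most_common(n)
```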

It’ll be a long run. So far, it’s at 6% after a few hours. But give it a few nights and it should be done. Then we’ll start looking at this a bit closer.
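
As a rough check on that estimate, the remaining time can be projected from the fraction completed, assuming progress is roughly linear (it may not be, since the CDX files differ in size). The three hours used below is an assumed stand-in for “a few hours”, not a measured figure, and the function name is made up.

```python
def estimate_remaining_hours(elapsed_hours, fraction_done):
    """Project remaining run time, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the interval (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# Assumption: "a few hours" is taken as 3 hours; 6% comes from the post.
remaining = estimate_remaining_hours(3.0, 0.06)
print(f"about {remaining / 24:.1f} more days")
```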
