WARCing Away: Shrinking Down to Canadian Text, Making It Digestible

I’ve been working on a few different projects: developing abstract tools (my ‘HistoryCrawler’ idea) from the 80TB Wide Scrape, and then implementing them on specific research questions within a historical collection of websites. Things have been really productive, which is great but also a bit bittersweet, as I’ll have to turn my attention to teaching come mid-August. But that’s part of the fun of juggling teaching, research, and service.

In any case, I left off with “Finding Specific Websites and Creating Full-Text Indexes Using WARC Tools.” I downloaded 101.22 GB of WARC files, which contained 397,221 webpages from the dot.ca domain. So just how much information did I grab in those WARC files that I’m now filtering?
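As a rough illustration of what that filtering step looks like, here is a minimal Python sketch using the warcio library. This isn’t the exact tooling from the previous post, and the file path is a placeholder; it just shows the idea of pulling dot.ca response records out of a WARC.GZ file.

```python
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator  # third-party: pip install warcio

def ca_responses(warc_gz_path):
    """Yield (url, payload bytes) for each dot.ca response record in a WARC.GZ file."""
    with open(warc_gz_path, 'rb') as stream:
        for record in ArchiveIterator(stream):  # handles the gzip layer itself
            if record.rec_type != 'response':
                continue  # skip request and metadata records
            url = record.rec_headers.get_header('WARC-Target-URI')
            host = urlparse(url).hostname or ''
            if host.endswith('.ca'):
                yield url, record.content_stream().read()
```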

Well, it was a fun experiment: this is the scale of Big Data that historians might have to work with one day.

Top Ten WARC.GZ Files: Some Statistics

Totals: 1,070,630,949 words. That’s over a billion words of information.
Websites: 1,348,676 websites, of which 397,221 are from Canadian domains.
Size: 101.22 GB compressed as WARC.GZ files, which shrinks down to 15.93 GB of plain text when run through Lynx (a sketch of this step follows the list).
Broken Down by File: The mean WARC.GZ file contains 13,486.8 websites and 10,706,309 words.
Canadian Content: If we take just the dot.ca sites with textual data from those 15.93 GB of plain text, we are down to 254,176 sites. That’s ‘only’ 196,286,341 words, or 2.67 GB of plain text. We’re actually starting to get into the realm of manageability now.
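Here is a minimal sketch of that Lynx step, wrapped in Python. The lynx command and flags are real, but treat the wrapper as an assumption about the workflow rather than my exact pipeline; it assumes lynx is installed and on the PATH.

```python
import subprocess

def html_to_text(html_path):
    """Run a saved HTML page through Lynx and return its plain-text rendering."""
    result = subprocess.run(
        ['lynx', '-dump', '-nolist', html_path],  # -nolist drops the trailing link list
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Word counts like the ones above are then just a split away:
# n_words = len(html_to_text('example.html').split())
```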

That is, we can easily ingest this into Solr and have a clustering search engine up over this chunk of the Canadian Internet, which is, by a very rough estimate, possibly 5% of the publicly accessible Canadian internet.
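For the curious, indexing that plain text is roughly this simple. A hedged sketch using the pysolr client: the core name (‘canweb’) and field names here are made up for illustration, not my actual schema.

```python
import pysolr  # third-party: pip install pysolr

# Assumed local Solr core; adjust the URL and core name for a real setup.
solr = pysolr.Solr('http://localhost:8983/solr/canweb', timeout=30)

def index_pages(pages):
    """pages: an iterable of (url, plain_text) pairs, e.g. Lynx output keyed by URL."""
    docs = [{'id': url, 'url': url, 'text': text} for url, text in pages]
    solr.add(docs)
    solr.commit()  # make the new documents searchable
```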

Onwards!
