At IIPC 2014 two weeks ago, I saw a great presentation by the CommonCrawl team. They’re a “non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.” They host their material as Amazon Public Data Sets, with an eye to making this sort of data accessible to those without terabytes upon terabytes of hard drives sitting around – and you can easily harness the power of Amazon Web Services to access it.
The catch was that this historian had never used AWS before to access public data (I’ve spun a few systems up and down in the past to tinker, but that’s about it). It took me a bit of time, as there wasn’t a catch-all place for documentation, so I thought I’d write a quick post in case it might help another humanist following along. Newbie alert: those who have used AWS before will probably learn very little from this post. But someone like me, who wants to access the data but has little experience with this particular environment, might benefit.
My starting point was their “Common Crawl WARC Examples” repository. They mentioned s3cmd, which has become my go-to way to navigate these repositories. To install s3cmd, you can download it from their GitHub repository. If you’re new to GitHub on the command line, this is very helpful. If you’re new to the command line, stay tuned, because we’ll have something up on the Programming Historian 2 soon. 🙂
To begin, then:
git clone https://github.com/s3tools/s3cmd.git
And then the standard installation procedure, as documented in the ‘INSTALL’ file (a bit hidden away, as it’s not in the README.md file):

python setup.py install

(you may need to run this with sudo, depending on where Python installs packages on your system)
Once it’s installed, you’ll want to test that s3cmd is working. To do so, try typing:

s3cmd

If it’s installed, you’ll see the --help page by default. To get started, let’s configure it.
s3cmd --configure begins an interactive configuration process. You’ll need two things: your access key and your secret key. To get those, you’ll need to create an AWS account, navigate to your account settings, and display your security credentials. This URL will work if you’re logged in.
When you create a new access key, you’ll get two values: the Access Key ID and the secret key, which you can only download and save at that moment. Save them in a secure place, and enter them into the console when prompted. There are further options around encryption, proxy access, etc.; unless those specifically apply to you, you can leave them blank, which is what I did, since I’m just futzing around.
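Once the configure step finishes, s3cmd saves your settings to a .s3cfg file in your home directory. As a rough sketch of what that file looks like (the key values here are placeholders, not real credentials):

```ini
[default]
access_key = YOUR_ACCESS_KEY_ID
secret_key = YOUR_SECRET_ACCESS_KEY
use_https = True
```

If you ever need to change your keys, you can re-run s3cmd --configure or edit this file directly.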
Once we’re set up, we can start navigating the file structure as we would on our own console. For example, try:
s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/
We know we have a sub-directory in that directory known as segments. So we can continue, a la:
s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/
And so on, until we get down to the directory we need.
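As a sketch of the next step: once you’ve drilled down into a segment, you can pull an individual file to your own machine with s3cmd get. The segment ID and file name below are placeholders, not real paths; substitute whatever the ls commands above actually show you:

```shell
# List the contents of one segment (SEGMENT_ID is a placeholder).
s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/SEGMENT_ID/

# Download one file from that segment to the current directory
# (FILE_NAME is a placeholder for one of the names the ls printed).
s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/SEGMENT_ID/FILE_NAME
```

Note that downloads from the public data sets go over the network like any other S3 transfer, so start with a single file before pulling down anything big.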
In the future, I need to figure out how to run my analysis within AWS itself so I’m not pulling everything down, but I’m now able to see where the WARC (the raw files that I normally work with), WAT (metadata), and WET (plaintext – awesome!) files live. The file formats are documented here. My programs only run on my own computer, so I’ll be pulling some files down to work with locally.
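To give a sense of what you get once a WET file is on your own machine: the files are gzipped, and inside they’re a series of WARC-style records, each with a header block followed by the page’s plain text. The record below is a hand-made stand-in for illustration, not real Common Crawl data, but the header fields shown are the ones you’d grep for:

```shell
# A stand-in for one record of an unpacked WET file
# (a real file would be unpacked first, e.g. with zcat).
cat > sample.wet <<'EOF'
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.com/
Content-Type: text/plain
Content-Length: 26

Example Domain plain text.
EOF

# Pull out the URL of every record: a quick way to survey
# which pages a WET file contains.
grep '^WARC-Target-URI:' sample.wet | awk '{print $2}'
# prints: http://example.com/
```

The same grep-and-awk pattern scales to a full WET file, since every record carries its own WARC-Target-URI header.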
But at the minimum, the CommonCrawl will let me do some longitudinal research, which is super exciting.