Reading WARC Records with Mathematica

Our notebook. Click through to find it.

Our project team uses a number of languages: Scala with warcbase, lots of shell commands when manipulating and analyzing textual data (especially social media, as Nick and I wrote about here), and Mathematica when we want to leverage the power and relative simplicity of that language.

William J. Turkel and I have been working a bit on getting WARC files to play with Mathematica. For larger numbers of files, warcbase is still the solution. But for a small collection – say a few WARCs created with webrecorder.io – this might be a lighter-weight approach. Indeed, I can see myself doing this if I went around the web with WebRecorder, grabbed some sites (say, public history sites or the like), and wanted to do some analysis on them.

Bill and I developed this together: he cooked up the record to association bit (which is really the core of this code), and I worked on getting us to be able to process entire WARCs and generate some basic analysis. It was also fun getting back into Mathematica, after living in Scala and Bash.

I’ll walk you through the notebook, which you can also find in this repository (all MIT License).

Step One: Import the File and Set up a Record

The first two commands locate the file and then open it. Right now, we have a hacky workaround where you unzip the .warc.gz file and rename it with a .txt extension.

sampleWARCFile = 
  "/Users/ianmilligan1/dropbox/git/warcbase-resources/Sample-Data/\
ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.\
archive.org-8091.warc.txt";

wFile = OpenRead[sampleWARCFile];

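We then define a function that converts a single WARC record into an association, reading the values of the various fields it contains (both metadata and content).

recordToAssociation[rec_] :=
 Module[{recmod, fields, vals},
  (* label whatever follows a blank line as a pseudo-field named CONTENT *)
  recmod = StringReplace[rec, "\r\n\r\n" -> "\r\n\r\nCONTENT: "];
  (* field names: the text before the first colon on each line *)
  fields = 
   StringTrim[StringExtract[StringSplit[recmod, "\r"], ":" -> 1]];
  (* field values: each line with its leading "name:" prefix removed *)
  vals = Map[
    StringTrim[
      StringReplace[#, 
       Shortest[StartOfString ~~ Except[":"] ..] ~~ ":" -> ""]] &, 
    StringSplit[recmod, "\r"]];
  (* pair names with values, dropping blank lines and empty CONTENT entries *)
  Return[Association[
    DeleteCases[MapThread[Rule, {fields, vals}], 
     Rule["", ""] | Rule["CONTENT", ""]]]]]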
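To see what the first step of recordToAssociation does, here is a minimal sketch run on a hand-written record. The sampleRecord string is made up for illustration and only mimics the shape of a request record; it is not taken from the sample WARC.

```mathematica
(* hypothetical record text: WARC headers, a blank line, then an HTTP request *)
sampleRecord = 
  "\r\nWARC-Type: request\r\nContent-Length: 42\r\n\r\nGET / HTTP/1.0\r\nHost: www.ndp.ca\r\n\r\n";

(* same replacement as in recordToAssociation: each blank line starts a CONTENT pseudo-field *)
marked = StringReplace[sampleRecord, "\r\n\r\n" -> "\r\n\r\nCONTENT: "];

(* split on carriage returns and trim the leftover newlines *)
lines = StringTrim[StringSplit[marked, "\r"]]
```

Each non-empty line now has the shape name: value, which is what the fields and vals steps pull apart. Because the HTTP headers sit after the blank line, they end up as keys listed after CONTENT.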

Let’s see what this looks like in action.

tempRecord = Read[wFile, Record, RecordSeparators -> {"WARC/1.0" }];

tempRecordAssociation = recordToAssociation[tempRecord];

Keys[tempRecordAssociation]

The result is:

{"WARC-Type", "WARC-Date", "WARC-Filename", "WARC-Record-ID", \
"Content-Type", "Content-Length", "CONTENT", "ip", "hostname", \
"format", "conformsTo", "operator", "publisher", "isPartOf", \
"description", "robots", "http-header-user-agent", "http-header-from"}

We can keep running that cell to walk through the file and see which fields each record contains. For example, a few records later, we’d see:

{"WARC-Type", "WARC-Target-URI", "WARC-Date", "WARC-Concurrent-To", \
"WARC-Record-ID", "Content-Type", "Content-Length", "CONTENT", \
"User-Agent", "From", "Connection", "Referer", "Host", "Cookie"}

Basically, you begin to figure out what you might be interested in.
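The record-at-a-time reading can also be tried without a WARC file on disk. The following is a minimal sketch, assuming a tiny made-up two-record string fed through StringToStream; the names demoText, demoStream, and recs are illustrative only.

```mathematica
(* hypothetical miniature WARC-like text with two records *)
demoText = 
  "WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nsoftware: demo\r\n\r\n" <> 
   "WARC/1.0\r\nWARC-Type: request\r\n\r\nGET / HTTP/1.0\r\nHost: www.ndp.ca\r\n\r\n";

demoStream = StringToStream[demoText];

(* read one record per call, splitting on the WARC version line, until the stream ends *)
recs = Reap[
    While[(rec = 
        Read[demoStream, Record, RecordSeparators -> {"WARC/1.0"}]) =!= 
      EndOfFile, Sow[rec]]][[2, 1]];

Close[demoStream];

Length[recs]
```

Each element of recs is the raw text of one record, ready to be handed to a parser such as recordToAssociation.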

Step Two: Process Content of WARC as a Stream, Visualize some Metadata Fields

Given the list of keys above, we can begin to dig through the WARC file to find some patterns.

We begin by setting the stream position back to 0, which is the start of the stream.

SetStreamPosition[wFile, 0]; (* reset to the start of the stream *)

We are now back at the beginning of the WARC file. Let’s now grab all the “host” values, for all records that contain that field.

hosts = Reap[
   While[(tempRecord = 
       Read[wFile, Record, RecordSeparators -> {"WARC/1.0"}]) =!= 
     EndOfFile, 
    tempRecordAssociation = recordToAssociation[tempRecord];
    Sow[tempRecordAssociation["Host"]];
    ]];
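Note that looking up a key that an association lacks returns a Missing[...] value rather than skipping the record. A minimal sketch of sowing only when the key is present, using two hand-built associations (demoRecords is a made-up stand-in for parsed records):

```mathematica
(* hypothetical parsed records: only the second has a Host field *)
demoRecords = {
   <|"WARC-Type" -> "warcinfo"|>,
   <|"WARC-Type" -> "request", "Host" -> "www.ndp.ca"|>};

(* Sow the Host value only when the key exists, so no Missing[...] entries are collected *)
demoHosts = Reap[
   Do[If[KeyExistsQ[r, "Host"], Sow[r["Host"]]], {r, demoRecords}]];

Flatten[Last[demoHosts]]
```

The same If[KeyExistsQ[...], Sow[...]] guard could be dropped into the While loop in place of the bare Sow.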

And now let’s tally and get a summary:

{{"v7.lscache3.c.youtube.com", 2}, {"www.davidsuzuki.org", 
  1689}, {"www.equalvoice.ca", 4644}, {"www.liberal.ca", 
  1968}, {"www.canadiancrc.com", 154}, {"www.greenparty.ca", 
  15}, {"greenparty.ca", 869}, {"www.ndp.ca", 
  447}, {"www.fairvote.ca", 465}, {"www.policyalternatives.ca", 
  596}, {"podcast.cbc.ca", 1}, {"farm3.static.flickr.com", 
  4}, {"youtube.com", 7}, {"img.youtube.com", 7}, {"images.ctv.ca", 
  1}, {"www.partivert.ca", 22}, {"www.gca.ca", 
  2}, {"www.communitywalk.com", 2}, {"www.flickr.com", 
  1}, {"vimeo.com", 2}, {"www.oee.nrcan.gc.ca", 
  2}, {"naturechallenge.org", 2}, {"e-activist.com", 
  2}, {"www.e-activist.com", 2}, {"www.naturechallenge.org", 
  1}, {"www.youtube.com", 2}, {"v18.lscache5.c.youtube.com", 
  2}, {"xfer.ndp.ca", 1}, {"www.cbs.com", 
  1}, {"v2.cache7.c.youtube.com", 1}}
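The command behind this tally is not reproduced here. One way to get a list of this shape, and to define a hostfrequency variable, might look like the following sketch. It assumes the Reap result has the form {Null, {{...}}} and that records without a Host field contributed Missing[...] values; sampleHosts is a made-up stand-in for hosts.

```mathematica
(* made-up stand-in for the hosts result: {Null, {{sown values}}} *)
sampleHosts = {Null, {{"www.ndp.ca", Missing["KeyAbsent", "Host"], 
     "www.liberal.ca", "www.ndp.ca"}}};

(* flatten the sown values, drop the Missing entries, then count each host *)
hostfrequency = Tally[DeleteMissing[Flatten[sampleHosts[[2]]]]]
```

With real data, hosts would take the place of sampleHosts.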

Or, do something zany like a Pie Chart.

PieChart[hostfrequency[[All, 2]], 
 ChartLabels -> hostfrequency[[All, 1]]]

[Figure: pie chart of host frequencies]

Step Three: Exploring Content

The metadata fields aren’t present in every record, but CONTENT is usually there. You’ll notice the following command looks similar to the one above.

SetStreamPosition[wFile, 0];
content = Reap[
   While[(tempRecord = 
       Read[wFile, Record, RecordSeparators -> {"WARC/1.0"}]) =!= 
     EndOfFile, 
    tempRecordAssociation = recordToAssociation[tempRecord];
    Sow[tempRecordAssociation["CONTENT"]];
    ]];

The results are messy: we have HTML tags, and lots of junk data within the content field. But let’s clean it all out.

First, remove HTML tags:

notags = StringReplace[Flatten[content[[2]]], 
   "<" ~~ Except[">"] .. ~~ ">" -> ""];

Then let’s split the text into words and remove all words containing &, /, @, :, or +.

words = Flatten[StringSplit[notags]];

cleaned = 
  Select[words, 
   StringFreeQ[#, {___ ~~ "&" ~~ ___, ___ ~~ "/" ~~ ___, ___ ~~ 
       "@" ~~ ___, ___ ~~ ":" ~~ ___, ___ ~~ "+" ~~ ___}] &];

Finally, remove all one-character words and convert the rest to lowercase:

lowerclean = ToLowerCase[Select[cleaned, StringLength[#] > 1 &]];

The above steps can all be combined in one command:

lowerclean = 
  ToLowerCase[
   Select[Select[
     Flatten[StringSplit[
       StringReplace[Flatten[content[[2]]], 
        "<" ~~ Except[">"] .. ~~ ">" -> ""]]], 
     StringFreeQ[#, {___ ~~ "&" ~~ ___, ___ ~~ "/" ~~ ___, ___ ~~ 
         "@" ~~ ___, ___ ~~ ":" ~~ ___, ___ ~~ "+" ~~ ___}] &], 
    StringLength[#] > 1 &]];
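To check what the cleaning pipeline keeps and drops, here is a minimal sketch run on two made-up HTML snippets. sampleContent imitates the {Null, {{...}}} shape of a Reap result and is not real WARC data; the character list passed to StringFreeQ is a shorter way of writing the same test, since StringFreeQ already looks for the characters anywhere in the word.

```mathematica
(* made-up stand-in for the content result *)
sampleContent = {Null, {{"<p>Vote Green in Ontario</p>", 
     "<a href='http://x.ca'>Contact: info@x.ca & more</a>"}}};

(* 1. strip tags, 2. split into words, 3. drop words containing & / @ : +,
   4. drop one-character words, 5. lowercase *)
sampleClean = 
  ToLowerCase[
   Select[
    Select[
     Flatten[StringSplit[
       StringReplace[Flatten[sampleContent[[2]]], 
        "<" ~~ Except[">"] .. ~~ ">" -> ""]]], 
     StringFreeQ[#, {"&", "/", "@", ":", "+"}] &], 
    StringLength[#] > 1 &]]
(* gives {"vote", "green", "in", "ontario", "more"} *)
```

The words "Contact:", "info@x.ca", and "&" are filtered out because each contains one of the excluded characters.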

Now we have a plain-text list of words. We can get a sense of what’s in a WARC file by deleting stopwords and using the built-in WordCloud command.
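The exact command is not shown here, so the following is only a sketch of how this step might look, using the built-in DeleteStopwords and WordCloud functions. sampleWords is a made-up list standing in for lowerclean.

```mathematica
(* made-up word list standing in for lowerclean *)
sampleWords = {"the", "green", "party", "of", "canada", "green", 
   "vote", "and", "vote", "green"};

(* drop common English stopwords such as "the" and "of" *)
noStops = DeleteStopwords[sampleWords];

(* word counts, most frequent first *)
SortBy[Tally[noStops], -Last[#] &]

(* WordCloud sizes each word by how often it appears in the list *)
WordCloud[noStops]
```

With real data, lowerclean would take the place of sampleWords.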

[Figure: word cloud of the WARC file’s text]

Step Four: Profit?

Now that you’ve got data into Mathematica, suddenly the opportunities are there. Want to do some light-weight NER, similar to what I did with the NDP and Conservative a few months ago? Extract locations? Sentiment analysis? Semantic Interpretation?

For example, if we semantically read the input strings and find Ontario, we can do things with that: tally locations, have country data, population data, etc. etc. Suddenly a new frontier is here.
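As a rough sketch of what that could look like for a single string, the built-in Interpreter function can try to turn a place name into an entity, and EntityValue can then look up properties of it. This assumes access to the Wolfram Knowledgebase, and the result depends on what the interpreter resolves "Ontario" to, so no output is shown here.

```mathematica
(* try to interpret a place name as an administrative division;
   this needs access to the Wolfram Knowledgebase *)
place = Interpreter["AdministrativeDivision"]["Ontario"];

(* if interpretation succeeded, look up a property of the resulting entity;
   otherwise return the failure object for inspection *)
If[Head[place] === Entity, EntityValue[place, "Population"], place]
```

The same pattern could be mapped over words or phrases pulled from the CONTENT field, keeping only those that resolve to an entity.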

Semantically Interpreting a WARC record

It’s another tool in the toolkit, and I can’t wait to see what we can do with this.
