Penn State Warcbase Workshop

Getting Started

Let’s start by opening up the spark-shell again. You’ll do so by following the steps in the setup guide.

In my case, I go to the Spark directory:

cd spark-1.6.1-bin-hadoop2.6

And then I run the following to start it, in this case making sure to change the path so that it actually points at the warcbase-core-0.1.0-SNAPSHOT-fatjar.jar file.

./bin/spark-shell --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
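
Once the shell is up, a quick sanity check: the shell creates a SparkContext for you, bound to sc, which every script below relies on. Typing

sc.version

should return the Spark version string (something like res0: String = 1.6.1).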

Our First Script

Let's re-run the script from the installation, just to make sure we can get things working.

Enter

:paste

And now paste the script below, making sure to change the path so that you’re correctly finding the warcbase-core directory.

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Load the example ARC file, keep only the valid pages, extract each
// page's domain, count the domains, and take the top ten.
val r = RecordLoader.loadArchives("/home/ubuntu/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

We should get these results:

r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))

This counts the domains present in the example ARC file, which was downloaded from the Internet Archive.
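
If you're curious what countItems() is doing under the hood, it's essentially the classic Spark word-count pattern. Here's a rough sketch (not Warcbase's exact implementation) that produces the same top-ten list with plain RDD operations:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Tally each domain, then sort by count in descending order --
// roughly what countItems() does for us in a single call.
val counted = RecordLoader.loadArchives("/home/ubuntu/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .map(domain => (domain, 1))
  .reduceByKey(_ + _)
  .sortBy(f => f._2, ascending = false)
  .take(10)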

Trying it on a real file

Now I’d like you to download this file. It’s only 100MB and will be our sample dataset.

Remember where you’ve downloaded it.

Now let’s try our script again on that file, changing the path to point at wherever you downloaded it:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/home/i2millig/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

You should get:

r: Array[(String, Int)] = Array((www.equalvoice.ca,4629), (www.liberal.ca,1968), (greenparty.ca,693), (www.policyalternatives.ca,584), (www.fairvote.ca,465), (www.ndp.ca,415), (www.davidsuzuki.org,362), (www.canadiancrc.com,83), (youtube.com,7), (www.flickr.com,1))

OK! Let’s try saving that to a file, just to get used to working that way. It’ll be the same script, except instead of just showing the top ten, we’ll save the full results to a text file.

What you’ll be doing here is setting up a directory to hold your results. Importantly, the directory cannot already exist – warcbase wants to create it for you!

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Same count as before, but saveAsTextFile writes the full results.
// Remember: the output directory must not already exist.
val r = RecordLoader.loadArchives("/home/i2millig/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .saveAsTextFile("/home/i2millig/warcbase-results/domains")

Now find the directory where you saved it. Inside it you will see a file called part-00000. If we were running this on a cluster, we would have many of these part files, and we’d probably combine them together. Since we’re running locally, you just have one.
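
Incidentally, if you ever want Spark itself to produce a single part file, you can coalesce the results down to one partition before saving. A minimal sketch (the domains-single output path is just an illustration; as always, the directory must not already exist):

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// coalesce(1) merges every partition into one, so saveAsTextFile
// writes a single part-00000 even when many executors did the work.
// Fine for small results like domain counts; avoid it for huge outputs.
RecordLoader.loadArchives("/home/i2millig/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .coalesce(1)
  .saveAsTextFile("/home/i2millig/warcbase-results/domains-single")

For now, though, our single local part file is all we need.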

Rename it to TORONTO-227-Domains.txt. When you open it, it should read:

(www.equalvoice.ca,4629)
(www.liberal.ca,1968)
(greenparty.ca,693)
(www.policyalternatives.ca,584)
(www.fairvote.ca,465)
(www.ndp.ca,415)
(www.davidsuzuki.org,362)
(www.canadiancrc.com,83)
(youtube.com,7)
(www.flickr.com,1)
(www.communitywalk.com,1)
(v7.lscache3.c.youtube.com,1)
(www.gca.ca,1)
(www.youtube.com,1)
(v18.lscache5.c.youtube.com,1)

More Sophisticated Derivatives

Let’s extract some plain text.

import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}

// For each valid page, keep its crawl date, domain, and URL, plus
// the page content with the HTML stripped out.
RecordLoader.loadArchives("/home/i2millig/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/home/i2millig/warcbase-results/text")

Rename this as TORONTO-227-Text.csv. When you open it, you’ll see the plain text of the pages in the WARC. Now we can do some additional filtering if we want. Imagine we just want the pages from the www.equalvoice.ca domain.

import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}

// keepDomains narrows the records to the domains in the Set.
RecordLoader.loadArchives("/home/i2millig/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.equalvoice.ca"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/home/i2millig/warcbase-results/text/equalvoice")

Rename this as TORONTO-227-EqualVoice.csv. Now you have the plain text of one domain!

Take a look at the other filtering options in the Warcbase documentation. Try one out!
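
For example, you can chain filters together with plain Spark operations. Here’s a hedged sketch that keeps two domains and then narrows to a single crawl month using getCrawlDate; it assumes crawl dates come back as yyyyMMdd strings, which is worth spot-checking against your own records:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Keep two domains, then narrow to December 2009 with an ordinary
// Spark filter on the crawl-date string.
val sample = RecordLoader.loadArchives("/home/i2millig/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.equalvoice.ca", "www.liberal.ca"))
  .filter(r => r.getCrawlDate.startsWith("200912"))
  .map(r => (r.getCrawlDate, r.getUrl))
  .take(10)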

Now let’s try doing a network diagram.

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import StringUtils._

// Extract every (source, target) link, reduce both ends to their
// domains (dropping the www. prefix), discard empties, count each
// pair, and keep pairs that appear more than five times.
val links = RecordLoader.loadArchives("/home/i2millig/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
  .filter(r => r._1 != "" && r._2 != "")
  .countItems()
  .filter(r => r._2 > 5)

links.saveAsTextFile("/home/i2millig/warcbase-results/links/all")

Check out the file, which you can rename as TORONTO-227-Links.txt. The first few lines should look like this:

((liberal.ca,liberal.ca),103668)
((equalvoice.ca,equalvoice.ca),17036)
((ndp.ca,ndp.ca),15311)
((davidsuzuki.org,davidsuzuki.org),13903)
((greenparty.ca,greenparty.ca),9715)
((equalvoice.ca,gettingtothegate.com),4259)
((equalvoice.ca,thestar.com),4259)
((equalvoice.ca,snapdesign.ca),4259)
((liberal.ca,twitter.com),3936)
((canadiancrc.com,translate.google.com),2584)

Let’s run it again, this time saving it as a Gephi file. Unlike the other ones, this will actually generate a single file.

import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader, WriteGDF}
import org.warcbase.spark.rdd.RecordRDD._

// Same link extraction, but keeping the crawl date alongside each
// domain pair so it can be written out in GDF format for Gephi.
val links = RecordLoader.loadArchives("/home/i2millig/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGDF(links, "/home/i2millig/warcbase-results/links/all-links.gdf")

Save this somewhere you want to keep it; we’ll come back to it in a bit.

Free Play

Now I’d like you to play around with the data. Is there anything else you want to explore? I’ll circulate.

Working with your own data

Now let’s try it with your own data. The first thing to learn is that any path you point at can also be a directory! You can use *.warc.gz or *.gz, as Warcbase handles ARC and WARC files interchangeably.

What I’d like you to do is try the following (a starter sketch appears after the list):
– create a domain file for your collection;
– create a link file for your collection;
– create a GDF file for your collection;
– create a plain text file for your collection;
– and create a more specialized file for your collection (using some of the filtering tools that Warcbase has).
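
To get you started, here’s a skeleton of the first task; the paths are hypothetical, so swap in wherever your collection and results actually live. Because loadArchives accepts globs, pointing it at *.gz will pick up ARC and WARC files alike:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Domain counts for a whole directory of archive files.
// "/path/to/your/collection" and the results path are placeholders.
RecordLoader.loadArchives("/path/to/your/collection/*.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .saveAsTextFile("/path/to/your/results/domains")

The link, GDF, and plain-text versions follow the same pattern as the scripts above; only the transformations in the middle change.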

Gephi

Let’s start with Gephi, and we’ll use the file you generated from the sample data.

Open up Gephi. You’ll see the following:

[Screenshot: Gephi’s welcome screen]

What I’d like you to do is select Open Graph File, and then select the GDF file you created. In the case above, it is all-links.gdf.

[Screenshot: the Open Graph File dialog, with all-links.gdf selected]

You’ll see the import report. Just click through to create a new graph and press OK. Now you’ll see this:

[Screenshot: the newly imported graph]

Hurray. Now let’s make this meaningful. In my case, some of my windows have gone askew, so I go to Window -> Reset Windows. Now it looks like this:

[Screenshots: the Gephi workspace after resetting the windows]

Let’s run some basic statistics. Click on “Statistics” on the right-hand side, and then run Modularity and PageRank.

[Screenshot: the Statistics panel after running Modularity and PageRank]

Now let’s use those. On the left-hand side, select Appearance and go to Nodes (the nodes are the domains in this graph). Let’s make them bigger based on how often somebody links to them.

So click “Ranking,” click the little growing set of circles, and then select “Degree.” Let’s set a min size of 5 and a max size of 20. It’ll look like this:

[Screenshot: the Appearance panel, ranking nodes by Degree with sizes from 5 to 20]

Now we press “Apply” and the graph changes to this:

[Screenshot: the graph with nodes resized by degree]

Now let’s make it so the labels appear. Click the “T” in the bottom toolbar. You’ll see a jumble!

[Screenshots: the graph with labels turned on, overlapping badly]

What a joke! How can we make this intelligible?

We return to the “Appearance” tab and select the tT button. Set it up like so:

[Screenshot: the Appearance panel’s label-size settings]

Now let’s add some colour. I won’t explain modularity here (I will in the workshop) but let’s colour the nodes based on their modularity class. Do this like so:

[Screenshot: the Appearance panel, colouring nodes by modularity class]

Our graph now looks like this:

[Screenshot: the graph coloured by modularity class]

Finally, let’s do some arranging. Let’s futz with the layout (although this one is pretty good by default, to be honest). First, try “Label Adjust” and run it until the labels look good. I then decided to run a standard Yifan Hu, which separates out the disconnected components of the graph. This is what happens when you do that:

[Screenshot: the graph after the Label Adjust and Yifan Hu layouts]

OK. Now that we know the ropes, let’s try it on your own collections.