Warcbase Installation

If you already have Warcbase installed, check out the guide here!

Installing Warcbase

OS X

On OS X, Warcbase requires a few dependencies. Begin by installing Homebrew. On the command line, you can then run the following:

brew install git
brew install maven
brew cask install java
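Before going further, you may want to confirm that these tools actually ended up on your PATH. The following is only a minimal sketch: missing_tools is a hypothetical helper name, not something Homebrew or Warcbase provides.

```shell
# missing_tools (hypothetical helper): print every named command
# that cannot be found on the current PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: prints nothing when all three tools are installed.
#   missing_tools git mvn java
```

If the function prints a name, re-run the matching install command before continuing.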

Now we need to set JAVA_HOME. On my machine, the command is:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home

Your path will probably differ. To see which JDK versions you have installed, type:

ls /Library/Java/JavaVirtualMachines/

Then plug the resulting version into your JAVA_HOME variable. For example, if you saw jdk1.8.0_131.jdk in the list, you would type:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
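Picking the directory by hand can also be scripted. This is a sketch under a few assumptions: jdk_home is a hypothetical helper name, the JDK bundles follow the jdkX.Y.Z_NN.jdk naming shown above, and your sort command supports the -V (version sort) option.

```shell
# jdk_home (hypothetical helper): print the Contents/Home path of the
# highest-versioned *.jdk bundle found in the given directory.
jdk_home() {
  jvm_dir="${1:-/Library/Java/JavaVirtualMachines}"
  # Keep only *.jdk entries; version sort puts 1.8.0_131 after 1.8.0_74.
  newest=$(ls "$jvm_dir" 2>/dev/null | grep '\.jdk$' | sort -V | tail -n 1)
  if [ -z "$newest" ]; then
    echo "no JDK found in $jvm_dir" >&2
    return 1
  fi
  echo "$jvm_dir/$newest/Contents/Home"
}

# Usage:
#   export JAVA_HOME="$(jdk_home)"
```

Note that this chooses the highest version number, which may not be the one you want if several major Java versions are installed side by side.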

With Git and Maven installed and JAVA_HOME set, you are ready to install Warcbase. Clone the repository:

git clone http://github.com/lintool/warcbase.git

Now go to the warcbase directory (cd warcbase) and build just warcbase-core:

mvn clean package -pl warcbase-core -DskipTests
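Once the build finishes, it can help to confirm that it produced the "fatjar" that Spark will need later. A minimal sketch, assuming the jar lands in warcbase-core/target with a name ending in -fatjar.jar (as in the spark-shell command used in this guide); find_fatjar is a hypothetical helper name.

```shell
# find_fatjar (hypothetical helper): print the path of the warcbase-core
# fatjar under the given Warcbase checkout (default: current directory).
find_fatjar() {
  target_dir="${1:-.}/warcbase-core/target"
  for jar in "$target_dir"/warcbase-core-*-fatjar.jar; do
    # An unmatched glob stays literal, so test that the file really exists.
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no fatjar under $target_dir; did the Maven build finish?" >&2
  return 1
}

# Usage, from inside the warcbase directory:
#   find_fatjar
```

The printed path is the one to pass to --jars when starting the Spark shell.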

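Now we need to install Spark, whose “Spark Shell” lets us interface with Warcbase. As of April 20th, use the version below, and download it to a directory outside the warcbase checkout. (If wget is not available on your machine, brew install wget will provide it.)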

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz

and then un-tar it using:

tar -xvf spark-1.6.1-bin-hadoop2.6.tgz

We are almost ready! Change to the Spark directory (spark-1.6.1-bin-hadoop2.6) and run the following command, making sure that --jars points at the Warcbase jar you built:

./bin/spark-shell --jars ~/penn/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar

You should now be at the scala> prompt.

Type :paste and press enter.

Now paste the following script and then press Ctrl+D. Make sure to change the path to example.arc.gz so that it matches where you cloned Warcbase!

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/Users/ianmilligan1/penn/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Success? If the shell prints an array of domains with their counts, everything is working.

Now you’re ready to move on to the guide.

Linux

The process is similar to the OS X instructions above, with a few differences.

First, install git:

sudo apt-get install git

Next, download Apache Maven manually:

wget http://apache.mirror.gtcomm.net/maven/maven-3/3.5.0/binaries/apache-maven-3.5.0-bin.tar.gz

and then un-tar it using:

tar -xvf apache-maven-3.5.0-bin.tar.gz

Now we need to set the Maven environment variables so that mvn can be run from any directory:

export M2_HOME=/home/ubuntu/apache-maven-3.5.0
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

If you extracted Maven somewhere other than /home/ubuntu, change M2_HOME accordingly.
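Variables set with export only last for the current terminal session. If you want them to survive a new login, one option is to append them to your shell startup file. The sketch below assumes your shell is bash and reads ~/.bashrc; append_once is a hypothetical helper name.

```shell
# append_once FILE LINE (hypothetical helper): append LINE to FILE
# unless an identical line is already present, so re-running is safe.
append_once() {
  file="$1"
  line="$2"
  touch "$file"
  # -x matches whole lines only, -F treats LINE as a literal string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep $M2_HOME unexpanded until the file is read):
#   append_once ~/.bashrc 'export M2_HOME=/home/ubuntu/apache-maven-3.5.0'
#   append_once ~/.bashrc 'export M2=$M2_HOME/bin'
#   append_once ~/.bashrc 'export PATH=$M2:$PATH'
```

Open a new terminal (or run source ~/.bashrc) for the change to take effect.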

Now we need to install Java. The following works for Ubuntu 14:

sudo apt-get install openjdk-7-jdk
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

And the following for Ubuntu 16:

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-7-jdk
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
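If the JDK ends up somewhere other than the path above, JAVA_HOME can usually be worked out from where javac really lives. This is a sketch rather than an official method: java_home_from is a hypothetical helper name, and it assumes GNU readlink (with the -f option) and a JDK laid out as <root>/bin/javac.

```shell
# java_home_from PATH (hypothetical helper): resolve all symlinks in the
# given javac path and strip the trailing /bin/javac to get the JDK root.
java_home_from() {
  javac_path=$(readlink -f "$1") || return 1
  echo "${javac_path%/bin/javac}"
}

# Usage:
#   export JAVA_HOME="$(java_home_from "$(command -v javac)")"
```

On Ubuntu, /usr/bin/javac is a chain of symlinks into /usr/lib/jvm, which is why the symlinks have to be resolved first.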

Phew, now we’re ready to install Warcbase. Change to the directory where you want to install it and clone the repository:

git clone http://github.com/lintool/warcbase.git

Now let’s build warcbase-core:

cd warcbase
mvn clean package -pl warcbase-core -DskipTests

And now we can install Spark shell! As of April 20th, use the version below, and download it to a directory outside the warcbase checkout:

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz

and then un-tar it using:

tar -xvf spark-1.6.1-bin-hadoop2.6.tgz

We are almost ready! Change to the Spark directory (spark-1.6.1-bin-hadoop2.6) and run the following command, making sure that --jars points at the Warcbase jar you built:

./bin/spark-shell --jars ~/penn/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar

You should now be at the scala> prompt.

Type :paste and press enter.

Now paste the following script and then press Ctrl+D. Make sure to change the path to example.arc.gz so that it matches where you cloned Warcbase!

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/Users/ianmilligan1/penn/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Success? If the shell prints an array of domains with their counts, everything is working.

Now you’re ready to move on to the guide.