If you already have Warcbase installed, check out the guide here!
Installing Warcbase
OS X
On OS X, Warcbase requires a few dependencies. Begin by installing Homebrew. On the command line, you can then run the following:
brew install git
brew install maven
brew cask install java
Now we need to set JAVA_HOME. On my machine, it is:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home
To see where yours is, type:
ls /Library/Java/JavaVirtualMachines/
Then take the resulting version and plug it into your JAVA_HOME variable. For example, if you saw jdk1.8.0_131.jdk in the list, you would type:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
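Before cloning, it can help to confirm that everything the build relies on is in place. The following is only a minimal sketch and not part of Warcbase: missing_tools and check_java_home are hypothetical helper names introduced here for illustration.

```shell
# Hypothetical helpers (not part of Warcbase) for checking the build prerequisites.

# Print the name of every given tool that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed only if the given directory contains an executable bin/java.
check_java_home() {
  if [ -n "$1" ] && [ -x "$1/bin/java" ]; then
    echo "JAVA_HOME looks usable: $1"
  else
    echo "No java binary found under '$1/bin'"
    return 1
  fi
}

# Lists whichever of git and mvn are missing, then checks the current JAVA_HOME.
missing_tools git mvn
check_java_home "${JAVA_HOME:-}" || true
```

If missing_tools prints nothing and check_java_home reports a usable path, you should be ready for the next step.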
Now that Git and Maven are installed and JAVA_HOME is set, you can install Warcbase. Clone the repo with:
git clone http://github.com/lintool/warcbase.git
Now go to the warcbase directory (cd warcbase) and build just warcbase-core:
mvn clean package -pl warcbase-core -DskipTests
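The --jars argument used later in this guide expects the fat jar that this build leaves under warcbase-core/target. As a rough sketch, you could confirm that the jar was actually built before moving on. Here find_fatjar is a hypothetical helper, and the file name pattern is an assumption based on the warcbase-core-0.1.0-SNAPSHOT-fatjar.jar name that appears in this guide.

```shell
# Hypothetical helper: print the path of the first fat jar found in a Warcbase checkout.
# $1 is the path to the cloned warcbase directory (defaults to the current directory).
find_fatjar() {
  for jar in "${1:-.}"/warcbase-core/target/warcbase-core-*-fatjar.jar; do
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  echo "No fat jar found; did the Maven build finish?" >&2
  return 1
}

# Run from inside the warcbase directory after the build.
find_fatjar . || true
```

The path it prints is the one to hand to --jars when you start the Spark shell.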
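Now we need to install Spark, which provides the “Spark Shell” we use to interface with Warcbase. As of April 20th, use the version below, and download it into a different directory rather than inside the warcbase directory. (If wget is not available on your machine, brew install wget will provide it.)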
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
and then un-tar it using:
tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
We are almost ready! Change to the Spark directory (spark-1.6.1-bin-hadoop2.6), and then run the following command (make sure to point --jars at the Warcbase jar you just built).
./bin/spark-shell --jars ~/penn/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
You should now be at a prompt.
Type :paste and press Enter.
Now paste the following script and then press Ctrl+D. Make sure to change the path appropriately!
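import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Load the sample ARC file from warcbase-core's test resources (change this path!)
val r = RecordLoader.loadArchives("/Users/ianmilligan1/penn/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  // Keep valid pages, map each one to its domain, count per domain, take 10 results
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)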
Success?
Now you’re ready to move on to the guide.
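If you are curious what the test script computes: it extracts the domain of each page, counts how often each domain occurs, and keeps ten results. As a loose analogy only, here is the same idea applied to a plain list of URLs. This is ordinary shell rather than Warcbase, count_domains is a hypothetical name, and the sketch assumes that countItems orders its results from most to least frequent. The host extraction is deliberately naive: it takes whatever sits between the second and third slash.

```shell
# Hypothetical analogy (not Warcbase): read URLs on stdin and print
# "count domain" pairs, most frequent first, at most 10 rows.
count_domains() {
  awk -F/ 'NF >= 3 { print $3 }' |   # the third /-separated field is the host
    sort |                           # group identical hosts together
    uniq -c |                        # count each group
    sort -rn |                       # largest count first
    head -n 10 |                     # keep ten rows, like take(10)
    awk '{ print $1, $2 }'           # normalize the spacing uniq adds
}

printf '%s\n' \
  'http://www.archive.org/' \
  'http://www.archive.org/index.php' \
  'http://example.com/a' | count_domains
```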
Linux
The steps are similar to those for OS X above, with a few differences.
First, install git:
sudo apt-get install git
Download Apache Maven manually:
wget http://apache.mirror.gtcomm.net/maven/maven-3/3.5.0/binaries/apache-maven-3.5.0-bin.tar.gz
and untar it:
tar -xvf apache-maven-3.5.0-bin.tar.gz
Now we need to set the Maven variables so that we can run it from any directory:
export M2_HOME=/home/ubuntu/apache-maven-3.5.0
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
You may need to change your paths accordingly (in M2_HOME).
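Note that export only affects the current shell session, so these variables disappear when you close the terminal. One way to keep them is to append the lines to ~/.bashrc, sketched here under the assumption that your shell is bash and reads that file. append_once is a hypothetical helper name; it avoids adding the same line twice if you run it again.

```shell
# Hypothetical helper: append a line to a file unless that exact line is already there.
append_once() {
  line=$1
  file=$2
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage (uncomment to modify your own ~/.bashrc; adjust M2_HOME to your path):
# append_once 'export M2_HOME=/home/ubuntu/apache-maven-3.5.0' "$HOME/.bashrc"
# append_once 'export M2=$M2_HOME/bin' "$HOME/.bashrc"
# append_once 'export PATH=$M2:$PATH' "$HOME/.bashrc"
```

The single quotes matter: they keep $M2_HOME and $PATH unexpanded, so the variables are resolved each time a new shell starts.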
Now we need to install Java. The following works for Ubuntu 14:
sudo apt-get install openjdk-7-jdk
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
And the following for Ubuntu 16:
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-7-jdk
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Phew, now we’re ready to install Warcbase. Change to the directory where you want to install it, and clone the repo:
git clone http://github.com/lintool/warcbase.git
Now let’s build warcbase-core:
cd warcbase
mvn clean package -pl warcbase-core -DskipTests
And now we can install Spark, which provides the Spark shell! As of April 20th, use the version below, and download it into a different directory rather than inside the warcbase directory.
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
and then un-tar it using:
tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
We are almost ready! Change to the Spark directory (spark-1.6.1-bin-hadoop2.6), and then run the following command (make sure to point --jars at the Warcbase jar you just built).
./bin/spark-shell --jars ~/penn/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
You should now be at a prompt.
Type :paste and press Enter.
Now paste the following script and then press Ctrl+D. Make sure to change the path appropriately!
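import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Load the sample ARC file from warcbase-core's test resources (change this path!)
val r = RecordLoader.loadArchives("/Users/ianmilligan1/penn/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  // Keep valid pages, map each one to its domain, count per domain, take 10 results
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)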
Success?