Exploring the GeoCities Web Archive with Warcbase & Spark: Getting Started

Nick Ruest and I had some great news a few weeks ago: a collection of GeoCities WARCs was on its way on a few hard drives. I’ve previously done quite a bit of work on the GeoCities torrent, but as we’ve been doing parallel development on warcbase while working with the torrent, it’s been difficult to have one set of tools talk to our earlier dataset. Once we have all the files in WARC format, as we do now, we can use warcbase to generate derivative datasets.

Everybody should win, in theory: it helps research into GeoCities, research into warcbase, and research into web archive use more generally.

Step One: Ingesting the Data

Once the hard drives arrived, it was fun to watch the data populate our server at York University as Nick supervised the time-consuming job of moving over 4TB of data off the two drives.

Once we had the data, it looked beautiful:

[Screenshot: the GeoCities WARCs on our server]

Now to begin working with it.

Step Two: Early Explorations – Prototyping in Spark Notebook

I’ve got two early-stage goals for getting this data into a usable format: first, a full-text index for our research team to access; second, a derivative dataset of links for network analysis (and, hopefully, one that we can release as a research derivative).
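The links work is what the rest of this post walks through. For the full-text side, the eventual job will presumably be a plain-text dump that we then index; here is a rough sketch from memory of the warcbase documentation (the RemoveHTML helper, the getDomain accessor, and the output path are all assumptions on my part):

import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

// Sketch: dump (crawl date, domain, URL, plain text) records for later full-text indexing.
// RemoveHTML and getDomain are assumed here; check against the warcbase API before running.
RecordLoader.loadWarc("/mnt/vol1/data_sets/geocities/warcs/*", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("/mnt/vol1/derivative_data/geocities/fulltext")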

The first step was to prototype using our warcbase Spark Notebook interface. It isn’t designed for full-fledged production jobs, but rather for trying out scripts to see how they work. I copied a few WARCs down to my local system (using scp) and began testing out some of the scripts we’d use.

I extracted some test links using the following script:

import org.warcbase.spark.matchbox.{ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

// Extract (source, destination) link pairs from a single test WARC;
// take(10) just previews the first few results in the notebook.
RecordLoader.loadWarc("/Users/ianmilligan1/desktop/local-geocities/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("geocities.com/"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (f._1.replaceAll("^\\s*www\\.", ""), f._2.replaceAll("^\\s*www\\.", ""))))
  .take(10)

The results were both exciting and terrifying. One WARC file yielded about 50,000 links. That’s doable in Gephi, but we have 8,897 WARCs, so we’re potentially looking at somewhere in the range of 300 to 450 million links. That’s not going to be fun to play with, or even a dataset that we could easily share.
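For the back-of-envelope math, assuming that one test WARC is roughly typical of the rest:

// Rough estimate only: treats every WARC as yielding about as many links as the test file.
val linksPerWarc = 50000L
val warcCount = 8897L
val roughTotal = linksPerWarc * warcCount  // about 445 million links, the top of that range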

For network analysis, then, we’re presented with one of our first problems:

  • Previous tests with warcbase have been of big collections with hundreds of domains, so we generally aggregated at the level of domain – a link from liberal.ca to geocities.ca, for example, regardless of what sub-page or even sub-domain it was on;
  • 400 million links is not fun to play with;
  • So we’ll need some other way to aggregate.

URL structures in GeoCities were consistent until 1999, largely looking like: http://geocities.com/[NEIGHBOURHOOD]/[SUBDIVISIONS (optional)]/[FOUR-DIGIT ADDRESS]/content

After 1999, they went to vanity URLs, like http://geocities.com/ianmilligan1.

What we want is for a page like http://geocities.com/EnchantedForest/Grove/1234/index.html (and all of its sibling pages) to have its outbound links aggregated under http://geocities.com/EnchantedForest/Grove/1234, for all of the pages under http://geocities.com/ianmilligan1/ to be aggregated the same way, and so forth. In other words, we want to lump the index pages, the awards pages, and the cat pages all under one Gephi node per complete site.
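To make that concrete, here is a rough sketch of the sort of helper we have in mind. This is not warcbase code: the function name and the regular expressions are placeholders based on the URL patterns described above, and they would need testing against the real data before we trusted them.

// Sketch: collapse a GeoCities URL down to one node per site (illustrative only).
// Neighbourhood sites: geocities.com/Neighbourhood[/Subdivision]/1234/... becomes geocities.com/Neighbourhood[/Subdivision]/1234
// Vanity sites: geocities.com/username/... becomes geocities.com/username
def siteLevelUrl(url: String): String = {
  val neighbourhood = """(?i)^https?://(?:www\.)?geocities\.com(/[^/]+(?:/[^/]+)*?/\d{4})(?:/.*)?$""".r
  val vanity = """(?i)^https?://(?:www\.)?geocities\.com/([^/]+)(?:/.*)?$""".r
  url match {
    case neighbourhood(site) => "geocities.com" + site
    case vanity(user) => "geocities.com/" + user
    case _ => url
  }
}

// siteLevelUrl("http://geocities.com/EnchantedForest/Grove/1234/index.html")
//   returns "geocities.com/EnchantedForest/Grove/1234"
// siteLevelUrl("http://geocities.com/ianmilligan1/awards.html")
//   returns "geocities.com/ianmilligan1"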

That said, even if domains aren’t a useful unit within the GeoCities link structure, they’re useful for links to the outside web. So Nick went to work on our first big job: extracting all of the domain-level links. Did people begin linking to MySpace.com in GeoCities’ final days? Which parts of GeoCities linked to which domains around the Web? We’ll find out soon.

import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

// Extract (crawl date, source domain, destination domain) links across the whole collection,
// keep only combinations that occur more than five times, and write them out.
RecordLoader.loadWarc("/mnt/vol1/data_sets/geocities/warcs/*", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractTopLevelDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractTopLevelDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .saveAsTextFile("/mnt/vol1/data_sets/geocities/geocities.sitelinks")
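One side note on that script: saveAsTextFile writes out the raw Scala tuples. If we end up wanting something Gephi can ingest directly, the tail of the same pipeline could flatten each record into a comma-separated edge with a weight. This is a sketch only, assuming countItems yields ((date, source, target), count) pairs as the filter above implies; the output path is made up:

import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

// Sketch: same pipeline as above, but saved as "date,source,target,count" lines.
RecordLoader.loadWarc("/mnt/vol1/data_sets/geocities/warcs/*", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractTopLevelDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractTopLevelDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .map { case ((date, src, dst), count) => s"$date,$src,$dst,$count" }
  .saveAsTextFile("/mnt/vol1/derivative_data/geocities/geocities.sitelinks-csv")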

To be sure we’re not missing patterns, though, we’re going to need a dump of every single URL in GeoCities. This brings us to our second big job.

Step Three: All URL List

This is where we’re at now. We have the WARCs, and we know we want to transform URLs so that each individual website becomes a node (as opposed to every individual page within each website).

We now switch to the spark-shell (using ./bin/spark-shell --driver-memory 40G --jars ~/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar) and run this command:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Dump the URL of every valid page in the collection to a text file.
RecordLoader.loadWarc("/mnt/vol1/data_sets/geocities/warcs", sc)
  .keepValidPages()
  .map(r => r.getUrl)
  .saveAsTextFile("/mnt/vol1/derivative_data/geocities/url-list")
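Once that URL list exists, one way we might use it to check our assumptions is to collapse every URL with the hypothetical siteLevelUrl helper sketched earlier and count how many captured pages would fall under each prospective node. Again a sketch only, using plain Spark; the output path is made up:

// Sketch: count captured pages per prospective site-level node.
// Relies on the illustrative siteLevelUrl helper from earlier.
sc.textFile("/mnt/vol1/derivative_data/geocities/url-list")
  .map(url => (siteLevelUrl(url), 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .saveAsTextFile("/mnt/vol1/derivative_data/geocities/pages-per-site")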

Now it becomes a tedious exercise in tweaking memory settings and fighting for heap space, so I’ll leave it at that. There’s a reason I woke up early on Sunday to bring this together: this is going to take a while.

Stay tuned for more explorations.


Current Status: Waiting (and drinking coffee)
