Exploring the GeoCities Web Archive with Warcbase & Spark: Links (or how we can use warcbase to find amazing sites to ask historical questions!)

[Image: not just spaghetti and meatballs, but the starting point for research.]

In my last post, we left off with scripts running to extract all URLs and a link diagram. They finished decently quickly – about three days on our rho server at York University, or about 30 minutes on our roaringly-fast cluster. Given that hopefully we will be running these only once or twice at first, even three days isn’t too bad.

This is a relatively technology-heavy post, but I think even relative newcomers to digital history might find this interesting as an entree into several tools that we might use to extract meaningful historical information from big datasets.

Teaser Trailer

[Screenshot: the protest page we'll encounter below.]

I use warcbase to eventually find significant pages that are both hubs of community and that hint towards a protest movement in GeoCities.

Want to learn how? Read on…

Generating the Data

So we ran this to grab the links using warcbase.

import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

// "warcs" is the path to the collection's WARC files
RecordLoader.loadWarc(warcs, sc)
  .keepValidPages()
  .keepDomains(Set("geocities.com"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1, f._2)))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .saveAsTextFile("all-links.txt")
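
A quick aside for anyone following along: this is the sort of script you paste into an interactive spark-shell session with a warcbase fatjar on the classpath. A minimal sketch of the launch, with an illustrative jar path and memory setting rather than the exact invocation we used:

# Start an interactive Spark shell with a locally built warcbase jar on the classpath.
# The jar path and driver memory are illustrative; point --jars at your own warcbase build.
spark-shell --jars /path/to/warcbase-fatjar.jar --driver-memory 8G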

This generated a big file like so:

((20090903,http://geocities.com/saganaki2000/ADSLGR/adslgr.htm,http://www.adslgr.com),15337)
((20091026,http://geocities.com/saganaki2000/ADSLGR/adslgr.htm,http://www.adslgr.com),15337)
((20091027,http://geocities.com/spankbank69hard/,http://pg.photos.yahoo.com/ph/spankbank69hard/my_photos/),9807)
((20090903,http://geocities.com/spankbank69hard/index.html,http://pg.photos.yahoo.com/ph/spankbank69hard/my_photos/),9807)
((20091027,http://geocities.com/CollegePark/Locker/8187/,http://www.comercialuruapan.com),8056)
((20090903,http://geocities.com/CollegePark/Locker/8187/,http://www.comercialuruapan.com),8056)

It was big, actually: about 403GB. This isn't too surprising, given that there are over 186 million HTML documents in this collection. But it's big enough that even counting how many links it contains is a time-consuming job.

So what can we do to make this play with a network analysis suite?

Wrangling the Data to get into Gephi: The EnchantedForest Case Study

Luckily, I don’t actually have immediate plans to work with every single link in GeoCities. I’m interested in the EnchantedForest, a “neighbourhood” designed by kids for kids and about kids. Ideally, I’d like to write a historical article on this.

So let’s do some parsing to get that 403GB down to something we can analyze.

First, let’s select all links that contain the string “enchantedforest” in the URL. We’ll be case insensitive. This wasn’t a trivial job, taking most of the day on our server, but if I was running this on the cluster I could have used hadoop grep. The command to do this was:

cat all-links.txt | grep -i "enchantedforest" > enchantedforest-links.txt

This took us down to about 2.9GB. Still big, but not unduly so. The next step was to create entities. Consider that individual websites have lots of pages, such as:

http://www.geocities.com/EnchantedForest/Grove/1234/index.html
http://www.geocities.com/EnchantedForest/Grove/1234/pets/cats.html
http://www.geocities.com/EnchantedForest/Grove/1234/pets/dogs.html
http://www.geocities.com/EnchantedForest/Grove/1234/pets/rabbits.html

I imagine this is what the site would look like.

(Note that I’ve described basically the coolest site around)

But in this case, I'd like to treat all of those URLs as just one 'page': http://www.geocities.com/EnchantedForest/Grove/1234.

Luckily, the GeoCities pages that I'm interested in within this neighbourhood all had a four-digit number closing off the site's path. I used a regex to find those numbers and bring everything under each site together.

sed 's/[()]*//g; s/^[^,]*,//; s/\([0-9]\{4\}\)[^,]*/\1/g' enchantedforest-links.txt > enchantedforest-entities-cleaned1.txt
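
That one-liner is dense, so here is roughly what each of its three substitutions does (just a reading of the command above, not an extra step):

# s/[()]*//g                   strip the wrapping parentheses from each record
# s/^[^,]*,//                  drop the first field (the crawl date) up to the first comma
# s/\([0-9]\{4\}\)[^,]*/\1/g   cut everything after a four-digit run, up to the next comma,
#                              collapsing every page under a site into a single entity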

Running that over the file turned the records into lines like:

http://geocities.com/EnchantedForest/3575,http://www.guidezone.skl.com/ashes.htm,196
http://geocities.com/EnchantedForest/3575,http://www.guidezone.skl.com/ashes.htm,196
http://geocities.com/EnchantedForest/6710,http:,120
http://geocities.com/EnchantedForest/6710,http:,120
http://geocities.com/EnchantedForest/Creek/7280,http:,120
http://geocities.com/EnchantedForest/Creek/7280,http:,120

Note there’s lots of broken junk in there, such as broken links like http: with no link attached. This seemed to be a common artefact of some GeoCities guestbooks, which I’ll have to look into more.

Finally, I didn’t necessarily want all links – but rather the ones within the EnchantedForest, as well as those going to other GeoCities neighbourhoods. Later on, I will be interested in where they’re linking to outside GeoCities, but not right now.

One last regex finds lines that contain two four-digit combinations. This will also bring in a handful of non-GeoCities links with digit-heavy URLs, but not many.

grep -P '(.*/[0-9]{4}){2}' enchantedforest-entities-cleaned1.txt > enchantedforest-entities-internal.txt
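
To unpack that pattern: it requires two separate occurrences of a slash followed by four digits in the same line, which in practice means both the source and the target look like four-digit GeoCities member pages. A quick way to eyeball what it keeps:

# (.*/[0-9]{4}){2}  matches lines containing ".../NNNN" twice, i.e. a four-digit
#                   member-style number in both the source and the target URL
grep -P '(.*/[0-9]{4}){2}' enchantedforest-entities-cleaned1.txt | head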

Now we’re ready to import to Gephi!

Gephi Wrangling and Historical Discoveries!

Importing into Gephi is pretty easy. Using vim, I added a Source,Target,Weight header line above the output of enchantedforest-entities-internal.txt so that the first few lines looked like this:

Source,Target,Weight
http://www.geocities.com/EnchantedForest/Meadow/1134,http://www.geocities.com/EnchantedForest/1004,83
http://www.geocities.com/EnchantedForest/Meadow/1134,http://www.geocities.com/EnchantedForest/1004,83
http://www.geocities.com/Area51/Stargate/1357,http://www.geocities.com/Area51/EnchantedForest/4213,33
http://www.geocities.com/Area51/Stargate/1357,http://www.geocities.com/Area51/EnchantedForest/4213,33
http://www.geocities.com/Eureka/1309,http://www.geocities.com/EnchantedForest/Tower/7555,27
http://www.geocities.com/Eureka/1309,http://www.geocities.com/EnchantedForest/Tower/7555,27
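
For what it's worth, you can also prepend that header without opening the file in an editor; a quick sketch using GNU sed:

# Prepend the CSV header that Gephi expects (GNU sed, editing the file in place).
sed -i '1i Source,Target,Weight' enchantedforest-entities-internal.txt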

In Gephi, I imported the spreadsheet and began to lay out the network diagram. I won't go into the details here, but the approach was similar to the one we took with our released derivative data. Here it is, laid out with nodes sized by PageRank, so the bigger the node, the more likely somebody following links would stumble upon that site:

[Screenshot: the EnchantedForest link network laid out in Gephi, with nodes sized by PageRank.]

Note one of the prominent URLs, the EnchantedForest/Glade/3891. Let’s check it out. Here’s what it looked like at the height of GeoCities:

[Screenshot: EnchantedForest/Glade/3891 at the height of GeoCities, an awards site.]

Cool! An awards site. This is the sort of connective tissue that stitched GeoCities communities together (or so I argue in a forthcoming piece), and it's great to see it front-and-centre here. Many sites would have linked to it: an awards page was a staple of homepages in this neighbourhood, appearing on around one in five of them by my back-of-napkin calculation.

And if we flash forward, here’s what the site looked like in 2009:

[Screenshot: the same page in 2009, now a protest against Yahoo!'s closure of GeoCities.]

Interesting – a protest against Yahoo! closing it down. I can already think of several historical uses. This page alone speaks to several things:

  • The prevalence of awards pages and awards hubs within this neighbourhood;
  • A protest movement that may have emerged when Yahoo! decided to shut down the neighbourhood;
  • A starting point for following links outward: by highlighting this awards page in Gephi, we can find the pages that hosted awards in connection with it.

I could imagine a paper coming out of just this site, and plan to cite it and use it in the draft I’m currently writing.

Conclusions

In short, here’s what I’ve done in this post:

  • Taken us with warcbase from a directory of WARC files to a series of files that we could import into Gephi;
  • Plotted the links so that we could explore them ourselves;
  • Found several pages of interest based on their prominence within the GeoCities EnchantedForest;
  • Raised some research questions.

All of this is possible with open-source software that we’re developing, and I really think this shows some promising directions for research with web archives.

 
