Extracting Links from GeoCities and Throwing Them at the Wall

I have been exploring link structures as part of my play with web archives. My sense is that they’re a useful addition to the other tools that we use to analyze WARC files and other collections. The difficulty is, of course, extracting links from the various formats that we find web archives in.

What Can We Learn?

We can get a sense, at a glance, of what the structure of a website was; and, more importantly, with lots of links we can get a sense of how cohesive a community was. For example, with GeoCities, I am interested in seeing what percentage of links were inbound – that is, to other GeoCities pages – versus external, in that they went elsewhere. This can help tackle the big question of whether GeoCities neighbourhoods could be understood as ‘communities’ or were just a place to park your website for a while in the late 1990s. As Rob Warren pointed out to me, network visualizations aren’t always the ideal tool for this – sometimes a simple grep will do.

I create link visualizations in Mathematica by adapting the code found here. It follows links to a specified depth and renders them. I then extend the code using a discussion found on this StackOverflow page from 2011, which adds tooltips to edges and nodes (OK, I asked that question like three years ago).

In any case, here are two visualizations by way of illustration. Below is a modern WordPress site, with labels added in. We can see the interconnected nature of the web – links to everywhere, short hops, etc.

Visualizing the link structure of my WordPress site, manually annotated (using tooltips).

And here’s a GeoCities page taken from the Internet Archive. We see a relatively flat hierarchy. You’d have to use your ‘back’ button a lot once you were in a site, because there weren’t necessarily shortcuts. This seems relatively consistent with what I’ve seen on other older webpages.

Structure of one GeoCities page.

Extracting Links from the Archive Team Dump

If you want the code, just scroll down to the bottom of this post and it’s all there.

If you have WARC files, there is a neat tool created by Jimmy Lin at the University of Maryland: warcbase. Unfortunately, I haven’t been able to get it up and running on my Mac yet (luckily, I’ve got a stellar research assistant who’s going to tackle that for me this fall). For more on warcbase, this presentation is extremely useful.

Web archives aren’t always in WARC format. The Archive Team end-of-life crawl, for example, available at the Internet Archive here, simply reproduces the directory structure of HTML and content files on your computer. But this lets us use some other tools.

When decompressed, the files sit in a directory structure like so on my computer:

/volumes/lacie/geocities-fullscrape/geocities/www.geocities.com/heartland/7792

I navigated to the /volumes/lacie/geocities-fullscrape/geocities directory and ran an egrep command. You have to pipe through find and xargs given the quantity of files – passing them all to egrep at once would overflow the shell’s argument list.

find . -name '*.htm*' | xargs egrep -o "(mailto|ftp|http(s)?://){1}[^'\"]+" >> geocities-links.txt

This actually ran in about the time it took me to go grab a coffee from our lunch room and have a stretch. The resulting file contained 2,203,804 links. Luckily, they were arranged like so:

http://www.geocities.com/Heartland/Fields/9393/Josh9.html:http://geo.yahoo.com/serv?s=76001068&t=1256472712&f=us-w7
http://www.geocities.com/Heartland/Fields/9393/map.html:http://www.geocities.com/Heartland/Fields/9393/Josh1.html

This is why I ran it in that geocities-fullscrape/geocities directory. On the left side of the : we have the origin file; on the right side of the :, the destination link. Starting to look like a set of link relations, right?

I loaded it into Mathematica and did a few quick transformations. First, I changed :http to a unique break string, @@@@@@@. This lets us split each line using one command:

rulelist = StringSplit[#, "@@@@@@@"] & /@ newbreaks;

The next step was to make it so that the left side of each line had a rule (->) pointing to the right side. Initially, I tried doing this with a sloppy Do loop and AppendTo, but that would have taken ~5-6 hours. Instead, I finally learned how to Reap and Sow, and it did the job in about 5 seconds. Thanks, “Ten Tips for Writing Fast Mathematica Code”! All code is below.
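As an aside, the loop can probably be skipped entirely. Here’s a minimal sketch, assuming every line split cleanly into exactly two pieces (which isn’t guaranteed, hence the Select):

(* A loop-free alternative: keep lines that split into exactly two pieces,
   then turn each origin/destination pair straight into a rule. *)
pairs = Select[rulelist, Length[#] == 2 &];
linklist = Rule @@@ pairs;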

In any case, I now had a set of Mathematica rules (this is critical, they had to be ‘rules’ and not strings) like:

http://www.geocities.com/Heartland/Fields/9393/map.html -> http://www.geocities.com/Heartland/Fields/9393/Josh1.html

Now what can we do with them?
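One quick answer, before the pictures: the inbound-versus-external question from the top of the post doesn’t need a network diagram at all. Here’s a rough sketch – treating any destination that contains “geocities.com” as inbound is a rough heuristic, not a hard rule:

(* Rough inbound-vs-external tally: a destination containing "geocities.com"
   counts as an inbound link here. *)
destinations = Last /@ linklist;
inbound = Select[destinations, ! StringFreeQ[#, "geocities.com"] &];
N[Length[inbound]/Length[destinations]]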

Exploring the Links

First step was to dump all 2,203,804 links into a network visualization, blow through my 64GB of RAM, and forcibly reboot my computer. deep breath

Second step was to take it a bit more slowly. Here are some initial findings.

First, a selection of 4,000 links.

4,000 links

It’s interesting. We see the ‘fan’ structure of individual sites around the periphery. In the center we have big hubs that lots of things linked to: one is a page about downloading a Flash viewer, others are popular clipart images that were obviously linked to by many people.
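Rather than squinting at the plot to identify those hubs, you can tally the destination URLs in the sample directly. A minimal sketch (the variable names are just placeholders):

(* Tally destination URLs in the 4,000-link sample and list the ten most linked-to pages *)
sample = Take[linklist, 4000];
hubs = Reverse[SortBy[Tally[Last /@ sample], Last]];
TableForm[Take[hubs, 10]]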

Second, here’s ten thousand links. Again, we see the fan structure of individual sites, as well as a neat exploding death star motif.

10,000 links

Third, here’s 20,000 links. Again, the fan structure, although it’s getting hard to view. We see some massive sites, like in the lower right, where one site accounts for hundreds of pages.

20,000 links

It Works! So What?

Next step is to start doing regular expressions to pull out neighbourhoods (i.e. Heartland, Athens, etc.), and then do individual network diagrams of them along those lines. Did they look different? Part of this is for an article/chapter that I am writing, so stay tuned. Anyways, as the title indicated, this was just throwing stuff at the wall. I think some of it will stick for my later work.
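As a preview, here’s roughly what that might look like – a simple substring test standing in for the real regular expressions, which will need to cope with vanity URLs and other quirks:

(* Keep only rules whose origin page sits in the Heartland neighbourhood, then graph
   that subset on its own. The "heartland" substring test is a placeholder, not the final regex. *)
heartland = Select[linklist, ! StringFreeQ[First[#], "heartland", IgnoreCase -> True] &];
Graph[heartland, VertexShapeFunction -> "Point", ImageSize -> 1000]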

Code

  1. Start by running on the command line: find . -name '*.htm*' | xargs egrep -o "(mailto|ftp|http(s)?://){1}[^'\"]+" >> geocities-links.txt

  2. Launch Mathematica. Then:


links = Import["/users/ianmilligan1/dropbox/geocities-link-dump.txt",
"Lines"];
newbreaks = StringReplace[links, ":http" -> "@@@@@@@http"];
rulelist = StringSplit[#, "@@@@@@@"] & /@ newbreaks;
data = Reap[
Do[Sow[rulelist[[x]][[1]] -> rulelist[[x]][[2]]], {x, 1,
2204519}]];
linklist = data[[2]][[1]];
webcrawler[rooturl_, depth_] :=
Flatten[Rest[
NestList[
Union[Flatten[
Thread[# -> Import[#, "Hyperlinks"]] & /@
Last /@ #]] &, {"" -> rooturl}, depth]]];

(* Styling for the dark, point-based renderings *)
style = {VertexStyle -> White, VertexShapeFunction -> "Point",
EdgeStyle -> Directive[Opacity[.5], Hue[.15, .5, .8]],
Background -> Black, EdgeShapeFunction -> (Line[#1] &),
ImageSize -> 500};

(* Render the first 20,000 link rules, with tooltips on both edges and vertices *)
Graph[Take[linklist, 20000], EdgeLabels -> Placed["Name", Tooltip],
EdgeShapeFunction -> "Line", VertexLabels -> Placed["Name", Tooltip],
EdgeStyle -> {Orange}, VertexSize -> {"Scaled", 0.007},
ImageSize -> 4000]

You can change the second argument to Take in the last command to whatever you want; 20000, in that case, takes the first 20,000 links.
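And, for completeness, here’s roughly how the webcrawler function and the style list come together for the live-site visualizations near the top of the post – the URL and the depth of 2 below are just placeholders:

(* Crawl a live site two levels deep and render it with the style defined above.
   "http://example.org" is a placeholder URL. *)
edges = webcrawler["http://example.org", 2];
Graph[edges, style, VertexLabels -> Placed["Name", Tooltip]]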
