Extracting Links from GeoCities and Throwing Them at the Wall

I have been exploring link structures as part of my play with web archives. My sense is that they’re a useful addition to the other tools that we use to analyze WARC files and other collections. The difficulty is, of course, extracting links from the various formats that we find web archives in.

What Can We Learn?

We can get a sense, at a glance, what the structure of the website was; and, more importantly, with lots of links we can get a sense of how cohesive a community was. For example, with GeoCities, I am interested in seeing what percentage of links were inbound – or to other GeoCities pages – or external, in that they went elsewhere. This can help tackle the big question of whether GeoCities neighbourhoods could be understood as ‘communities’ or were just a place to park your website for a while in the late 1990s. As Rob Warren pointed out to me, network visualizations aren’t always the ideal case in this – sometimes a simple grep can do.
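The inbound-versus-external tally described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow used for the actual analysis: the helper names and the sample URLs are hypothetical, and "inbound" is taken to mean any link whose host is geocities.com or a subdomain of it.

```python
from urllib.parse import urlparse

def is_geocities(url):
    """True if the URL's host is geocities.com or a subdomain of it (assumed definition of 'inbound')."""
    host = (urlparse(url).hostname or "").lower()
    return host == "geocities.com" or host.endswith(".geocities.com")

def inbound_share(links):
    """Fraction of links that point back into GeoCities; 0.0 for an empty list."""
    if not links:
        return 0.0
    inbound = sum(1 for link in links if is_geocities(link))
    return inbound / len(links)

# Hypothetical sample; real input would be the links extracted from the archive.
sample_links = [
    "http://www.geocities.com/Heartland/1234/",
    "http://geocities.com/Athens/5678/index.html",
    "http://www.yahoo.com/",
    "http://example.org/page.html",
]

print(f"{inbound_share(sample_links):.0%} of sample links stay inside GeoCities")
```

Run per neighbourhood, a figure like this gives a rough first measure of cohesiveness before any network visualization is drawn.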

I create link visualizations in Mathematica by adapting the code found here, which follows links to a specified depth and renders them. I then extend it using a discussion found on this StackOverflow page from 2011, which adds tooltips to edges and nodes (OK, I asked that question like three years ago).
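The idea of following links to a specified depth can be sketched outside Mathematica as well. The Python below is not the Mathematica code referred to above; it is a minimal breadth-first walk over an in-memory link map, and the function name, the `toy_links` data, and the choice to return an edge list are all assumptions made for illustration.

```python
from collections import deque

def edges_to_depth(link_map, start, max_depth):
    """Breadth-first walk from start, collecting (source, target) edges up to max_depth hops away."""
    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages that sit at the depth limit
        for target in link_map.get(page, []):
            edges.append((page, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges

# Hypothetical link map: page -> pages it links to.
toy_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "d.html": ["e.html"],
}

for source, target in edges_to_depth(toy_links, "a.html", 2):
    print(source, "->", target)
```

The resulting edge list is the sort of input a graph-drawing tool would take.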

Image File Extensions in the Wide Web Scrape

Percent of the three leading file formats in various TLDs, e.g. in China, JPGs are almost 90% of image extensions; in .gov they are 50%.

A bit of an aside post. I’ve been playing with extracting images from WARC files so I can try to play with some computer vision techniques on them. One part of extracting images was that I needed to know which file extensions to look for. While I began with the usual suspects (initially JP*Gs), after some crowdsourcing and being pointed towards Andy Jackson’s “Formats Over Time: Exploring UK Web History” I came up with a good list of extensions to extract.

These are from the 2011 Wide Web Scrape web archive.

The following regular expression, '.+\.TLD/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)' where TLD represents the top-level domain I am searching for (e.g. CA or COM), helped find the extensions that I am looking for. A script using a variation of this expression extracted the images and numbered them, preserving duplicates and identically-named files.
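To show how an expression like the one above might be put to work, here is a minimal Python sketch that counts image extensions for one top-level domain. It is not the original script: the function names and sample URLs are hypothetical, and matching case-insensitively against the whole URL (via `fullmatch`) is an assumption, made so that `.tiff` is not cut short to `.tif`.

```python
import re
from collections import Counter

# Extension list taken from the regular expression in the text.
EXTENSIONS = ["gif", "jpg", "tif", "jpeg", "tiff", "png", "jp2", "j2k",
              "bmp", "pict", "wmf", "emf", "ico", "xbm"]

def image_pattern(tld):
    """Build the pattern from the text with TLD filled in (case-insensitive by assumption)."""
    ext_group = "|".join(r"\." + ext for ext in EXTENSIONS)
    return re.compile(r".+\." + re.escape(tld) + r"/.+(" + ext_group + ")", re.IGNORECASE)

def count_extensions(urls, tld):
    """Count image extensions among URLs that match the pattern for the given TLD."""
    pattern = image_pattern(tld)
    counts = Counter()
    for url in urls:
        match = pattern.fullmatch(url)  # require the extension to end the URL
        if match:
            counts[match.group(1).lower()] += 1
    return counts

# Hypothetical sample URLs; real input would come from the web archive.
sample_urls = [
    "http://www.example.ca/images/logo.gif",
    "http://www.example.ca/photos/cat.JPG",
    "http://www.example.ca/scans/map.tiff",
    "http://www.example.com/photos/dog.jpg",
    "http://www.example.ca/index.html",
]

print(count_extensions(sample_urls, "ca"))
```

Note that the pattern only looks for ".TLD/" somewhere in the URL, so a path segment that happens to contain it would also match; for exact per-domain counts it would be safer to parse out the hostname first.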

File formats in the .ca top-level domain. JPG dominates, followed by GIF and PNG.

The findings show which image formats turned up in which top-level domains. I’ll paste them below in case they’re useful: they can help you know what to look for, and they give a sense of the web c. 2011. The next step is to analyze these images, of course.

Getting Gephi Running on OS X Mavericks

Network of Texan Correspondence, a Historian's Macroscope Example.

I’ve been using Gephi as I playtest through one of our final drafts of the Historian’s Macroscope – which has returned me to face one of my old nemeses.

Just because this has been a constant pain in the you know what, I wanted to post how I got it working in case it helps a fellow DHer. I suspect it’s a conflict with my JDK, as I know folks have trouble if they’ve been playing with MALLET and then switching over to Gephi. Anyways, it’s been constantly hanging (on the data laboratory screen), but only today did I actually force myself to fix it.

Open up /applications/gephi.app/contents/resources/gephi/etc/gephi.conf. On the command line, you can just open that file with vim, or you can navigate to it yourself and open it up with your regular text editor.

[edited to add: Jonathan Goodwin pointed out that vim is kinda hard to learn - what I would recommend would be either: (a) use your 'finder' to go to the 'Gephi.app' icon in your 'application folder,' right click and select 'Show Package Contents' and then navigate from there, and then open up 'Gephi.conf' in your TextEdit program; or (b) use the command line to navigate to it, and again, if you don't use vim just type open . and you'll get a finder window there. Good luck!]

Under the line:

# default location of JDK/JRE, can be overridden by using --jdkhome switch

replace the existing value with:


Hat-tip to the Gephi forums for this.

DH 2014 Slides and Talk: “Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource”

I gave a short paper at Digital Humanities 2014 last week. Held on the joint campus of the Université de Lausanne and the École polytechnique fédérale de Lausanne, it was a very good time. I might put a quick post up about some of my other observations, notably my visit to CERN, later on.

Here are the slides and text of my short talk. It’s a variation of the paper I also gave at the International Internet Preservation Consortium annual meeting in May. It concludes my conferencing for this summer – I have a workshop on computer vision and history during the Fall semester, and then the American Historical Association’s annual meeting in January. But until then: writing, research, and not flying in a plane.

Note: You should be able to reconstruct much of this approach from various blog posts on this site. See, for example: my WARC-Tools section, my clustering section, my NER posts, regular expression tinkering, and beyond (all my web archiving posts). Happy to chat, too. A downside of blogging so much is that many of you will have seen much of this before.

Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource

DH Abstract available here


Hi everybody. My name is Ian Milligan and I’m an assistant professor of Canadian and digital history at the University of Waterloo. In today’s short talk, I want to provide you with the approach I’ve been using to navigate large amounts of web information that uses open-source tools, provides an ad-hoc finding aid to me, and helps me make sense of the data deluge that historians are soon going to have to grapple with.

Using a case study of data that the Internet Archive released in October 2012, which is an entire scrape of the World Wide Web, I try to figure out what historians need to know to deal with this material.

Three Tools for the Web-Savvy Historian: Memento, Zotero, and WebCite

Over 200,000 citations or references to these websites exist in Google Books, and this is basically what you’ll get when you follow them. There’s no excuse for this anymore.

By Ian Milligan

“Sorry, the page you were looking for is no longer available.” In everyday web browsing, a frustration. In recreating or retracing the steps of a scholarly paper, it’s a potential nightmare. Luckily, three tools exist that users should be using to properly cite, store, and retrieve web information – before it’s too late and the material is gone!

Historians, writers, and users of the Web cite and draw on web-based material every day. Journal articles are replete with cited (and almost certainly uncited) digital material: websites, blogs, online newspapers, all pointing towards URLs. Many of these links will die. I don’t write this to be morbid, but to point out a fact. For example, if we search “http://geocities.com/” in Google Books we receive 247,000 results. Most of those are references to sites hosted on GeoCities that are now dead. If you follow those links, you’ll get the error that the “GeoCities web site you were trying to reach is no longer available.”

What can we do? We can use three tools: Memento to retrieve archived web pages from multiple sources, WebCite to properly cite and store archived material, and Zotero to create your own personal database of archived snapshots. Let’s look at them all in turn.

ACA 2014 Presentation: “The Great WARC Adventure: WARCs from Creation to Use”

Click through for our slide deck.

I’ve been having a great time here in Victoria, BC at the Association of Canadian Archivists’ annual meeting. As a historian, it’s been great learning from archivists: I’ve got a growing file full of journal articles to read, debates to brush up on, and lots of specific tidbits about how born-digital records fit within the archival context.

My contribution to the conference was a paper that I co-presented with Nick Ruest, the Digital Assets Librarian at York University. Our paper, “The Great WARC Adventure: WARCs from creation to use,” focused on the specific case study of the #freedaleaskey collection, a web archive comprised of daily crawls of blogs, websites, and discussion boards pertaining to the Dale Askey libel case.

Our slides are available here, via the York University institutional repository. We’re planning to do some more work on this project, and the slides may not be completely clear without our commentary. The key take home was that these kinds of web archives are pretty cheap and easy to make (this was created using wget), and the daily frequency of the web crawls can allow historians to do some fun longitudinal distant reading. Stay tuned for more on this over the next few months.

Testing Cohesiveness in GeoCities Neighbourhoods by Extracting and Plotting Locations

The 'heartland' of GeoCities?

This weekend, I went back to my old GeoCities archive to play around with the methods I experimented with in my last post on the Wide Web Scrape. One question that I’ve been curious about was whether GeoCities was a community (drawing on an old debate that raged in the 1990s and beyond about virtual communities), and how we could see that: in volunteer networks, neighbourhood volunteers, web rings, guestbooks, links to each other, etc. As with all of my blog posts, this is just me jotting a few things down (the lab notebook model), not a fully thought-out peer reviewed piece. But keep reading for the pictures and discussion. :)

GeoCities and Neighbourhoods: An Extremely Short Introduction

Welcome to the Neighbourhood! (December 20th, 1996) - click through for the Wayback Machine version of this page.

Before its 1999 acquisition by Yahoo!, GeoCities was arranged into a set of neighbourhoods. It was presented as a “cityscape,” with streets, street numbers, and recognizable urban and geographical landmarks (one might live on a virtual Fifth Avenue or on a festive Bourbon Street), and this was strongly emphasized in press releases and user communications. The central metaphor that governed the admission of new users into GeoCities was that of homesteading. This was a conscious choice, in keeping with the spirit of the frontier so common during the early days of the Web (harkening to its communalist roots), as it captured the then-common heady expansionary rhetoric. Users would need new homes, and these new homes would be located in neighbourhoods. This sat well with the visions of GeoCities founders David Bohnett and John Rezner (who joined in August 1995), who saw “[n]eighbourhoods, and the people that live in them, [as providing] the foundation of community.” When selecting sites, users were presented with a list of various places where their site might belong. Those writing about “[e]ducation, literature, poetry, philosophy” would be encouraged to incorporate their site into Athens; political wonks into CapitolHill; small businesspeople or those working from home into Eureka, and beyond. Some neighbourhoods came with restrictions and guidance, such as the more protective and censored EnchantedForest for children. Others were much wider in scope, such as “Heartland,” focusing on “families, pets, hometown values.”

I could talk your ear off (that paragraph is a compression of several pages I’ve written) but you should get the broad picture by now.