I had the great pleasure to be a speaker at the Ethics and Archiving the Web conference at the New Museum in New York City. My own contribution to the conference was a piece on the “Ethics of Studying GeoCities.”
The livestream of the whole conference is available here.
Hi everybody and thanks so much for coming to my talk today. What I want to do is discuss the “ethics of studying GeoCities,” which to me gets at both the potential but also the risks of doing a lot of this web archival research.
The utopian potential of web archives for me can be summed up in something as innocuous as a personal homepage, hosted on the free GeoCities.com service. GeoCities.com, founded in 1994, provided free websites to anybody who wanted to create one. A user would visit GeoCities.com, enter their e-mail address, and receive a free megabyte to stake their own space on the burgeoning Information Superhighway. These sites took many shapes and sizes: a Buffy the Vampire Slayer fan site, a celebration of a favourite sports team, a family tree, a diary of an LGBT community member’s experiences, even a young child’s tribute to Winnie the Pooh. Early Web users flocked.
For most people, the Wayback Machine provides privacy by obscurity. Keyword searching only hits the home pages – i.e. just GeoCities.com rather than the individual pages within the site – and to find content, you’re limited to the link-by-link explorations of a user from 1997.
But if you can get the underlying WARC files, or access the Archive Team torrent, it is all there. Rather than going page by page, you now suddenly have hundreds of millions of files to access in all their glory.
In any large web archive, given the sheer amount of data that becomes accessible to researchers, much of the onus is going to have to fall on the scholar themselves to navigate this material. University IRBs often consider web archives “publications,” and in any case, the scale brings major problems.
We know this from Twitter too – it is legal to quote from tweets, blogs, or websites, but that does not necessarily make it ethical to pop them into your New York Times article or onto the front page of your highly-trafficked academic blog (hat tip to Aaron Bady on this point).
We often do not have the real-world name of a person: e-mail addresses are mostly defunct after fifteen years of disuse, and online aliases are generally not people’s real-world names, although if I had more time, we could talk about how that’s really changed since GeoCities, when people had a different vision of privacy. While we can occasionally track down the identity of an author – through pathways such as Googling a GeoCities handle and finding a LiveJournal account, and from there a Twitter account and a real name – that feels extremely invasive and would require further ethical clearance. In any case, it does not scale to the millions of sources web archives contain.
I feel similarly uncomfortable with leaving the voices of everyday people completely outside the historical record when there is ample opportunity to include them. Moving to a full opt-in process would likely lead to the historical record being dominated by corporations, celebrities and other powerful people, tech males, and those who wanted their public face and history to be seen a particular way.
First, we can use “distant reading” to zoom our gaze away from the individual websites and to look for larger patterns within an archive. That can mean looking at link patterns to ascertain PageRank, exploring topics through topic modelling, or looking at thousands of images instead of tens.
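To make the link-pattern idea concrete, here is a minimal sketch of PageRank computed by simple power iteration over a toy link graph. The page names, the damping factor of 0.85, and the iteration count are all assumptions for illustration; nothing here is drawn from the actual GeoCities data, and a real archive would call for a proper graph-analysis toolkit.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to 1.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a community "hub" page, two fan sites, a guestbook.
links = {
    "hub": ["fan_a", "fan_b"],
    "fan_a": ["hub"],
    "fan_b": ["hub"],
    "guestbook": ["hub"],
}

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the heavily linked hub comes out on top, while the guestbook, which nothing links to, sits at the bottom. The point of working at this level is that the analysis concerns the shape of the network rather than the content of any one person's page.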
None of this by any means eliminates all ethical concerns – think of what the NSA has done with similar kinds of data – but it does mitigate them to some degree. Individuals have their privacy protected, and others cannot find their sites without having the same level of access to the archive. People are obscured, but they are still read into the historical record.
They also speak to a valuable research method, as of course, no matter how diligent or comprehensive a researcher can be, they are not going to be able to read every GeoCities page.
Ultimately and imperfectly, the onus will thus need to fall on individual researchers to carry out a risk assessment. Did the author of the individual website they are citing or using as evidence have a reasonable expectation of privacy? If they were a GeoCities community leader, with inbound links from hundreds of other sites and featured on directories, probably not. If they were posting a heartfelt story on a friend’s GeoCities guestbook, part of a seemingly closed social network of a few high school chums, they probably did.
This means that when engaging with individual sites, the central metric should be “expectation of privacy.”
We have power because we can access the blogs, ruminations, and personal moments of literally millions of people that would never before have been accessed – but we need to use this power responsibly.