Here is the rough text of what I’ll be presenting at the CHA. I tend to ad-lib a bit, but this should give you a sense of what I’m presenting on. It’s a twenty minute timeslot to a general audience. As I noted earlier, there is a full-text paper that drives the presentation. If you want it, drop me a line.
Hi everybody and thank you for coming to my talk, “The Internet Archive and Social Historians: Challenge and Potential Amidst the WebARChive files.” I want to begin with something that I think we’ll all find familiar.
Archival boxes come in different shapes and sizes, but they are a familiar sight to historians. Generations of thought and design have gone into these boxes: they are specifically designed to protect documents over long periods, to reduce acidity, and to withstand considerable physical wear and tear. No fewer than six International Organization for Standardization (ISO) specifications govern the creation and maintenance of physical archives. Historians played a large role in the establishment of the archival profession, a voice that has been supplanted in recent years by the rise of library and information schools.
Historians need to understand the implications of the arrival of new archival boxes: web archives. These necessitate a rethinking of how we approach our professional standards and training, with particular implications for historians studying topics from the 1980s onwards. In this presentation, I have two main objectives: first, to introduce historians to the main issues of web archives, with an eye to incorporating them into our professional and pedagogical practices; and second, to argue that we need to look at workflows that can help us open our own next-generation archival boxes. There is a literature on this, but it has largely been an internal conversation among archivists and information scientists. I want to make sure historians are at the table and part of the scholarly conversation.
Historians are already able to reap the fruits of forward-thinking digital preservationists and archivists. To see why, we need to go back to the mid-1990s. It was then that we began to worry about our digital heritage: digital records were multiplying, images and text were beginning to be lumped together into single files, and physical storage was evolving as floppies gave way to CD-ROMs, each medium with differing standards of longevity and accessibility.
Enter Brewster Kahle, an Internet entrepreneur and visionary. He had a simple goal: to download and preserve the entire publicly accessible World Wide Web. He founded the Internet Archive in 1996, and its team programmed web crawlers: automated software programs that go out and download content. It is a potentially infinite process, depending on how the Web develops. A crawler visits a site, downloads it, and then follows each link on the page; at each page it visits, it downloads that page, follows its links, and so forth. As the World Wide Web continually grows and changes, a crawler could potentially archive forever.
In October 2001, the WaybackMachine launched (the first website demonstrated was a Clinton-era press release about airline security), and a new era of historical research was upon us.
But what does this mean for historians? Web archives present considerable opportunity and challenge to historians due to their sheer size, the particular technical challenges posed by the dynamic and interconnected nature of the World Wide Web, and the ethical dimensions that may arise.
Kryder’s Law has made this possible. Building off the more popularly understood Moore’s Law, which foresaw a doubling of transistor density on microchips every two years, Kryder’s Law holds that storage density will double approximately every eleven months. In 2011 alone, we created 1.8 zettabytes of information, or 1.8 trillion gigabytes.
While much of this data is being produced in the form of private sector databases, automated logs, and proprietary and walled-off parts of the web such as Facebook, if even a fraction of this data store is made available to historians in the future it will represent an unparalleled resource. Consider just two examples: YouTube sees 72 hours of video uploaded every single minute, and Twitter sees roughly 200 million tweets per day. This is an archive that we can never fully grasp, as it continues to grow without end. It is different every single day.
Size comes with problems, though, because it can be somewhat illusory. Websites are archived at differing intervals, depending on a site’s significance and traffic. Consider the following example, which illustrates the frequency of Globeandmail.com scrapes by the Internet Archive.
There are two major spikes. The first follows the events of 9/11. Demonstrating foresight, the Internet Archive preserved a considerable digital footprint of the events of the day: we can now trace how a web visitor might have experienced some of the events and the tumultuous weeks that followed. As the web was then becoming the primary delivery mechanism for news, this is now an unparalleled way to explore how stories developed, how misinformation arose and was subsequently corrected, and overall provides insight into how people experienced 9/11.
But it is critical to note, through this example, that we are not preserving a complete, unfettered record of the Internet’s past. There is a lot of loss. This will pose methodological problems for future historians, just as contemporary academics struggle with accessing early television broadcasts. News media is increasingly consumed online, breaking news is experienced there, and stories change throughout the day. Historians will still have to largely rely on microfilmed or digitized archived print versions of newspapers. As historians, we should be pressing for better institutional archiving. Moreover, we will have to rethink how we approach histories of the period when we cannot approach news media as it was consumed.
It is worth pausing to briefly consider some of the unique technical challenges at play. Consider the website ActiveHistory.ca, a popular Canadian historical group blog. It is a simple website, approximately four years old. If you were to download it to host on your own computer, you would retrieve almost two gigabytes of information spread across 18,793 files. Imagine the difficulties in archiving this material: from storage, to preservation, and ultimately to making it usable for historians.
To use an archival metaphor, there is no simple sheet entitled “ActiveHistory.ca”. The site is instead constructed from pieces all over the World Wide Web: a complex, interconnected, ever-living and changing document. If one archives only a single page, any external content or images it relies upon must be archived as well, or a long-term, accurate representation has not been preserved.
Finally, web archives bring with them new ethical dimensions that have not yet been fully explored. It is largely uncharted territory. With traditional archives, donor agreements ideally lay out restrictions, if any, on the use of material. Oral historians have a whole host of literature and regulations to draw on.
How should we approach websites? We do not have a similar body of practice. Roy Rosenzweig laid out an instructive example of the transitory nature of the World Wide Web in a 2003 American Historical Review article. He used the “Bert is Evil” website, an early Internet meme which saw the Sesame Street character Bert posed with nefarious individuals such as Adolf Hitler and Osama Bin Laden, as a key example of the issues facing historians. After 9/11 and the beginning of the global War on Terror, a print shop manager in Dhaka, Bangladesh used images of Osama Bin Laden taken from the web.
Bert was tucked away in the corner of one of those images. After it showed up in American broadcasts connected to Bin Laden, legal threats followed, and Bert is Evil’s owner, Dino Ignacio, decided to pull the site down on 11 October 2001. As Rosenzweig explains, this is a worrying precedent: “If Ignacio had published his satire in a book or magazine, it would sit on thousands of library shelves rather than having a more fugitive existence as magnetic impulses on a web server.” People publish things in newspapers all the time that they might later regret.
It also raises ethical questions. Despite Ignacio’s statement imploring his fans and mirrors (people who copied the site to keep providing it) to stop sharing Bert is Evil, the Internet Archive preserved the site. Do we, as historians, have an ethical obligation to respect the wishes of those who upload websites? Or does a website constitute published material, in which case the author has few claims against the fair use or fair dealing rights of a researcher? This is relatively uncharted territory for historians.
The Internet Archive grappled with these questions at its inception. One of its initial proposals was to handle web information “like census data — aggregate information is made public, but specific information about individuals is kept confidential,” but by 2001 all information was made available. While site owners can opt out of the Internet Archive (both currently and retroactively) by modifying the robots.txt file on their web server, by default websites are included.
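As an illustration of the mechanism (not a recommendation), the conventional opt-out is a robots.txt directive aimed at the Internet Archive’s crawler, which identifies itself as ia_archiver:

```
User-agent: ia_archiver
Disallow: /
```

Placing this two-line file at the root of a web server tells the crawler to exclude the entire site, and previously captured pages are suppressed as well.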
There are no easy answers. Should Internet comments or discussions be fair game, akin to published material or letters to the editor? Are submissions to online discussion boards to be accessed and viewed unfettered? What about a high school student’s website? Analog equivalents to these digital examples are not always straightforward: letters to government, or submissions, are occasionally restricted when it comes to archival access. A key dimension is the scale of study: aggregate analysis (textual analysis of hundreds or thousands of blogs or tweets, for example) is almost certainly permissible, whereas closer study of identifiable individuals may require a risk analysis.
I just want to make sure that historians get in on the ground floor here.
So, some brief technical notes in the short amount of time we have left here.
The Internet Archive’s WaybackMachine is, and will most likely continue to be, the most common way for historians to access web archives. As the WaybackMachine does not offer full-text search, the user needs to know the right Uniform Resource Locator (URL), such as http://google.com. This can be tricky, as historical URLs are not always readily apparent today.
Luckily, we have a few options as historians. First, as I have done in several other research projects, we can combine traditional archival research with the WaybackMachine. Find a URL in an archival document or historical newspaper, plug it into the WaybackMachine, and the content is found. Second, before the advent of dynamic search engines like Google, users relied on web directories. These have largely been preserved. A user can go to Yahoo.com as of 1997, and use the directory of listed websites to find relevant sources.
Using the system can require more complicated workflows, and in the next few paragraphs I want to provide an example of how it can be used. Early Internet forums are invaluable resources, representing large amounts of non-commercialized public speech. In 1998, the Canadian Radio-television and Telecommunications Commission (CRTC), the regulatory body encompassing radio, telephone, and television, was considering whether New Media would fall under its purview. It accordingly held a series of public consultations across the country. The CRTC needed people to know where the website was, and thus placed advertisements in newspapers containing links to the websites.
With the forum pages found, we can use a workflow that helps harness the best of born-digital resources within the WaybackMachine without the attendant drawbacks. Consider the website below:
Drawing on free, open-source tools such as those taught by the Programming Historian 2, I was able to download every single discussion page and save them as plain text. Automated downloading is an increasingly important part of the historian’s toolkit in the web age, as it facilitates this sort of work. In short, an automated script in this case will go to each forum page, gather the links to each individual post, and download them. It is important to incorporate pauses and limits on how quickly you download information, to conserve Internet Archive resources.
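To give a sense of what such a script looks like, here is a minimal Python sketch of the download-and-convert step. The URLs, filenames, and five-second pause are placeholders of my own, not the actual CRTC forum addresses:

```python
import time
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, discarding the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

def to_plain_text(html):
    """Strip HTML markup, leaving plain text suitable for analysis."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

def polite_download(urls, delay_seconds=5):
    """Fetch each archived page in turn, saving it as plain text and
    pausing between requests to conserve Internet Archive resources."""
    for i, url in enumerate(urls):
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        with open(f"post-{i:04d}.txt", "w", encoding="utf-8") as f:
            f.write(to_plain_text(html))
        time.sleep(delay_seconds)  # the pause that keeps the crawl polite

# Hypothetical usage, with a made-up Wayback Machine capture URL:
# polite_download([
#     "https://web.archive.org/web/19990101000000/http://example.org/forum/post1.html",
# ])
```

The pause is the important part: without it, a loop like this hammers the archive’s servers as fast as your connection allows.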
I then had several hundred files, each consisting of an individual discussion thread. I was interested in exploring the degree to which the CRTC hearings contributed to a Canadian moral panic around cyberporn and the abuse of children online. Plain text files provide options for researchers: they can be read with a word processor, akin to traditional archival research; they can be scanned by search programs, from built-in operating system tools to specialized information management systems like DEVONthink; or they can be explored with specialized humanities textual analysis tools. Electing to take the last course, I combined all of the individual files into one large plain-text file and loaded it into the web-based Voyant-Tools.com site.
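The consolidation step I describe can be as simple as the following sketch (the filename pattern is an assumption of mine):

```python
import glob

def combine_plain_text(pattern, out_path):
    """Concatenate every file matching the pattern into one corpus file,
    separating documents with a blank line so they stay distinguishable."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                out.write(f.read())
            out.write("\n\n")  # document separator

# combine_plain_text("post-*.txt", "corpus.txt")
```

The resulting single file is what gets uploaded to a tool like Voyant for word-frequency and concordance work.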
But let’s return to the first example I used: the original archival box.
Digging into the WARC files themselves offers advantages beyond the publicly accessible WaybackMachine. First, they can be created by anybody. Returning to our ActiveHistory.ca example, if we wanted to create a comprehensive web archive of that entire website, we could follow the Programming Historian 2 lesson to install wget and execute one command: wget http://activehistory.ca/ --mirror --warc-file="ah".
Yet we do not have to create our own. At the Internet Archive, there are many such WARC files available, such as a just-in-time grab of the now-defunct Montreal Mirror community paper (with content between 1997 and 2010). Making this work all the more necessary, the Internet Archive released an entire web crawl of the 2011 World Wide Web in WARC format in 2012 to celebrate the attainment of the ten petabyte mark. Eighty terabytes are available for researchers: a copy of everything they could find. This will be a tremendous future resource for historians, and we need to begin planning for its utility now.
Let me walk through one program I have created. It takes a WARC file and, with one command, generates an index, a finding aid, and a searchable index, letting us see what the file contains. <Here I may ad-lib a bit>
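The indexing idea can be illustrated with a minimal, standard-library-only Python sketch. This is not my actual program, only a toy that reads the WARC record headers and lists each record’s type, target URI, and payload size; serious work should use a dedicated WARC library:

```python
def index_warc(path):
    """Build a simple finding aid from a WARC file: a list of
    (record type, target URI, payload length) tuples.

    Each WARC record opens with a 'WARC/…' version line, followed by
    name: value headers, a blank line, and a payload of Content-Length
    bytes. We read the headers and skip over the payloads."""
    records = []
    with open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank padding between records
            headers = {}
            while True:
                hline = f.readline().strip()
                if not hline:
                    break  # blank line ends the header block
                name, _, value = hline.partition(b":")
                headers[name.strip().lower()] = value.strip()
            length = int(headers.get(b"content-length", b"0"))
            f.seek(length, 1)  # skip the record payload
            records.append((
                headers.get(b"warc-type", b"?").decode(),
                headers.get(b"warc-target-uri", b"").decode(),
                length,
            ))
    return records
```

Run against a WARC file, this yields a line per archived resource: in effect, a machine-generated finding aid for the box.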
But lest I seem utopian, let me put on my activist/doomsday hat.
The dangers of digital loss, expressed in the early 1990s, are still with us.
Geocities, a popular web service that allowed users with little technical expertise to create their own websites, opened in 1994. Over the subsequent fifteen years, over 38 million webpages were created there: an astounding and irreplaceable collection of early Internet history, and most likely an unparalleled collection of individual, non-commercialized expression. Yahoo!, the Internet search giant, acquired Geocities in 1999, when it was the third-largest website in the world.
On 26 October 2009, Yahoo! “succeeded in destroying the most amount of history in the shortest amount of time, certainly on purpose, in known memory. Millions of files, user accounts, all gone.” (Archive Team) Its closure was quick and sudden: announced in April, the site was shuttered in October.
Historians need to play a leading role in arguing for the preservation of our past, irrespective of whether it is in digital or analog format.
There is work to be done. New web archives will necessitate a rethinking of the historian’s craft, and it is my belief that we need to move sooner rather than later on this front. The 1980s and 1990s, if past practice holds true, will become the target of historical inquiry soon. There are challenges on the road ahead, but opportunities too. We should look forward with optimism.