SSHRC Proposal

Map of the Internet.

A Map of the Internet, created by Matt Britt in November 2006. One of Wikipedia’s Featured Images.

An Infinite Archive? Developing HistoryCrawler to Explore the Internet Archive as a Historical Resource. 

Proposed Research: Objectives

This project will work with a groundbreaking new data resource, an Internet Archive sweep, to explore how historians can use and develop new digital tools to carry out historical research on previously inconceivable amounts of information. The project has four key goals: first, to develop a case study generating a social history of 2011 as a proof-of-concept; second, to develop an innovative tool, HistoryCrawler, that historians (and eventually the general public) can quickly deploy to create “on-the-fly” finding aids for use with born-digital material; third, to advance the disciplinary conversation about the nature of the Internet Archive (as well as big data more generally) and the new research questions it may enable historians to ask and answer; and fourth, to train highly qualified personnel to tackle the emerging problem of data management in the Internet age. By addressing a portion of the Internet Archive’s collections, this project will open up further opportunities for collaborative work among humanities researchers, computer scientists, and other stakeholders interested in synthesizing digital information.

Proposed Research: Context

Situating “HistoryCrawler” into the Scholarly Context

Historians are unprepared to engage with the quantity of digital sources that will fundamentally transform their trade. They have been conspicuously underrepresented in large digital projects, such as the Harvard Cultural Observatory’s Culturomics project, ceding disciplinary ground to computer scientists, evolutionary biologists, and English literature scholars. Historians, however, bring an important perspective to the table: professional training in historiography, expertise in evaluating primary sources, experience in balancing various scales of time, and an awareness of decades of professional development in social, cultural, and political histories. If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital turn is necessary.

Due to the astounding growth of digital sources – “90% of the data in the world today has been created in the last two years alone” (IBM, 2012) – and an increasing technical ability to process them on a mass scale, historians must begin to develop new tools and disciplinary norms. These sources present both a boon and a challenge to historians. If the norm until the digital era was to have human information vanish, “[n]ow expectations have inverted. Everything may be recorded and preserved, at least potentially.” (Gleick, 2011) Useful historical information is being preserved at a rate that is accelerating with every passing day. The historian’s craft has already been transformed to some degree by new digital methods, from databases to word processing and Internet searches. But, as a December 2012 report on historians’ research practices found, this transformation has been slow. The report found that “[as] only a comparatively small share of the primary sources required by historians has been made available digitally, [this has tempered] the opportunity for new methods to take hold.” (Ithaka S + R, 2012)

While absent from some of the larger initiatives, historians nevertheless have been and are involved in several critical digital projects, resulting in fruitful research outcomes. Several projects in England have harnessed the wave of interest in big data to tell compelling historical stories about the country’s past: the Criminal Intent project, for example, used the proceedings of the Old Bailey criminal courts to visualize large quantities of historical information. The project was funded by the Digging into Data Challenge, which brings together multinational research teams and funding from SSHRC and equivalent federal-level funding agencies in the United States, the United Kingdom, and beyond. Several other historical studies have been funded under its auspices: from studies of commodity trade in the 18th- and 19th-century Atlantic World to data mining the degrees of economic opportunity and spatial mobility in Britain, Canada, and the United States. Yet for many more historians, the technical and logistical challenges presented by these new large-scale digital sources are profound.

An important source for historians is the Internet Archive. This archive began a web archiving project in 1996 on an indiscriminate basis (Kahle, 1997): it crawls the World Wide Web and takes snapshots of what it finds, preserving them in a standardized file format known as a WARC file. These files are currently only accessible to those with specialized technical knowledge and fluency with a command-line interface such as DOS or bash. WARC files may be the archival boxes of the future, but they are currently locked away from end users and relatively unknown. This proposed research will rectify that problem through a case study of Canadian WARC files held at the Archive.
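
To give a sense of what working with such a file involves, the following is a minimal sketch in Python (an illustration only, not the Mathematica code proposed for HistoryCrawler). It assumes an uncompressed WARC stream in which each record consists of a version line, named header fields, a blank line, and a content block whose size is given by the Content-Length field. The sample records, the `make_record` helper, and the example URLs are hypothetical and omit fields that real records carry; archives are also commonly distributed in compressed form, which this sketch does not handle.

```python
import io


def make_record(record_type, target_uri, payload):
    """Build one simplified, hypothetical WARC record as bytes."""
    headers = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        "WARC-Date: 2011-03-09T00:00:00Z\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record is followed by two CRLF pairs.
    return headers + payload + b"\r\n\r\n"


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            header_line = stream.readline()
            if header_line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = header_line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hypothetical sample: two captured pages from made-up .ca sites.
SAMPLE = make_record(
    "response",
    "http://museum.example.ca/",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hello</html>",
) + make_record(
    "response",
    "http://news.example.ca/story",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>News</html>",
)

for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Date"], headers["WARC-Target-URI"], len(payload))
```

Even this simplified reader shows why the format is a barrier for most historians: the content of interest sits inside nested protocol headers that must be stripped away before any reading, close or distant, can begin.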

To put the size of this collection into context, comparative figures help. If the American Library of Congress, the largest print repository in the world, represented an older metric of “big data,” the Internet Archive must now be understood as the largest archive in the world, ever. If we were to go into the Library of Congress and digitize each book (using an average of eight megabytes per book), we would have a dataset of 200 terabytes. In contrast, the Internet Archive now has ten petabytes of information saved (Internet Archive, 2012). Consider the graph at right: it is not a mistake that the Library of Congress is barely visible. The Internet Archive is 5,020% bigger. Are historians ready?
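
The comparison above can be checked with a few lines of arithmetic. The Python sketch below uses only the figures given in the paragraph (200 terabytes, ten petabytes, eight megabytes per book). It assumes binary units (1 petabyte = 1,024 terabytes), an assumption that is not stated in the text but that appears to be what produces the 5,020% figure; with decimal units the result would be 4,900%.

```python
# Figures taken from the paragraph above.
LOC_TERABYTES = 200       # estimated size of a digitized Library of Congress
IA_PETABYTES = 10         # Internet Archive holdings reported in 2012
MEGABYTES_PER_BOOK = 8    # average size assumed per digitized book

# Assumption: binary units (1 PB = 1,024 TB; 1 TB = 1,024 * 1,024 MB).
TB_PER_PB = 1024
MB_PER_TB = 1024 * 1024


def size_comparison():
    """Return (implied_books, times_as_large, percent_bigger)."""
    ia_terabytes = IA_PETABYTES * TB_PER_PB
    # Number of eight-megabyte books that would fill 200 terabytes.
    implied_books = LOC_TERABYTES * MB_PER_TB // MEGABYTES_PER_BOOK
    times_as_large = ia_terabytes / LOC_TERABYTES
    # "Percent bigger" is the difference relative to the smaller collection.
    percent_bigger = (ia_terabytes - LOC_TERABYTES) * 100 // LOC_TERABYTES
    return implied_books, times_as_large, percent_bigger


books, ratio, percent = size_comparison()
print(f"Implied number of books: {books:,}")
print(f"The Internet Archive is {ratio:.1f} times the size, or {percent:,}% bigger")
```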

A Case Study for Future Preparedness, and Scholarly Impact Objectives

A case study is the most fruitful way forward. On 26 October 2012, the Internet Archive announced that it would make an entire crawl of the Internet available to interested researchers: the entire collection amounts to 80 terabytes of information. While this information as a whole is too large a database to cover in its entirety during the timeframe of an IDG grant, a subset of “.ca” top-level domain sites (160,884 websites) is available, and is of appropriate scope. This would be an unparalleled snapshot of the Internet in Canada during the Internet Archive crawl between 9 March 2011 and 23 December 2011, how it changed, and what its content was. We would gain fascinating insight into the everyday lives of Canadians. What would a social or cultural history of 2011, conducted through these large-scale born-digital collections, reveal? This project will use several different methods, incorporated into HistoryCrawler. These methods will include sentiment analysis (the evaluation of a text’s emotional content), the discovery of various words that surround individual concepts, the counting of word frequency, and the isolation of recurring and significant topics.
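
Two of these methods, word-frequency counting and the discovery of words that surround a concept, are simple enough to sketch. The Python example below is an illustration under stated assumptions rather than the HistoryCrawler implementation: the three sample "pages", the short stopword list, and the function names are all hypothetical, and sentiment analysis and topic discovery are not shown.

```python
import re
from collections import Counter

# Hypothetical stand-ins for text extracted from archived web pages.
SAMPLE_PAGES = [
    "Canadian peacekeeping missions shaped how veterans remember service.",
    "Our museum exhibit on peacekeeping opens this spring in Ottawa.",
    "Hockey scores, weather, and local news from Ottawa.",
]

# A tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "and", "from", "how", "in", "on", "our", "the", "this"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(pages):
    """Count non-stopword tokens across all pages."""
    counts = Counter()
    for page in pages:
        counts.update(t for t in tokenize(page) if t not in STOPWORDS)
    return counts


def collocates(pages, keyword, window=2):
    """Count non-stopword tokens within `window` tokens of `keyword`."""
    counts = Counter()
    for page in pages:
        tokens = tokenize(page)
        for i, token in enumerate(tokens):
            if token != keyword:
                continue
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(t for t in neighbours if t not in STOPWORDS)
    return counts


print(word_frequencies(SAMPLE_PAGES).most_common(3))
print(collocates(SAMPLE_PAGES, "peacekeeping"))
```

Run over millions of pages instead of three sentences, counts of this kind are what would allow a researcher to see which terms cluster around a concept before choosing individual pages to read closely.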

This is grounded in the fundamental concept of being able to move quickly from a distant reading of all files to a close reading of a single webpage. In this respect, HistoryCrawler will let users move quickly from the realm of “information overload” to the micro-scale of an individual website, itself from a discrete moment in time. Ultimately, HistoryCrawler will consist of original programming as well as the enhancement and combination of several existing open-source tools. The latter consist of the Wayback Machine (which allows a user to view a website as it originally appeared), the suite known as WARC tools (which compile full-text indexes and archives from large archival files), as well as off-the-shelf visualization tools including McGill’s “Voyant Tools” and IBM Research’s “Many Eyes.” These computational methods offer the only fruitful way for a social or cultural historian to explore large collections such as the Internet Archive.

This work is situated within the field of digital history, itself an offshoot of the broader digital humanities community. The digital humanities explore how technology can be integrated into traditional scholarly activities and be harnessed for the creation of new forms of scholarship and media. Digital history is the specific application of digital methodologies and media to historical questions, offering possibilities for a significant revision of contemporary approaches (overviews include Cohen and Rosenzweig, 2006; Cohen et al., 2008). Projects have engaged with “big data” to answer questions of pressing historical importance. Culturomics, “the application of high-throughput data collection and analysis to the study of human culture,” offers an exciting new method of historical inquiry. Through this methodology, popularized by the Google n-gram project and laid out in a groundbreaking Science article (Michel et al., 2010), we can see the rise and fall of cultural ideas over a discrete period of time.

Culturomics offers one approach to big data; other tools exist, such as the MAchine Learning for LanguagE Toolkit (MALLET) project out of the University of Massachusetts-Amherst, which deploys Latent Dirichlet Allocation modeling to make sense of big data (Blei et al., 2003). Notable projects include Mining the Dispatch, large-scale digitally-derived historiographies, and the outcomes of the Software Environment for the Advancement of Scholarly Research (SEASR).

By aiming to create an accessible and free tool that does not require end-user programming knowledge, HistoryCrawler will help historians conceive of big data as an opportunity to achieve the traditional mission of social history. The lives of everyday people can be recovered through large numbers of sources. The overall ebb and flow of history can be retrieved from WARC files.

Relationship of this Project to Ongoing Work

This project builds on my previous work in the area of digital history, making me well positioned to carry out this research. My recent series of research blog posts (beginning with http://bit.ly/RNeFOr) focused on the historical use of WARC files, the file format used by the Internet Archive, and had wide readership. As the sole author of “Automated Downloading with Wget,” a chapter in the peer-reviewed technical publication Programming Historian 2 (of which I am also an editor-at-large), I have been recognized as an expert on routinized humanities information retrieval from born-digital collections. I was also a co-author of that anthology’s chapter on topic modeling, a computational technique for analyzing, organizing, and making sense of changes and trends within large amounts of information. Additionally, I have co-authored work on the computational turn in a forthcoming anthology by the University of California Press. My SSHRC postdoctoral fellowship focused on a digital history of postwar youth. Two draft papers have emerged from that project: the first a large-scale distant reading of tens of thousands of music lyrics, and the second an examination of the first foray of youth onto the Internet via a massive repository of information at Library and Archives Canada (LAC).

This project develops that avenue with an aim to building a foundation for subsequent historical work with born-digital sources, namely through the HistoryCrawler application. This IDG project also builds on another side project. With a team of collaborators, including William Turkel at Western University and the Museum of London (England), I have been carrying out social media analysis on their #citizencurators project, which sought to preserve and archive everyday life in London during the 2012 Olympics. Using topic modeling, textual analysis, and sentiment analysis, we have been successful in launching a case-study evaluation of how Twitter can be used as a historical source. HistoryCrawler will emerge as a single point of access to all of these features for born-digital sources.

Originality, Significance, and Expected Contribution to Knowledge

Every day most Canadians generate born-digital information that, if held in a traditional archive, would form a sea of boxes, folders, and unstructured data. Digital at the outset, such material can be found in blogs, webpages, and tweets, and it can take the form of text, image, video, and even interactive content. Historians, archivists, and librarians must learn how to navigate this sea of digital material, for tomorrow’s history will only be unlocked through access to the digital archives being generated today.

We are currently experiencing a revolutionary shift in the medium of information. As the price of digital storage plummets and communications are increasingly disseminated via digitized text, images, and video, the primary sources left for future historians have dramatically changed. We need to ensure that we can adequately manage born-digital sources as historians. While there is no commonly accepted rule for when a topic becomes “history,” the timeframe is shortening as the speed of information dissemination accelerates. For example, it took less than thirty years after the tumultuous year of 1968 for a varied, developed, and contentious North American historiography to appear on the topic of life in the 1960s (for historical examples see Owram, 1996 and even Levitt, 1984). In 2021, we will mark the thirtieth anniversary of the creation of the first website in August 1991. The 1990s were an important decade for digital sources (Johnson, 1999; Kuny, 1997; Library of Congress, 2010). By the decade’s end, the Internet had become the primary site for media consumption. Just as the media, government, and business radically transformed their practices then, historians must be prepared to analyze this information today.

This project, “Developing HistoryCrawler,” will be a case study exploration of how historians can harness Internet Archive data for historical research. By concentrating on concrete deliverables, such as the HistoryCrawler explorer, this project will build the tools we need to carry out research on previously inconceivable amounts of information. This grant will also lay the groundwork for a subsequent Partnership Development Grant aimed at bringing private sector partners (specializing in data management and social media analytics) together with humanities researchers. The University of Waterloo’s expertise in computational, interdisciplinary, and private sector cooperation makes this an ideal location from which to begin this chapter in historical research.

We have an unparalleled ability to look into the everyday social lives of people today, through social media, as well as tremendous amounts of born-digital material dating from the 1990s onwards. Looking forward, we now measure current events in “tweets per minute,” such as the 158,690 tweets per minute (TPM) hit during the first American Presidential debate in October 2012 (Dugan, 2012). This defies conventional analyses and even the ability to comprehend it: 158,690 tweets per minute, taking a median of 60 characters per tweet and an average of 5 characters (plus a space) per word, means that in a single minute 5,289 average-length 300-word pages of paper were generated. This is a data deluge, but one that historians will not be able to ignore.
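
The page estimate above follows directly from the stated figures, as this short Python sketch shows. All constants come from the paragraph; the only choice made here is to round down to whole pages.

```python
# Figures taken from the paragraph above.
TWEETS_PER_MINUTE = 158_690
CHARS_PER_TWEET = 60        # median tweet length used in the estimate
CHARS_PER_WORD = 5 + 1      # five characters plus a space
WORDS_PER_PAGE = 300


def pages_per_minute(tweets=TWEETS_PER_MINUTE):
    """Estimate how many full 300-word pages one minute of tweets fills."""
    total_chars = tweets * CHARS_PER_TWEET
    total_words = total_chars // CHARS_PER_WORD
    return total_words // WORDS_PER_PAGE  # whole pages only


print(f"{pages_per_minute():,} pages per minute")
```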

Proposed Research: Methodology

I will primarily approach this project by writing HistoryCrawler in the Mathematica programming language. Mathematica is an integrated platform for technical computing that will allow me to process, visualize, and interact with this exceptional array of 160,884 websites. The research team will be presented with the epitome of “information overload,” facing millions of pages of text, images, videos, and sound. Making sense of it requires rapid prototyping, and Mathematica fits the bill; indeed, it was singled out by the Criminal Intent project as an essential component of their work (Criminal Intent, 2011). Mathematica also offers potential collaboration opportunities: it allows researchers to freely and openly disseminate their programs to the public through its new Computational Document Format, which enables the distribution of interactive documents, allowing others to see, manipulate, and use my data. As much of digital humanities and digital history scholarship takes place online, my work as a founding co-editor of ActiveHistory.ca, a successful public history website, will help with the dissemination of this project. I will encourage the public to follow my work as I undertake it through both a traditional blog and a repository being set up to share code, data, and findings for public consumption. All information will be released under a Creative Commons 3.0 Share-Alike license.

There are several key activities that will be carried out during the two-year term of this grant:

  • Data Infrastructure Development: The first stage of the grant uses funds to secure cloud-based data storage, hosted at the University of Waterloo, to allow project team members to access the material. The amount of data being used will be considerable, amounting to 5 or 6 TB.
  • Working with Internet Archive Material: The Internet Archive is one of the most ambitious information preservation projects ever carried out by humankind. To date, however, humanists have not systematically worked with it. During this project, we will discuss best practices for working with its custom-built archival files (WARC files, which are individual files that contain the entire content of a website). We will experiment with search strategies to discover the most efficient methodology, and disseminate findings relating to the limitations and advantages of the Internet Archive for historians.
  • Data Analysis and Visualization: The project team will then carry out the following steps with an aim of producing the HistoryCrawler product:
    • We will analyze the data with an eye to developing an experimental social history of 2011 based on the Internet Archive holdings. What we find will be fundamental to the design of HistoryCrawler. What can we learn about everyday life in Canada from it? What questions can we ask, and what answers can this data reveal? What topics were predominant? What changes could we find? We can also ask more specific questions, using them as case studies, for example: How have Canadians interacted with their past?
    • We will develop a custom-built informational tool, HistoryCrawler, to process this archive. This tool will allow users to find relevant information quickly; in other words, HistoryCrawler will be a search engine with a historical dimension. For example, say we wanted to find what Canadians thought about “peacekeeping”: we would be able to move from a macro-level of the .ca domain, towards more focused sites, towards a quick visualization of what born-digital archival resources are available on this topic. This will involve tweaking existing open-source tools, such as the Wayback Machine, WARC Tools, and Lynx (an open-source text browser). What do historians need that traditional search engines offer? How can we better grasp the temporal dimension? How can this information be successfully archived?
    • We will launch HistoryCrawler for other researchers. With the help of an information visualization student, we will create graphics that illustrate the challenges, rewards, and promise of the Internet Archive for humanists.
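
The idea of a search engine with a historical dimension, as described in the activities above, can be illustrated with a small sketch. The Python example below is not the planned Mathematica implementation; it is a hypothetical illustration in which the capture records, host names, and the `build_index` and `finding_aid` functions are all invented for the purpose. It indexes each capture by term and date, then summarizes a query such as “peacekeeping” by month and by site, which is one simple form an “on-the-fly” finding aid could take.

```python
import re
from collections import Counter, defaultdict

# Hypothetical captures: (URL, capture date as YYYY-MM-DD, extracted text).
CAPTURES = [
    ("http://museum.example.ca/exhibits", "2011-03-15",
     "Peacekeeping exhibit opens"),
    ("http://news.example.ca/story1", "2011-03-20",
     "Debate over peacekeeping budget"),
    ("http://news.example.ca/story2", "2011-11-11",
     "Remembrance Day and peacekeeping veterans"),
    ("http://blog.example.ca/post", "2011-11-12",
     "Hockey night recap"),
]


def build_index(captures):
    """Map each term to a date-sorted list of (capture_date, url) pairs."""
    index = defaultdict(list)
    for url, date, text in captures:
        # Index each distinct term once per capture.
        for term in set(re.findall(r"[a-z]+", text.lower())):
            index[term].append((date, url))
    for postings in index.values():
        postings.sort()  # ISO dates sort chronologically as strings
    return index


def finding_aid(index, term):
    """Summarize where and when a term appears across the captures."""
    postings = index.get(term.lower(), [])
    by_month = Counter(date[:7] for date, _ in postings)
    by_site = defaultdict(list)
    for date, url in postings:
        host = url.split("/")[2]  # host portion of an http:// URL
        by_site[host].append((date, url))
    return {
        "total": len(postings),
        "by_month": dict(by_month),
        "by_site": dict(by_site),
    }


aid = finding_aid(build_index(CAPTURES), "peacekeeping")
print(aid["total"], aid["by_month"])
for host, hits in sorted(aid["by_site"].items()):
    print(host, hits)
```

The monthly counts correspond to the macro-level view, the per-site grouping to the narrower view of focused sites, and each stored URL and date is the pointer a user would follow to read an individual capture.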


References

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3 (2003): 993-1022.

Cohen, Daniel J. and Roy Rosenzweig. Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web. Philadelphia: University of Pennsylvania Press, 2006.

Cohen, Daniel, M. Frisch, P. Gallagher, S. Mintz, K. Sword, M.T. Kirsten, A.M. Taylor, W.G. Thomas III, W.J. Turkel. “Interchange: The Promise of Digital History.” Journal of American History 95 (2), September 2008: 442-51. http://www.journalofamericanhistory.org/issues/952/interchange/index.html, accessed 29 July 2011.

Criminal Intent. “What We Learned.” CriminalIntent.org, 7 June 2011. Available online, http://criminalintent.org/2011/06/what-we-learned/, accessed 13 January 2013.

Dugan, Lauren. “10.3 Million Tweets Sent During Last Night’s Presidential Debates.” AllTwitter. Available online, http://www.mediabistro.com/alltwitter/presidential-debates-10-million-tweets_b29439. Accessed 13 January 2013.

Gleick, James. The Information: A History, a Theory, a Flood. New York: Pantheon, 2011.

Google Books. http://books.google.ca/. Accessed 19 August 2011.

IBM Research. “What is Big Data?” http://www-01.ibm.com/software/data/bigdata/. Accessed 7 January 2013.

Internet Archive. http://www.archive.org/. Accessed 19 August 2011.

Internet Archive. “10,000,000,000,000,000 Bytes Archived!” Internet Archive Blog, 26 October 2012. Available online, http://blog.archive.org/2012/10/26/10000000000000000-bytes-archived/, accessed 19 December 2012.

Johnson, George. “Ideas & Trends: Shall I Compare Thee to a Swarm of Insects?; Searching for the Essence of the World Wide Web,” New York Times, 11 April 1999, available online, http://www.nytimes.com/1999/04/11/weekinreview/ideas-trends-shall-compare-thee-swarm-insects-searching-for-essence-world-wide.html, accessed 7 January 2013.

Kahle, Brewster. “Archiving the Internet.” Scientific American (March 1997), available online, http://www.uibk.ac.at/voeb/texte/kahle.html, accessed 7 January 2013.

Kuny, Terry. “A Digital Dark Ages? Challenges in the Preservation of Electronic Information.” 63rd IFLA Council and General Conference, September 1997. Available online. http://ifla.queenslibrary.org/iv/ifla63/63kuny1.pdf, accessed 7 January 2013.

Ithaka S + R. “Supporting the Changing Research Practices of Historians,” December 2012, http://www.sr.ithaka.org/news/understanding-historians-today-—-new-ithaka-sr-report, accessed 7 January 2013.

Levitt, Cyril. Children of Privilege: Student Revolt in the Sixties. Toronto: University of Toronto Press, 1984.

Library and Archives Canada. “Government of Canada Web Archive,” Library and Archives Canada Website, 17 October 2007, available online, http://www.collectionscanada.gc.ca/webarchives/index-e.html. Accessed 7 January 2013.

Library of Congress. “About the Library of Congress Web Archives.” Library of Congress Website. Undated. Available online, http://www.loc.gov/webarchiving/faq.html#faqs_02. Accessed 7 January 2013.

Library of Congress. “Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report, a Collaborative Initiative of the Library of Congress.” Digitalpreservation.gov. Available online, http://www.digitalpreservation.gov/multimedia/documents/NDIIPP2010Report_Post.pdf, accessed 19 December 2012.

Michel, Jean-Baptiste, Y. K. Shen, A.P. Aiden, A. Veres, M. Gray, Google Books Team, J. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M.A. Nowak, E.L. Aiden. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014) 14 January 2011: 176-182. Published Online Ahead of Print: 16 December 2010, http://www.sciencemag.org/content/331/6014/176, accessed 29 July 2011.

Owram, Doug. Born at the Right Time: A History of the Baby Boom Generation. Toronto: University of Toronto Press, 1996.