Web Scraping and Digital History – AHA 2017 Workshop

You can download the entire slide deck here (23 MB PDF).

Welcome to Web Scraping and Digital History, part of the Getting Started in Digital History series at the American Historical Association’s annual meeting. Below are the links that I’ll be going through in the workshop. You can also download the slides here, along with my contact information for follow-up conversations. Getting started with web scraping in three hours has turned out to be a tall order, but by the end of the workshop participants should know best practices for gathering material from websites and social media streams and for preserving it, and should have some idea of how to analyze it all.

Plan for the day

  1. Introductions
  2. Defining Digital History
  3. Different Approaches to Gathering Data
    1. Major Source Repositories (From JSTOR to Lexis|Nexis and Google Books)
    2. Finding Sources Online
    3. Web Scraping
    4. Social Media Scraping
  4. Preserving Your Data
  5. Analyzing Your Data
  6. Conclusions and Time to Play

Major source repositories

  1. Google Books Advanced Search – https://books.google.com/advanced_book_search
  2. Internet Archive Advanced Search – http://archive.org/advancedsearch.php
    1. Programming Historian lesson (for information): http://programminghistorian.org/lessons/data-mining-the-internet-archive
  3. HathiTrust – https://www.hathitrust.org/
    1. Programming Historian lesson (for information): http://programminghistorian.org/lessons/text-mining-with-extracted-features
  4. JSTOR Data for Research – http://dfr.jstor.org/
  5. Lexis|Nexis – paywalled
  6. ProQuest – paywalled
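
For those who would rather script a search than fill in the form, the Internet Archive advanced search page listed above can also return results as JSON. The following is a minimal sketch, assuming the `q`, `fl[]`, `rows`, `page`, and `output` query parameters that the search form itself generates. The helper name `build_search_url` and the example query are illustrative only and were not part of the workshop. The sketch only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumption: same endpoint as the advanced search page linked above.
BASE_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields=("identifier", "title"), rows=50, page=1):
    """Return a search URL that asks for JSON results (hypothetical helper)."""
    params = [("q", query)]
    # One fl[] entry per metadata field we want returned for each item.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("page", page), ("output", "json")]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Illustrative query: restrict results to text items.
    print(build_search_url("mediatype:texts"))
```

Pasting the printed URL into a browser is a quick way to check the query before writing any download code. The Programming Historian lesson linked above covers fuller workflows.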

Finding Sources Online

  1. Dream Cases (“The big red download button”)
    1. Epigraphic Database Heidelberg – http://edh-www.adw.uni-heidelberg.de/home
    2. Commonwealth War Graves Commission – http://www.cwgc.org/find-war-dead.aspx
  2. Doing it yourself... stay tuned

Web Scraping

  1. WebRecorder.io – http://webrecorder.io
  2. Import.io – http://import.io
    1. Top 40 Lyrics site – http://www.top40db.net/Find/Songs.asp?By=Year&ID=1970
    2. Home Children at Library and Archives Canada – http://www.bac-lac.gc.ca/eng/discover/immigration/immigration-records/home-children-1869-1930/immigration-records/Pages/list.aspx?
    3. We will learn how to scrape these two sites
    4. Try yourself on this Wikipedia page – https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States
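
Import.io is a point-and-click tool, so no code is needed for the exercises above. For readers curious about what such a tool does under the hood, here is a rough sketch that pulls table rows out of HTML using only Python's standard library. The class name `TableExtractor`, the function `extract_rows`, and the sample HTML are hypothetical: the sample is a hand-written fragment in the spirit of the Wikipedia list, not the page's actual markup, and real pages with nested tables or merged cells would need more care.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hand-written sample, not the real Wikipedia markup.
SAMPLE_HTML = """
<table>
  <tr><th>No.</th><th>Name</th><th>Took office</th></tr>
  <tr><td>1</td><td><a href="/wiki/George_Washington">George Washington</a></td><td>1789</td></tr>
  <tr><td>2</td><td><a href="/wiki/John_Adams">John Adams</a></td><td>1797</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

The rows come back as plain lists of strings, which can be written straight to a CSV file for use in a spreadsheet or in the analysis tools listed further down.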

Social Media Scraping

  1. DocNow Project – http://www.docnow.io/
  2. DocNow Server Set up –
  3. Social Feed Manager – http://ec2-35-164-93-237.us-west-2.compute.amazonaws.com/ui/

Preserving Your Data

  1. CERN’s Zenodo Project – http://zenodo.org/
  2. Scholars Portal Dataverse – https://dataverse.scholarsportal.info/
  3. Institutional Repository at your University?

Analyzing Your Data

  1. Voyant Tools – http://voyant-tools.org/
  2. Programming Historian – http://programminghistorian.org/

Discussion and free-form time