Digital History and Web Scraping – McMaster University Workshop


You can download the full PDF of the slide deck here (22.3 MB).

Welcome to the Digital History and Web Scraping Workshop, part of the Demystifying Digital Scholarship series at the Lewis & Ruth Sherman Centre for Digital Scholarship. Below are the links that I’ll be going through in the workshop. You can also download the slides here, and you’ll have my contact information for follow-up conversations. By the end of the workshop, participants should know best practices for gathering web material and social media streams, and have some idea of how to analyze it all. Whatever we can’t cover in two hours, you’ll find here.

Plan for the day

  1. Introductions
  2. Defining Digital History
  3. Different Approaches to Gathering Data
    1. Major Source Repositories (From JSTOR to Lexis|Nexis and Google Books)
    2. Finding Sources Online
    3. Web Scraping
    4. Social Media Scraping
  4. Analyzing Your Data
  5. Conclusions and Time to Play

Major source repositories

  1. Google Books Advanced Search – https://books.google.com/advanced_book_search
  2. Internet Archive Advanced Search – http://archive.org/advancedsearch.php
    1. Programming Historian lesson (for information): http://programminghistorian.org/lessons/data-mining-the-internet-archive
  3. HathiTrust – https://www.hathitrust.org/
    1. Programming Historian lesson (for information): http://programminghistorian.org/lessons/text-mining-with-extracted-features
  4. JSTOR Data for Research – http://dfr.jstor.org/
  5. Lexis|Nexis – paywalled
  6. ProQuest – paywalled
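
For those who want to move from the Internet Archive's Advanced Search form to something repeatable, here is a minimal Python sketch that only builds a search URL (it does not fetch anything). The parameter names (`q`, `fl[]`, `rows`, `page`, `output`) reflect my understanding of what the Advanced Search page itself generates, and the collection name in the example is a made-up placeholder, so treat all of it as assumptions to check against the form. The Programming Historian lesson linked above takes a different, library-based route.

```python
from urllib.parse import urlencode

# Same endpoint as the Advanced Search page linked above.
BASE_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields=("identifier", "title"), rows=50, page=1):
    """Build (but do not fetch) an Advanced Search URL that asks for JSON.

    The parameter names are assumptions based on the URLs the search
    form produces; verify them against the form before relying on them.
    """
    params = [("q", query)]
    # One "fl[]" entry per field we want returned for each item.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("page", page), ("output", "json")]
    return BASE_URL + "?" + urlencode(params)


# "hypothetical_collection" is a placeholder, not a real collection name.
url = build_search_url("collection:(hypothetical_collection) AND mediatype:texts")
print(url)
```

Pasting the printed URL into a browser is a quick way to see whether the query returns what you expect before writing any download code.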

Finding Sources Online

  1. Dream Cases (“The big red download button”)
    1. Epigraphic Database Heidelberg – http://edh-www.adw.uni-heidelberg.de/home
    2. Commonwealth War Graves Commission – http://www.cwgc.org/find-war-dead.aspx
  2. Doing it yourself: stay tuned

Web Scraping

  1. WebRecorder.io – http://webrecorder.io
  2. Import.io – http://import.io
    1. Top 40 Lyrics site – http://www.top40db.net/Find/Songs.asp?By=Year&ID=1970
    2. Home Children at Library and Archives Canada – http://www.bac-lac.gc.ca/eng/discover/immigration/immigration-records/home-children-1869-1930/immigration-records/Pages/list.aspx?
    3. We will learn how to scrape these two sites
    4. Try it yourself on this Wikipedia page – https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States
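
To give a sense of what a point-and-click scraper is doing under the hood, here is a minimal Python sketch that pulls the rows out of an HTML table using only the standard library. It is not how Import.io or WebRecorder work internally. The sample HTML and its song titles are invented for illustration, and the sketch assumes a simple, non-nested table. Real pages, including the sites listed above, are usually messier.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # the row being built, or None when outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample data standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Title</th></tr>
  <tr><td>1970</td><td>Example Song A</td></tr>
  <tr><td>1970</td><td>Example Song B</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

# Write the rows out as CSV text, ready for a spreadsheet.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Before pointing anything like this at a live site, check its terms of use and robots.txt, and keep requests slow and infrequent.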

Social Media Scraping

  1. DocNow Project – http://www.docnow.io/
  2. DocNow Server Setup – http://52.33.92.32

Analysis

  1. Voyant Tools – http://voyant-tools.org/
  2. Programming Historian – http://programminghistorian.org/
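
As a taste of the simplest kind of analysis, here is a minimal Python sketch of counting word frequencies in a text. It is not how Voyant Tools is implemented. The stopword list is a tiny hand-picked assumption, and the sample sentence is invented.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "and", "a", "to", "in"}


def term_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into words, and count non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


sample = "The history of the web is a history of archives and of the people in them."
freqs = term_frequencies(sample)
print(freqs.most_common(3))
```

Swapping `sample` for the contents of a scraped text file gives a first rough look at what a source is about.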

Discussion and free-form time