<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Ian Milligan</title>
	<atom:link href="http://ianmilligan.ca/feed/" rel="self" type="application/rss+xml" />
	<link>http://ianmilligan.ca</link>
	<description>A Digital, Public, and Youth Historian of 20th-Century Canada</description>
	<lastBuildDate>Fri, 24 May 2013 16:28:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='ianmilligan.ca' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Ian Milligan</title>
		<link>http://ianmilligan.ca</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://ianmilligan.ca/osd.xml" title="Ian Milligan" />
	<atom:link rel='hub' href='http://ianmilligan.ca/?pushpress=hub'/>
		<item>
		<title>Putting it all Together: WARC to Output</title>
		<link>http://ianmilligan.ca/2013/05/24/putting-it-all-together-warc-to-output/</link>
		<comments>http://ianmilligan.ca/2013/05/24/putting-it-all-together-warc-to-output/#comments</comments>
		<pubDate>Fri, 24 May 2013 16:24:48 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Digital History]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1414</guid>
		<description><![CDATA[After a great day yesterday at Code4Lib North, and playing with some of Nick Ruest&#8217;s WARC files from the &#8216;Free Dale Askey&#8217; collection, I&#8217;ve put everything together with the bash and Mathematica script. I&#8217;ll be playing with some of this &#8230; <a href="http://ianmilligan.ca/2013/05/24/putting-it-all-together-warc-to-output/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1414&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="attachment_1415" class="wp-caption alignleft" style="width: 246px"><a href="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-24-at-12-18-22-pm.png"><img class="size-medium wp-image-1415" alt="Output" src="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-24-at-12-18-22-pm.png?w=236&#038;h=300" width="236" height="300" /></a><p class="wp-caption-text">Output</p></div>
<p>After a great day yesterday at <a href="http://wiki.code4lib.org/index.php/North#Fourth_Meeting:_Ryerson_University.2C_May_23rd_and_24th.2C_2013">Code4Lib North</a>, and playing with some of <a href="http://freedaleaskey.plggta.org">Nick Ruest&#8217;s WARC files from the &#8216;Free Dale Askey&#8217; collection</a>, I&#8217;ve put everything together with the bash and <em>Mathematica</em> script.</p>
<p>I&#8217;ll be playing with some of this stuff in two weeks at the <a href="http://www.cha-shc.ca/en/Homepage_69/items/26.html"><em>Canadian Historical Association</em>&#8217;s annual meeting in Victoria</a>, in my presentation entitled &#8220;The Internet Archive and Social Historians: Challenge and Potential Amidst the WebARChive Files.&#8221; Slides and text will be up later, and I will also make the full-text paper available to interested parties.</p>
<p>I see a potential use for this in collections of WARC files where there are no finding aids, just a bunch of files, which I think is relatively common in &#8216;just-in-time&#8217; grabs and in other case studies. In the best case, I <a href="https://github.com/ruebot/arxivdaleascii">think the script used for the Dale Askey collection is great</a> &#8211; each file has an attached screenshot and PDF in addition to the WARC &#8211; but my workflow has a role in cases where little information is attached (i.e. it is more useful for cases like <a href="http://freedaleaskey.plggta.org/websites/00049-20130221">this one</a> than <a href="http://freedaleaskey.plggta.org/websites/00049-20130223">this one with more data</a>).</p>
<p>Here&#8217;s how it works. Unfortunately, you need <em>Mathematica</em>.</p>
<p><strong>Two files:</strong><br />
- <strong><a href="https://raw.github.com/ianmilligan1/Historian-WARC-1/master/all-together-mma.sh">All-Together-MMA.sh</a></strong>: Takes a WARC file passed to it from the command line, generates full-text, topic models it, and then invokes&#8230;<br />
- <strong><a href="https://raw.github.com/ianmilligan1/Historian-WARC-1/master/WARC-to-Analysis-single-file.m">WARC-to-Analysis-single-file.m</a></strong>: a <em>Mathematica</em> script that <a title="WARC Analysis Using Mathematica" href="http://ianmilligan.ca/2013/05/22/warc-analysis-using-mathematica/">generates the PDF file discussed in the last post</a>.</p>
<p><strong>How to run it:</strong></p>
<p>On the command line, once the script has been made executable (otherwise prepend <code>sh</code>):<br />
<code>./all-together-mma.sh 00016-2013_02_23.warc</code></p>
<p>This takes the WARC file, one from the Dale Askey collection, and runs it through the script. With the proper directories set in the files, it generates output like the above in one step. A big benefit is that I can now automate this across a ton of WARC files.</p>
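<p>A minimal sketch of how that automation might look, assuming the script sits in the current directory and the WARC files live in one folder. The directory name <code>warcs</code> and both function names are hypothetical; only the script name and its single-argument call come from the post.</p>

```python
import subprocess
from pathlib import Path

def build_commands(warc_dir, script="./all-together-mma.sh"):
    """Return one command (as an argument list) per .warc file, in sorted order."""
    return [[script, str(path)] for path in sorted(Path(warc_dir).glob("*.warc"))]

def run_all(warc_dir, script="./all-together-mma.sh", dry_run=True):
    """Run the script once per WARC file; with dry_run=True, only print the commands."""
    for command in build_commands(warc_dir, script):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the batch as soon as one file fails.
            subprocess.run(command, check=True)

if __name__ == "__main__":
    # "warcs" is a hypothetical directory name holding the collection.
    run_all("warcs", dry_run=True)
```

<p>Leaving <code>dry_run=True</code> prints the commands without running them, which is a cheap way to confirm the file list before committing to a long batch.</p>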
<p><strong>Work to do:</strong></p>
<p>- Need to refine stop words<br />
- Topic models are set up for large corpora, so running 50 topics on a <em>single</em> page is overkill.<br />
- Sparklines are set up for large corpora as well, so output is weird on one page, though it can still be moderately useful.<br />
- Integrate with AlchemyAPI for sentiment analysis? Multiple KWICs?</p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/05/24/putting-it-all-together-warc-to-output/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-24-at-12-18-22-pm.png?w=236" medium="image">
			<media:title type="html">Output</media:title>
		</media:content>
	</item>
		<item>
		<title>WARC Analysis Using Mathematica</title>
		<link>http://ianmilligan.ca/2013/05/22/warc-analysis-using-mathematica/</link>
		<comments>http://ianmilligan.ca/2013/05/22/warc-analysis-using-mathematica/#comments</comments>
		<pubDate>Wed, 22 May 2013 19:28:19 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Digital History]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1401</guid>
		<description><![CDATA[My earlier workflow took a WebARChive (WARC) file (or generated one based on a website) and, using an older version of WARC-Tools, generated a full-text index. It finished by generating an output in the Stanford Termite browser. Given the size &#8230; <a href="http://ianmilligan.ca/2013/05/22/warc-analysis-using-mathematica/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1401&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><div id="attachment_1408" class="wp-caption alignleft" style="width: 310px"><a href="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-22-at-3-27-06-pm.png"><img class="size-medium wp-image-1408" alt="This program creates a PDF like this from a WARC file, building on previous work." src="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-22-at-3-27-06-pm.png?w=300&#038;h=298" width="300" height="298" /></a><p class="wp-caption-text">This program creates a PDF like this from a WARC file, building on previous work.</p></div><a title="A WARC to Topic Model Visualization Workflow" href="http://ianmilligan.ca/2013/05/15/warc-to-topic-models/">My earlier workflow took a WebARChive (WARC) file</a> (or generated one based on a website) and, using an older version of WARC-Tools, generated a full-text index. It finished by generating output in the Stanford Termite browser. Given the size of these archives, and the sheer amount of text within them, I believe that these visualizations help us &#8216;see through the box,&#8217; as it were, and ascertain their relevance to our research topics.</p>
<p>I also have the ulterior motive of getting historians more involved in this topic (there are certainly a few already working in the field) &#8211; we were around to design the first generations of archival boxes (historians were fundamental in the early days of the archival profession), and I want us around as we tackle the second generation of the web archive box.</p>
<p>Today, using <em>Mathematica</em>, I developed my script further. It <a href="http://reference.wolfram.com/mathematica/tutorial/MathematicaScripts.html">can be called from the command line if you have <em>Mathematica</em></a>, and so can be built into the aforementioned workflow. It does the following:</p>
<div id="attachment_1402" class="wp-caption alignright" style="width: 310px"><a href="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-22-at-3-19-09-pm.png"><img class="size-medium wp-image-1402" alt="In Mathematica, you can change the search word and see KWIC change dynamically. The final PDF output requires some pre-defined keywords, however." src="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-22-at-3-19-09-pm.png?w=300&#038;h=148" width="300" height="148" /></a><p class="wp-caption-text">In Mathematica, you can change the search word and see KWIC change dynamically. The final PDF output requires some pre-defined keywords, however.</p></div>
<p>- takes the fulltext file and topic modelling data generated by the previous script;<br />
- generates word frequency information and displays it in a word cloud;<br />
- provides Keyword-in-Context (KWIC) views of specific words (or, if run from <em>Mathematica</em>, can provide dynamic information as seen at right);<br />
- and visualizes the topic models, in declining order of overall prominence within the WARC file, using sparklines to show whether each topic is spread evenly throughout the WARC or concentrated in just a few of its files.</p>
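<p>For readers without <em>Mathematica</em>, the word-frequency and Keyword-in-Context steps above can be sketched in Python. This is a stand-in under stated assumptions, not the script from the post: the stop-word list, the tokenizer, the function names and the sample sentence are all hypothetical.</p>

```python
import re
from collections import Counter

# Hypothetical stop-word list; the post notes its own list still needs refining.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text, top_n=10):
    """Return the top_n most common non-stop-words with their counts."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return Counter(words).most_common(top_n)

def kwic(text, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words of context per side."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

sample = "The archive holds web pages. Historians can search the archive for context."
print(word_frequencies(sample, top_n=3))
for left, word, right in kwic(sample, "archive"):
    print(f"{left} [{word}] {right}")
```

<p>The frequency counts are the raw material a word cloud is drawn from, and the KWIC tuples are what a fixed keyword list would produce for a static PDF.</p>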
<p>It then generates a long PDF file that you could store alongside the file, or use &#8211; with the specific keywords that you&#8217;re using &#8211; in an attempt to ascertain whether a given WARC file is handy for your research. It also shows how we have moved from large, ungainly WARC files into an area where we can apply text mining tools to them. Click through to see the final output:<span id="more-1401"></span></p>
<p>Download the <a href="http://ianmilli.files.wordpress.com/2013/05/trial-1.pdf">trial-1 PDF</a> here, or see below for a graphical version. <a href="https://github.com/ianmilligan1/Historian-WARC-1/blob/master/WARC-to-Analysis.nb">Code is on GitHub</a>:</p>
<p><a href="http://ianmilli.files.wordpress.com/2013/05/trial-1.jpg"><img class="aligncenter size-large wp-image-1405" alt="trial-1" src="http://ianmilli.files.wordpress.com/2013/05/trial-1.jpg?w=292&#038;h=1024" width="292" height="1024" /></a></p>
<p>Work still remains to be done, of course, especially on cleaning up the text that&#8217;s going into the system. That&#8217;s for tomorrow, however.</p>
<p>If anybody has anything they think would be a good addition, or other things you would like to see, feel free to comment, <a href="mailto:i2milligan@uwaterloo.ca">e-mail</a>, or <a href="https://twitter.com/ianmilligan1">tweet me.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/05/22/warc-analysis-using-mathematica/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-22-at-3-27-06-pm.png?w=300" medium="image">
			<media:title type="html">This program creates a PDF like this from a WARC file, building on previous work.</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-22-at-3-19-09-pm.png?w=300" medium="image">
			<media:title type="html">In Mathematica, you can change the search word and see KWIC change dynamically. The final PDF output requires some pre-defined keywords, however.</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/05/trial-1.jpg?w=292" medium="image">
			<media:title type="html">trial-1</media:title>
		</media:content>
	</item>
		<item>
		<title>An Aside: Blog Post Picked up in Condensed Format by Nature</title>
		<link>http://ianmilligan.ca/2013/05/16/an-aside-blog-post-picked-up-in-condensed-format/</link>
		<comments>http://ianmilligan.ca/2013/05/16/an-aside-blog-post-picked-up-in-condensed-format/#comments</comments>
		<pubDate>Thu, 16 May 2013 21:58:29 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Digital History]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1389</guid>
		<description><![CDATA[Put this in the &#8220;another reason why it&#8217;s good to blog&#8221; pile. My blog post from last month, which argued that Yahoo! Messages was consciously destroying a fifteen-year swath of history, led Nature magazine to ask if I would submit it &#8230; <a href="http://ianmilligan.ca/2013/05/16/an-aside-blog-post-picked-up-in-condensed-format/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1389&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-16-at-5-52-32-pm.png"><img class="alignleft size-medium wp-image-1390" style="border:1px solid black;" alt="Screen Shot 2013-05-16 at 5.52.32 PM" src="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-16-at-5-52-32-pm.png?w=300&#038;h=202" width="300" height="202" /></a>Put this in the &#8220;another reason why it&#8217;s good to blog&#8221; pile. My <a title="Yahoo! Has Probably Destroyed the Most History, Ever – And Historians Need to Wake Up" href="http://ianmilligan.ca/2013/04/03/yahoo-sucks-historians-wake-up/">blog post from last month, which argued that Yahoo! Messages was consciously destroying a fifteen-year swath of history</a>, led <a href="http://www.nature.com/nature/index.html"><em>Nature</em> magazine</a> to ask if I would submit it in condensed format to their <a href="http://www.nature.com/nature/authors/gta/others.html">Correspondence section</a> (behind a paywall, unfortunately, but you&#8217;ll be able to read it if you&#8217;re on an institutional IP connection). <a href="http://www.nature.com/nature/journal/v497/n7449/full/497317b.html">Click here to view it</a>.</p>
<p>I submitted it, and after <span style="text-decoration:underline;">further</span> pruning by their staff (an aside to this aside: great editorial staff, from the contact people to the copy editors), a <a href="http://www.nature.com/nature/current_issue.html">very short version appeared in today&#8217;s issue</a>.</p>
<p>It&#8217;s almost certainly the shortest thing I&#8217;ve ever written, but I like to think that it&#8217;ll get some readers and hopefully encourage more researchers to hop on the digital preservation bandwagon.</p>
<p>Thanks to all my readers who reblogged and retweeted my earlier post, which helped spread it around the Internet.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/05/16/an-aside-blog-post-picked-up-in-condensed-format/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-16-at-5-52-32-pm.png?w=300" medium="image">
			<media:title type="html">Screen Shot 2013-05-16 at 5.52.32 PM</media:title>
		</media:content>
	</item>
		<item>
		<title>A WARC to Topic Model Visualization Workflow</title>
		<link>http://ianmilligan.ca/2013/05/15/warc-to-topic-models/</link>
		<comments>http://ianmilligan.ca/2013/05/15/warc-to-topic-models/#comments</comments>
		<pubDate>Wed, 15 May 2013 18:36:27 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1373</guid>
		<description><![CDATA[Time to share some tinkering, in the hopes that somebody, somewhere, might find it helpful. Regular readers will know that I&#8217;m interested in how historians can approach web archives, as discussed in a three part series in late 2012 (see &#8230; <a href="http://ianmilligan.ca/2013/05/15/warc-to-topic-models/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1373&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Time to share some tinkering, in the hopes that somebody, somewhere, might find it helpful. <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Regular readers will know that I&#8217;m interested in how historians can approach web archives, as discussed in a three-part series in late 2012 (see part <a title="WARC Files: A Challenge for Historians, and Finding Needles in Haystacks" href="http://ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks/">one</a>, <a title="WARC Files, Part Two: Using WARC Tools to Get Closer to that Needle" href="http://ianmilligan.ca/2012/12/13/warc-files-part-two-using-warc-tools/">two</a>, and <a title="WARC Files, Part Three: Building a Full-Text Index" href="http://ianmilligan.ca/2012/12/14/warc-files-part-three-building-a-full-text-index/">three</a>). As I&#8217;ve stressed, in both tweets and in some draft writing: <strong>Historians need to understand web archives, however, as we will be professional end users of these archives.</strong> We played a critical role in shaping the modern practice of traditional archiving. Let us make sure that historians are present for the next step. There&#8217;s a conversation, but it&#8217;s <em>largely</em> amongst people involved in web archiving as creators rather than as users.</p>
<p>[<a href="https://github.com/ianmilligan1/Historian-WARC-1">if you want to skip to my code, it's here</a>]</p>
<p><strong>So here&#8217;s me positing a problem</strong>: Some web archives do not have a description, so you aren&#8217;t sure what you&#8217;re going to find inside. This includes some just-in-time web saves, like this <a href="http://archive.org/details/montrealmirror_1997_2010">mirror of the <em>Montreal Mirror</em>&#8217;s website</a>. There&#8217;s always an item listing, automatically generated, that lets you know what exactly is in the website. When dealing with <a href="http://archive.org/details/wide00002">wide web crawl data, part of that massive 80TB dataset</a>, this is a life saver. Very briefly: Web ARChive files are complicated containers of multiple files &#8211; <a href="http://activehistory.ca/">ActiveHistory.ca</a>, for example, is made up of over 18,000 files. That&#8217;d choke a file system, but you can turn it into one Web ARChive that you can play with later.</p>
<p><strong>Furthermore, these WARC files are too big. Wouldn&#8217;t it be nice if you could, at a glance, see what ianmilligan.ca is about without having to read what I&#8217;m writing here? </strong>(yes, but also imagine if you were just looking at a bunch of visualizations &#8211; would be invaluable in a research project)</p>
<p>But for the historian, it&#8217;s not terribly useful in and of itself. What if we had a lot of these files, how could we quickly see what related to our topic, and what didn&#8217;t? Would we be able to automate it?</p>
<p>Here&#8217;s my idea, which I cooked up as a way to learn some more technical skills and invent a tool to help me in my workflow. What if we could take a Web ARChive file (WARC) and then hook it up to a topic modelling visualizer, like Stanford&#8217;s Termite [<a href="http://vis.stanford.edu/papers/termite">see their paper here</a>]?<span id="more-1373"></span> This flow would ideally take a WARC file and then give us an output like this:</p>
<div id="attachment_1377" class="wp-caption aligncenter" style="width: 621px"><a href="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-15-at-2-16-56-pm.png"><img class="size-large wp-image-1377" alt="Here is Termite running on the text of the 1933 Canadian Commonwealth Federation's Regina Manifesto." src="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-15-at-2-16-56-pm.png?w=611&#038;h=578" width="611" height="578" /></a><p class="wp-caption-text">Here is Termite running on the text of the <a href="http://www.socialisthistory.ca/Docs/CCF/ReginaManifesto.htm">1933 Canadian Commonwealth Federation&#8217;s Regina Manifesto</a>. For more information on topic modelling, <a href="http://programminghistorian.org/lessons/topic-modeling-and-mallet">you can go check out our Programming Historian piece</a>.</p></div>
<p>Termite is great, but requires its data formatted in a particular way: a file number and then a document on one single line, in tab-separated format. It isn&#8217;t terribly friendly to just dump websites into. <span style="text-decoration:underline;"><strong>This is preliminary work, started two days ago, simultaneous with me learning how to script in bash. It works for me, but there&#8217;s no sense in me working in a bubble</strong></span>.</p>
<p>Here is my attempt to introduce a workflow, written in Bash, drawing on various Python tools, written on Mac OS X and presumably compatible with Linux. What it does is the following:</p>
<p>(1) Finds a website, such as <a href="http://ianmilligan.ca" rel="nofollow">http://ianmilligan.ca</a>, and mirrors it all into a single WARC file</p>
<p>(2) Transforms that WARC file into a full-text searchable index, which is far more usable in terms of size. Downsides: we lose some of the context provided by images, etc. If images have alt text, it is kept with the rest of the text.</p>
<p>(3) Takes that full-text searchable index and transforms it into Termite format, with each individual file that makes up the WARC given a single line.</p>
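<p>Step (3) can be sketched as follows, assuming the format is exactly what is described above: a file number, a tab, then the whole document on one line. The function names and the sample pages are hypothetical, and the real script may number or clean the text differently.</p>

```python
import re

def to_termite_line(doc_id, text):
    """Collapse all whitespace (tabs and newlines included) so the document fits on one line."""
    one_line = re.sub(r"\s+", " ", text).strip()
    return f"{doc_id}\t{one_line}"

def to_termite_corpus(documents):
    """Number the documents from 1 and join them, one tab-separated line each."""
    lines = [to_termite_line(i, text) for i, text in enumerate(documents, start=1)]
    return "\n".join(lines) + "\n"

pages = ["First page.\nIt spans two lines.", "Second\tpage with a tab."]
print(to_termite_corpus(pages), end="")
```

<p>Collapsing whitespace first matters because a stray tab or newline inside a page would otherwise be read as a field or record boundary.</p>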
<p>How does it work?</p>
<hr />
<p><strong>Step One: Install Initial Tools</strong></p>
<p><strong>WGET: </strong>It requires wget, with WARC functionality (so no old versions). If you don&#8217;t have that, you&#8217;ll need to install it. If you&#8217;re on a Mac, you need to download Xcode and its command line tools, and then compile wget. Follow the specific instructions in my <a href="http://programminghistorian.org/lessons/automated-downloading-with-wget"><em>Programming Historian </em>piece</a>.</p>
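<p>As a rough illustration of what the mirroring step asks of wget, here is a sketch that builds a mirror-to-WARC command and checks that wget is installed. The <code>--mirror</code> and <code>--warc-file</code> options are standard wget flags, but the flags the actual script passes are not shown in this post, so treat the command and the function name as assumptions.</p>

```python
import shutil

def build_wget_command(url, warc_name):
    """Return a wget argument list that mirrors url and records it in a WARC named warc_name."""
    return ["wget", "--mirror", f"--warc-file={warc_name}", url]

command = build_wget_command("http://ianmilligan.ca", "ianmilligan")
print(" ".join(command))

# shutil.which returns None when wget is not on the PATH.
if shutil.which("wget") is None:
    print("wget not found: install a WARC-capable build first")
```

<p>Building the command as a list rather than a single string keeps URLs with unusual characters from being reinterpreted by the shell if the list is later handed to <code>subprocess.run</code>.</p>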
<p><strong>LYNX: </strong>These tools use Lynx to read the WARCed webpages and display them. Newer versions use Beautiful Soup 4, but I&#8217;m sticking with Lynx for now (I&#8217;ve been working with it for a while and am fairly impressed). <a href="https://wincent.com/wiki/Installing_Lynx_2.8.7_on_Mac_OS_X_10.6_Snow_Leopard">Follow the detailed instructions here</a>.</p>
<hr />
<p><strong>Step Two: Download and Install Stanford/Termite</strong></p>
<p>Remarkably for such a project, the documentation is <strong>exceptional</strong>. <a href="https://github.com/StanfordHCI/termite">They&#8217;re on GitHub, so check it out there</a>.</p>
<hr />
<p><strong>Step Three: Download and Install Historian-WARC-1 Files</strong></p>
<p>For these, <a href="https://github.com/ianmilligan1/Historian-WARC-1">check out my GitHub repository here</a>.</p>
<p>Download them all to a directory (click the ZIP button on the GitHub page for convenience). <strong>The easiest way is to place all these files into the same directory in which you have Termite running.</strong></p>
<p>You&#8217;ll need to initialize the WARC-Tools. In your terminal window, in the /WARC/Hanzo-Warc directory, run the following commands:</p>
<p><code>./setup.py build</code></p>
<p><code>sudo ./setup.py build install</code></p>
<p>You&#8217;ll be asked for your password on the latter, since it runs under <code>sudo</code>.</p>
<hr />
<p><strong>Step Four: Point all the Directories at the Right Place</strong></p>
<p>For the Historian-WARC-1 files, you&#8217;ll need to change a few things.</p>
<p>Editing <strong>All-Together.sh</strong>, make sure to:</p>
<p>Change the <code>URLTOGET</code> value to the website you&#8217;re interested in looking at. You might want to change the <code>OUTPUT</code> field too.</p>
<p>Then you&#8217;ll need to change, in <strong>line 38</strong>, the path to the warc-tools-mandel directory.</p>
<p>Then, editing <strong>TRIAL.CFG</strong>, make sure to:</p>
<p>Change the <strong>path</strong> to the directory you&#8217;ve installed everything in. There are three of these. You may also need to alter the number_of_seriated terms variable.</p>
<p><strong>Otherwise, try to read the comments in the scripts, and feel free to leave comments below. If it all works, you can run this script &#8211; it grabs a website, WARCs it, and then visualizes it in Termite on localhost:8888</strong>.</p>
<div id="attachment_1378" class="wp-caption aligncenter" style="width: 621px"><a href="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-15-at-2-15-06-pm1.png"><img class="size-large wp-image-1378" alt="Output from ianmilligan.ca - you can get a sense of who I am pretty quickly, without having to read all the content. (thank god, eh?)" src="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-15-at-2-15-06-pm1.png?w=611&#038;h=539" width="611" height="539" /></a><p class="wp-caption-text">Output from ianmilligan.ca &#8211; you can get a sense of who I am pretty quickly, without having to read all the content. (thank god, eh?)</p></div>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/05/15/warc-to-topic-models/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-15-at-2-16-56-pm.png?w=611" medium="image">
			<media:title type="html">Here is Termite running on the text of the 1933 Canadian Commonwealth Federation&#039;s Regina Manifesto.</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/05/screen-shot-2013-05-15-at-2-15-06-pm1.png?w=611" medium="image">
			<media:title type="html">Output from ianmilligan.ca - you can get a sense of who I am pretty quickly, without having to read all the content. (thank god, eh?)</media:title>
		</media:content>
	</item>
		<item>
		<title>It&#8217;s Getting Harder to Commit History: SSHRC Postdocs, 1995-2013</title>
		<link>http://ianmilligan.ca/2013/05/13/sshrc-postdocs-1995-2012/</link>
		<comments>http://ianmilligan.ca/2013/05/13/sshrc-postdocs-1995-2012/#comments</comments>
		<pubDate>Mon, 13 May 2013 16:01:57 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Canada]]></category>
		<category><![CDATA[Higher Education]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1362</guid>
		<description><![CDATA[One of the most popular blog posts on this site is my SSHRC Postdocs: What&#8217;s Going On post, which plots applications, successful awards, and the success rate. Back in February, I tweeted a revised version of the chart, but wanted &#8230; <a href="http://ianmilligan.ca/2013/05/13/sshrc-postdocs-1995-2012/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1362&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>One of the most popular blog posts on this site is my <a href="http://ianmilligan.ca/2012/08/26/sshrc-postdocs-whats-going-on/">SSHRC Postdocs: What&#8217;s Going On post</a>, which plots applications, successful awards, and the success rate. Back in February, I tweeted a revised version of the chart, but wanted to put it here with some additional commentary:</p>
<div id="attachment_1365" class="wp-caption aligncenter" style="width: 621px"><a href="http://ianmilli.files.wordpress.com/2013/05/success-rate-may-2013.png"><img class="size-large wp-image-1365" alt="SSHRC Postdoctoral Fellowship Success Rates, 1995-2013." src="http://ianmilli.files.wordpress.com/2013/05/success-rate-may-2013.png?w=611&#038;h=330" width="611" height="330" /></a><p class="wp-caption-text">SSHRC Postdoctoral Fellowship Success Rates, 1995-2013.</p></div>
<p>The above includes the 2013-14 results. We&#8217;re seeing a blip upwards in the success rate: there are more awards and a slight decline in the number of applicants. We&#8217;ll need a few more years to see whether this is a statistical hiccup or the beginning of a new trend. There might be some chill factor from the impending tighter restrictions on the number of applications, <a href="http://ianmilligan.ca/2012/07/11/a-few-thoughts-on-sshrc-changes-to-gradpostdoc-support-a-mixed-bag/">announced last summer</a>. In any case, this success rate is still low: although it is an uptick, it is still lower than at any time between the 1996-97 competition year and the 2008-09 competition year.</p>
<p><strong>An additional thing to keep in mind: </strong>The &#8220;History Wars&#8221; have been getting a lot of attention lately, most recently due to the federal government&#8217;s decision to begin a &#8220;comprehensive review of significant aspects in Canadian history.&#8221; The specifics are best left to others, <a href="http://fullcomment.nationalpost.com/2013/05/13/history-for-monday/">notably Active History co-editor Thomas Peace</a>. I don&#8217;t want to get into it, but certainly there&#8217;s a tendency for the Conservatives to want to encourage scholarship that they find politically agreeable (military history, Diefenbaker) &#8211; just as there was for the Liberals (peacekeeping, multiculturalism). But keep in mind, as we discuss historians, that it&#8217;s quite frankly getting harder to do <strong>any kind of history as a professional undertaking. </strong></p>
<p>There&#8217;s a war on junior scholars in Canada. That&#8217;s being a bit provocative, I know, and the federal government doesn&#8217;t deserve all the blame. We have seen declines in total postdoctoral fellowships awarded in several years as seen above, which we can attribute to funding issues. Universities and provincial governments also deserve some blame, as they&#8217;ve flooded the market. But whatever the root causes, we&#8217;re seeing a major issue of human capital in the data above.</p>
<p><strong>This affects</strong> <span style="text-decoration:underline;"><strong>all of us</strong></span>: whether you&#8217;re a military historian, a social historian, a Conservative, a Liberal, or a New Democrat. If we really care about history, let&#8217;s put our money where our mouth is and help fund the next generation of historians. It affects junior scholars most of all, but also senior scholars. This is the future of the profession.</p>
<p><strong>A return to 2004-05 success rates would mean 62 more postdoctoral fellowships would have been awarded this year</strong> (244 out of 903 would have been a 27.02% success rate, which we had in 2004-05).</p>
<p><strong>It&#8217;s getting harder to commit history.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/05/13/sshrc-postdocs-1995-2012/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/05/success-rate-may-2013.png?w=611" medium="image">
			<media:title type="html">SSHRC Postdoctoral Fellowship Success Rates, 1995-2013.</media:title>
		</media:content>
	</item>
		<item>
		<title>An Aside: An Ode to Exploratory Research</title>
		<link>http://ianmilligan.ca/2013/05/10/an-aside-an-ode-to-exploratory-research/</link>
		<comments>http://ianmilligan.ca/2013/05/10/an-aside-an-ode-to-exploratory-research/#comments</comments>
		<pubDate>Fri, 10 May 2013 20:57:23 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Asides]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1352</guid>
		<description><![CDATA[With the semester done, the search committee I was on having wrapped up, and &#8211; finally &#8211; my two article drafts (one on moral panics on the early Canadian Internet and one on WebArchiving) completed, I had an entire afternoon &#8230; <a href="http://ianmilligan.ca/2013/05/10/an-aside-an-ode-to-exploratory-research/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1352&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>With the semester done, the search committee I was on having wrapped up, and &#8211; finally &#8211; my two article drafts (one on moral panics on the early Canadian Internet and one on WebArchiving) completed, I had an entire afternoon of <strong>guilt-free exploratory research</strong>.</p>
<p><strong>I love exploratory research</strong>. A <a title="CV" href="http://ianmilligan.ca/cv/">forthcoming article</a> grew out of exploratory research (<a title="Illusionary Order: Cautionary Notes for Online Newspapers" href="http://ianmilligan.ca/2012/03/26/illusionary-order-cautionary-notes-for-online-newspapers/">blogged about here</a>), when I was messing around with dissertations and citation counts. On the other hand, that obscures the <strong>days and days</strong> that I&#8217;ve <a href="http://www.slate.com/blogs/browbeat/2013/03/06/literally_definition_has_changed_over_the_years_dictionaries_recognize_this.html">literally</a> spent hitting dead ends, banging my head against bad data, technical limitations, or sources that never really went anywhere. I have terabytes of external hard drives, filled up with datasets, some of them representing a few days of work that won&#8217;t soon see the light of day.</p>
<p>So it&#8217;s nice to have these guilt-free days to just play, in a constructive way. <strong><a href="http://lifehacker.com/5932586/make-work-feel-less-like-work-with-the-8020-rule">I think of it as akin to the Google 20% time</a>.</strong> To <a href="http://statusboard.archive.org">check out what&#8217;s new at the Internet Archive</a>. To take an abstract problem and play with it in a programming language. To <a href="http://www.cbc.ca/the180/excerpts/2013/05/08/does-canadas-history-need-revision/">listen to a CBC debate on Canadian history</a>.</p>
<p><strong>So what have I discovered today?<span id="more-1352"></span></strong></p>
<p>- Well, <a href="http://www.jimclifford.ca">Jim Clifford</a> and I spent a good hour figuring out how to do a pattern match for a <strong>list made up of mixed integers and strings in <em>Mathematica</em></strong>. Sounds boring, eh? Next time this comes up, it won&#8217;t take an hour &#8211; it&#8217;ll take a second. That&#8217;s the joy of programming. And I&#8217;ve hopefully made just a small, minor contribution to the understanding of global historical commodity flows. <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>And because it took so damn long: <code>Cases[test, {x_ /; 1620 &lt;= x &lt;= 1629, _, _}]</code> got us the dates we needed.</p>
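<p>For anyone without <em>Mathematica</em> handy, here is a minimal sketch of the same filter in Python. The sample rows are made up for illustration (they are not from Jim&#8217;s dataset), and I&#8217;m assuming each row is a three-item list with an integer year first, which is what the pattern above implies.</p>

```python
# Hypothetical sample rows: [year, commodity, quantity]. Not real data.
test = [
    [1615, "tallow", 12],
    [1620, "palm oil", 40],
    ["n.d.", "tallow", 7],
    [1629, "cinchona", 3],
    [1630, "tallow", 9],
    [1625, "palm oil"],
]


def cases_1620s(rows):
    """Keep three-item rows whose first item is an integer from 1620 to 1629."""
    return [
        row for row in rows
        if len(row) == 3
        and isinstance(row[0], int)
        and row[0] in range(1620, 1630)
    ]


matches = cases_1620s(test)
print(matches)
```

<p>The string &#8220;n.d.&#8221; and the two-item row are simply skipped rather than raising an error, which is roughly how the mixed list behaved for us once the pattern was right.</p>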
<p>- You can <strong>download Internet Archive collections <em>en masse</em>, thanks to their openness to using wget on their collections</strong>! <a href="http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/">Check out the blog post here</a>, and try generating <a href="http://archive.org/advancedsearch.php">your own list here</a>. I&#8217;m downloading a massive collection of magazines right now. Once again, there&#8217;s a good chance that it&#8217;ll just sit on my external hard drive and collect digital dust. But who knows. (p.s. <a href="http://programminghistorian.org/lessons/automated-downloading-with-wget">you can check wget out at the <em>Programming Historian</em></a>)</p>
<p>This command, which you can see broken down on the Internet Archive blog, is downloading an entire item list for me &#8211; just the text files and some of the images.</p>
<p><code>wget -r -H -nc -np -nH --cut-dirs=2 -t 1 -A .txt,.jpg -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'</code></p>
<p>You could use that to download cookbooks <em>en masse</em>, or statistical reports, or really anything you might want.</p>
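<p>To make the <code>-i</code> and <code>-B</code> part a little less magical: as I understand the Internet Archive&#8217;s post, each line of <code>itemlist.txt</code> is an item identifier, and wget resolves it against the base URL. Below is a rough Python sketch of that resolution step only. The identifiers are invented placeholders, and nothing is downloaded.</p>

```python
from urllib.parse import urljoin

BASE = "http://archive.org/download/"

# Invented placeholder identifiers, standing in for the lines of itemlist.txt.
item_ids = ["example-magazine-1985-01", "example-magazine-1985-02", ""]


def item_urls(identifiers, base=BASE):
    """Resolve each non-blank identifier against the base URL."""
    return [urljoin(base, ident.strip()) for ident in identifiers if ident.strip()]


for url in item_urls(item_ids):
    print(url)
```

<p>Printing the list first is a cheap way to sanity-check an item list before letting wget loose on it.</p>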
<p>- I discovered a new OS X command, <a href="http://ss64.com/osx/textutil.html"><strong>textutil</strong></a>. There&#8217;s probably a reason I&#8217;ve never heard of it before, but it worked on some Lynx-formatted web archives where <a href="https://github.com/aaronsw/html2text">html2text</a> balked. This is probably more my fault than html2text&#8217;s, but it&#8217;s still nice to have a built-in terminal command for this stuff.</p>
<p>- Also, the coffee over at <a href="http://engsoc.uwaterloo.ca/services/c-d">Engineering&#8217;s appropriately named &#8220;Coffee and Donuts&#8221;</a> has more caffeine than the President&#8217;s Choice stuff I normally drink here. Gah!</p>
<p>Come Monday morning, I&#8217;ll be back at the main projects: wrangling these articles together, preparing a conference presentation, responding to the e-mails that are stacking up, and so forth. But for the rest of the day, it&#8217;s play time.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/05/10/an-aside-an-ode-to-exploratory-research/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>
	</item>
		<item>
		<title>NCPH Panel: &#8220;Reaching the Public through the Web: The Practice of Digital Active History&#8221;</title>
		<link>http://ianmilligan.ca/2013/04/17/ncph-panel-reaching-the-public-through-the-web-the-practice-of-digital-active-history/</link>
		<comments>http://ianmilligan.ca/2013/04/17/ncph-panel-reaching-the-public-through-the-web-the-practice-of-digital-active-history/#comments</comments>
		<pubDate>Wed, 17 Apr 2013 18:16:25 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1344</guid>
		<description><![CDATA[On Friday morning at 8:30am (!), I&#8217;m looking forward to giving a brief presentation alongside several of my colleagues at the NCPH annual meeting. In our panel, &#8220;Reaching the Public through the Web: The Practice of Digital Active History,&#8221; we will &#8230; <a href="http://ianmilligan.ca/2013/04/17/ncph-panel-reaching-the-public-through-the-web-the-practice-of-digital-active-history/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1344&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://ianmilli.files.wordpress.com/2013/04/screen-shot-2013-04-17-at-2-13-06-pm.png"><img class="alignright size-medium wp-image-1345" alt="Screen Shot 2013-04-17 at 2.13.06 PM" src="http://ianmilli.files.wordpress.com/2013/04/screen-shot-2013-04-17-at-2-13-06-pm.png?w=300&#038;h=85" width="300" height="85" /></a>On <span style="text-decoration:underline;">Friday morning at 8:30am</span> (!), I&#8217;m looking forward to giving a brief presentation alongside several of my colleagues at the <a href="http://ncph.org/cms/conferences/2013-annual-meeting/">NCPH annual meeting</a>. In our panel, &#8220;Reaching the Public through the Web: The Practice of Digital Active History,&#8221; we will be providing several different perspectives on, well, how to reach the public through the Internet. Our session abstract puts it eloquently (I didn&#8217;t write it):</p>
<blockquote><p>Active history is history that listens, is responsive and encourages a broad range of forms of public engagement. As the accessibility and volume of digital content increases, so do possibilities for digital outreach activities. These opportunities bring challenges, benefits, and new methods of approaching the past. This panel focuses on the intersection of active history and digital technologies; with an emphasis on community involvement, alternate reality games, digital vs. physical engagement, and the possibilities of engaging disparate audiences.</p></blockquote>
<p>What will we be talking about?<span id="more-1344"></span></p>
<p>I may be kicking things off with an introduction to the website <a href="http://activehistory.ca/">ActiveHistory.ca</a>, highlighting some of our successes, challenges, and the errors we made along the way. Engaging with the public isn&#8217;t just as simple as throwing up your shingle. I&#8217;m going to argue that <strong>&#8220;blogging can and needs to be considered a central component of a university-based historian who wants to engage the public.&#8221;</strong> We&#8217;ll see if I&#8217;m preaching to the choir.</p>
<p>Things will get more interesting as others move beyond my often utopian visions of the Internet and bring the perspectives of very involved public historians! <strong><a href="http://krista-mccracken.blogspot.ca">Krista McCracken</a></strong> (an editor at ActiveHistory.ca) is going to be drawing on a case study from some very moving residential school photographs, talking about the shift from the &#8220;real world&#8221; to &#8220;digital space.&#8221; What challenges are presented? How can you engage with communities digitally? And what about the question of audience?</p>
<p><a href="http://devonelliott.net"><strong>Devon Elliott</strong></a>, of <a href="http://dhmakerbus.com">DH Maker Bus</a> fame <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> , is looking at DIY in the sense of all the making he does! What does it mean to make objects that speak to the past? What do they do at the <a href="http://williamjturkel.net/2013/02/02/the-history-department-with-a-fab-lab/">Fab Lab at Western</a>? How can there be an Active History beyond text? Devon came to speak to my class at Waterloo, and I&#8217;m not sucking up to him when I note that it was an overwhelming favourite &#8216;hit&#8217; of the course.</p>
<p>And then <a href="http://tpeace.wordpress.com"><strong>Tom Peace</strong></a> &#8211; who helped come up with the original Active History idea with <a href="http://www.jimclifford.ca">Jim Clifford</a> and a few others way back in 2008 &#8211; highlights the issue with just relying on the web. As an engaged community member, leading Jane&#8217;s Walks, grasping the socio-economic complexities of cities like his hometown of Hamilton, Ontario, Tom brings a good cautionary note. Whenever I&#8217;m getting too utopian, he&#8217;s usually there to bring reality to bear.</p>
<p>All of this is moderated by <a href="http://www.unbc.ca/history/faculty">Nathan Smith from the University of Northern British Columbia</a>, who co-leads the Active History Canadian Historical Association group and is a really engaged scholar who brings it together.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/04/17/ncph-panel-reaching-the-public-through-the-web-the-practice-of-digital-active-history/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/04/screen-shot-2013-04-17-at-2-13-06-pm.png?w=300" medium="image">
			<media:title type="html">Screen Shot 2013-04-17 at 2.13.06 PM</media:title>
		</media:content>
	</item>
		<item>
		<title>An Aside: &#8220;An Academic with Impostor Syndrome&#8221;</title>
		<link>http://ianmilligan.ca/2013/04/16/an-aside-an-academic-with-imposter-syndrome/</link>
		<comments>http://ianmilligan.ca/2013/04/16/an-aside-an-academic-with-imposter-syndrome/#comments</comments>
		<pubDate>Tue, 16 Apr 2013 13:23:49 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Asides]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1332</guid>
		<description><![CDATA[In academia, we talk a lot about the stresses of the job. A lot of this, for me, comes down to these two must-reads: Joseph Kasper&#8217;s &#8220;An Academic With Impostor Syndrome&#8221; in the Chronicle of Higher Education and Aimée Morrison&#8217;s &#8230; <a href="http://ianmilligan.ca/2013/04/16/an-aside-an-academic-with-imposter-syndrome/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1332&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>In academia, we talk a lot about the stresses of the job. A lot of this, for me, comes down to these two must-reads: <a href="http://chronicle.com/article/An-Academic-With-Impostor/138231/">Joseph Kasper&#8217;s &#8220;An Academic With Impostor Syndrome&#8221;</a> in the <em>Chronicle of Higher Education</em> and <a href="http://www.hookandeye.ca/2013/04/academic-imposter-syndrome.html">Aimée Morrison&#8217;s &#8220;Academic Impostor Syndrome&#8221;</a> in the always great blog <em>Hook and Eye</em>.</p>
<p>It&#8217;s funny, as we often tend to put forward imposing, confident faces on social media: the researcher, teacher, and administrator with his/her stuff together, progressing forward. Even in a few months where I&#8217;ve designed and taught two new courses while moving a pretty aggressive research agenda forward, that voice is always there: &#8220;Is this enough?&#8221; &#8220;Shouldn&#8217;t you work just a bit later on the weekend?&#8221; &#8220;How dare you take a break when others fight for full-time work?&#8221;</p>
<p>These articles helped remind me that I&#8217;m not alone. Hopefully others find them helpful too.</p>
<p>Back to work, though! Tomorrow, I&#8217;m off to Ottawa for <a href="http://ncph.org/cms/">NCPH 2013</a>, and am looking forward to a glass of wine on the train.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/04/16/an-aside-an-academic-with-imposter-syndrome/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>
	</item>
		<item>
		<title>Topic Modelling Review Up</title>
		<link>http://ianmilligan.ca/2013/04/11/topic-modelling-review-up/</link>
		<comments>http://ianmilligan.ca/2013/04/11/topic-modelling-review-up/#comments</comments>
		<pubDate>Thu, 11 Apr 2013 16:42:31 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Digital History]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1313</guid>
		<description><![CDATA[The new issue of the Journal of Digital Humanities is out, and I&#8217;m really happy to be &#8211; along with Shawn Graham of Carleton University &#8211; a contributor. In this issue, we&#8217;re reviewing the Java-based topic modelling tool MALLET (the MAchine Learning &#8230; <a href="http://ianmilligan.ca/2013/04/11/topic-modelling-review-up/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1313&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://ianmilli.files.wordpress.com/2013/04/screen-shot-2013-04-11-at-12-39-28-pm.png"><img class="alignright size-medium wp-image-1314" alt="Screen Shot 2013-04-11 at 12.39.28 PM" src="http://ianmilli.files.wordpress.com/2013/04/screen-shot-2013-04-11-at-12-39-28-pm.png?w=300&#038;h=261" width="300" height="261" /></a>The new issue of the <em><a href="http://journalofdigitalhumanities.org">Journal of Digital Humanities</a> </em>is out, and I&#8217;m really happy to be &#8211; along with <a href="http://electricarchaeology.ca">Shawn Graham of Carleton University</a> &#8211; a contributor. In this issue, <a href="http://journalofdigitalhumanities.org/2-1/review-mallet-by-ian-milligan-and-shawn-graham/">we&#8217;re reviewing the Java-based topic modelling tool MALLET</a> (the MAchine Learning for LanguagE Toolkit). I&#8217;m really flattered to be part of the issue, and glad to see several links to our <a href="http://programminghistorian.org/lessons/topic-modeling-and-mallet">free, open access guide to how to use MALLET</a> in the <em>Programming Historian 2</em> throughout.</p>
<p>Topic modelling, as the review and <em>Programming Historian 2</em> piece note, has been a critical part of my digital toolkit over the last year or so. It&#8217;s enabled me to make cursory explorations of very large datasets: musical lyrics, Canadian history dissertations, dead websites, and more. Some of the initial euphoria around it being a &#8216;magic bullet&#8217; for everything has worn off (honestly, the first time you try it, you&#8217;ll be bouncing off the walls) but it&#8217;s an incredible program to use.</p>
<p>If you ever want to chat topic modelling, shoot me an <a href="mailto:i2milligan@uwaterloo.ca">e-mail</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/04/11/topic-modelling-review-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>

		<media:content url="http://ianmilli.files.wordpress.com/2013/04/screen-shot-2013-04-11-at-12-39-28-pm.png?w=300" medium="image">
			<media:title type="html">Screen Shot 2013-04-11 at 12.39.28 PM</media:title>
		</media:content>
	</item>
		<item>
		<title>Saving History: The Eternal Responsibility of the Historian</title>
		<link>http://ianmilligan.ca/2013/04/08/saving-history-the-eternal-responsibility-of-the-historian/</link>
		<comments>http://ianmilligan.ca/2013/04/08/saving-history-the-eternal-responsibility-of-the-historian/#comments</comments>
		<pubDate>Mon, 08 Apr 2013 11:45:48 +0000</pubDate>
		<dc:creator>Ian Milligan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ianmilligan.ca/?p=1309</guid>
		<description><![CDATA[I gave my last Digital History class the day after posting my last post, about how Yahoo! has probably destroyed the most history, well, ever. I was motivated to write that post for two reasons: (1) anger that primary sources &#8230; <a href="http://ianmilligan.ca/2013/04/08/saving-history-the-eternal-responsibility-of-the-historian/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=ianmilligan.ca&#038;blog=1001957&#038;post=1309&#038;subd=ianmilli&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I gave my last Digital History class the day after posting my last post, about how <a title="Yahoo! Has Probably Destroyed the Most History, Ever – And Historians Need to Wake Up" href="http://ianmilligan.ca/2013/04/03/yahoo-sucks-historians-wake-up/">Yahoo! has probably destroyed the most history, well, ever</a>. I was motivated to write that post for two reasons: (1) anger that primary sources could be so cavalierly destroyed, in this era of falling storage costs; and (2) a realization that something like the GeoCities archive is of invaluable utility for a historian even today.</p>
<p>It made me realize <strong>how much history has probably been written because somebody didn&#8217;t throw something out</strong>. <a href="http://history.uwo.ca/People/Faculty/Vance.html">Jonathan Vance</a>, a historian at Western University, said as much at a keynote I attended last month &#8211; that his greatest fear was garbage day. The war records lost when somebody passes away and a son or daughter doesn&#8217;t know what they are. All those diskettes and hard drives that might just get tossed because &#8220;who cares about that old data.&#8221; The invaluable building blocks of a future social historian.</p>
<h3><strong>So I laid down a challenge to my students</strong>: that they were historians <strong>now</strong>, and that they&#8217;d always be historians.<span id="more-1309"></span></h3>
<p>And that would be true even if they didn&#8217;t go into academia, as most will not. They will still be historians if they go into the private sector, if they go into primary/secondary education, if they start working for the Ministry of Education.</p>
<p>And, probably getting increasingly animated as I paced around our lecture room, I tried to explain what that means to me. That beyond the usual things that we know a history education gets them (great reading/writing skills, an ability to evaluate competing narratives, evaluate sources, cut through the bullshit, etc.), that there was a responsibility &#8211; a responsibility to always <strong><span style="text-decoration:underline;">think like a historian</span></strong>.</p>
<p>Because I was wondering &#8211; what if a historian had been on the senior staff at Yahoo! What if a historian had been part of Google when <a href="http://news.softpedia.com/news/All-Videos-on-Google-Video-Will-Be-Deleted-Next-Month-195479.shtml">they planned on deleting Google Video</a>? (And by this I don&#8217;t mean somebody who might happen to have a History BA, picked up along the way &#8211; but somebody who thinks <span style="text-decoration:underline;">like a historian</span>)</p>
<p>That they should (and would, I argue):</p>
<ul>
<li><span style="line-height:13px;"><strong>Question the destruction of sources</strong>: raise a flag when somebody suggests shredding some documents, or throwing them out en masse;</span></li>
<li><strong>Think ahead to the future, especially when thinking of the digital</strong>: keep those old disks, keep those hard drives, keep backing up and making available to future generations (and to themselves, should they want to look back on their own past down the line);</li>
<li><strong>Keep it Open: </strong>I was so happy to see several of my students slapping Creative Commons licenses on their work, and moving out of the realm of Facebook and into websites. Facebook might not be preserved, but their WordPress site certainly will be &#8211; as will the data;</li>
<li><strong>If they see something important, keep it or at least be mindful of its potential. </strong>I made a joke about how it might not even be legal. Obviously, I don&#8217;t want my students breaking the law. How much of our history, especially around the contentious grey areas of our past, is possible because somehow something became available? Whether it&#8217;s advocating for declassification of materials, or in some cases, maybe even saving something for the distant future&#8230;. I dunno.</li>
</ul>
<p>I guess my challenge to them was that they&#8217;re hopefully going to get a lot out of a history degree. And that they <span style="text-decoration:underline;">are historians</span>, forever and ever. You don&#8217;t need a PhD to be a historian, let alone a BA. It&#8217;s a frame of mind, and one part of that ought to be preserving sources.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianmilligan.ca/2013/04/08/saving-history-the-eternal-responsibility-of-the-historian/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/ea9ce7cb1208469815a91c83a35b83ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ianmilligan1</media:title>
		</media:content>
	</item>
	</channel>
</rss>
