Using the Internet Archive and Voyant in my Workflow: Early Internet Forums

In one of my projects, I have been using early Internet forums. In a nutshell, there were early CRTC hearings on regulating New Media. In collaboration with the University of Toronto’s McLuhan Program in Culture and Technology, the CRTC operated a web forum through 1998 to solicit Canadians’ opinions. While not as large as today’s web forums, it drew a considerable amount of feedback: over five hundred messages. It’s not the biggest part of my paper, but it was a good opportunity to explore using the Internet Archive as a historian.

So how do we fruitfully access these?

An advertisement for the CRTC's survey, pointing users towards http://www.newmedia-forum.net.

Firstly, I needed to find them. Luckily, through conventional research (using the Globe and Mail’s online portal, which has its own pros and cons), I found the URLs that the CRTC encouraged people to visit. So, using the Wayback Machine, I went there. You could, of course, find this forum from several other sources, including CRTC primary documents, still-preserved online discussions, and Electronic Frontier Canada websites.

Secondly, we find it in the Wayback Machine. While the website itself is lost today, a copy was preserved at the time, which is essential for any historian wanting to study this period. But it’s not very user friendly: the website seems archaic, information is spread across several pages, and I prefer to have it all on my home system for quick access.

So I wrote a quick program in Mathematica to scrape these, although you could adapt code and experiment with Python using the Programming Historian 2 to do the same. If any of you are using Mathematica, send me an e-mail and I can share the notebooks with you. I’ll also paste the code below. In a nutshell, we scrape the hyperlinks from each forum page (nineteen of them in one case, nine in another), learn that the forum posts have a particular format (in this case forum00123.html), and then pull them down in plaintext. We build pauses in and try to be as unobtrusive as possible.
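For readers taking the Python route instead, the link-harvesting step described above could be sketched roughly as follows. This is an illustrative adaptation, not the notebook used for this project: the names `LinkCollector`, `POST_PATTERN`, and `forum_post_names` are hypothetical, and the regular expression is only an assumption based on the forum00123.html example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Assumed filename pattern, based on the forum00123.html example in the text.
POST_PATTERN = re.compile(r"forum0\d+\.html")


def forum_post_names(html):
    """Return the distinct forum post filenames linked from one index page."""
    collector = LinkCollector()
    collector.feed(html)
    names = []
    for link in collector.links:
        match = POST_PATTERN.search(link)
        if match and match.group(0) not in names:
            names.append(match.group(0))
    return names
```

Given the HTML of one index page as a string, `forum_post_names(page_html)` returns a list such as `["forum00123.html", ...]`, which can then be handed to a download step.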

Then I have a few hundred text files in a directory, each an individual forum post. To explore them, I combine them into a single file in my terminal (on OS X). Navigate to the directory where they all are and build one big text file, using the command cat *.txt >> forum-ALL.txt. This useful command just brings everything together. If you run it a second time, delete forum-ALL.txt first: >> appends rather than overwrites, and the combined file itself matches *.txt.
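If you would rather do the same thing from Python, a minimal sketch is below. The function name `combine_posts` is hypothetical, and the choice to sort by filename and to skip the output file on reruns is an assumption of this sketch rather than something the cat command does.

```python
from pathlib import Path


def combine_posts(directory, output_name="forum-ALL.txt"):
    """Join every .txt file in a directory into a single file, in name order."""
    folder = Path(directory)
    output_path = folder / output_name
    # Skip the output file itself so a second run does not fold the old
    # combined file back into the new one.
    sources = sorted(p for p in folder.glob("*.txt") if p.name != output_name)
    with output_path.open("w", encoding="utf-8") as out:
        for source in sources:
            out.write(source.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")  # keep a line break between posts
    return output_path
```

Calling `combine_posts(".")` from the directory of scraped posts writes forum-ALL.txt alongside them and returns its path.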

Then I copy that text file, paste it into Voyant Tools, and voilà: a great way to navigate an entire Internet forum circa 1998. We see trends, big topics, and a quick way to move through it rather than dealing with all these individual files. Since it’s born digital, we have a pretty effective workflow, and we can apply it elsewhere.

Navigating the forum with voyant-tools.org.

Mathematica code for the scrape

Note that I’m using loops here. As I like to build pauses in and scrape fairly unobtrusively, this seems like the best way forward. It’s not pretty, but it does the job.

(* Step 1: collect the hyperlinks from each forum index page *)
baseurl =
  "http://web.archive.org/web/20001026021138/http://newmedia-forum.net/forum/mail";
suffixlist = Range[2, 19, 1];
end = ".html";

links = {};
Do[
  url = baseurl <> ToString[suffix] <> end;
  AppendTo[links, Import[url, "Hyperlinks"]];
  Pause[2]; (* wait between requests *)
  , {suffix, suffixlist}];

(* Keep only the part of each link that looks like /forum0....html *)
res = StringCases[Flatten[links],
   Shortest["/forum0" ~~ ___ ~~ ".html"]];

(* Step 2: download each post as plain text into its own file *)
SetDirectory["/users/ianmilligan1/desktop/crtc/new-media-forum/"];
prefix =
  "http://web.archive.org/web/20001026021138/http://newmedia-forum.net/forum/";
Do[
  var = Flatten[res][[x]];
  urlCall = prefix <> var;
  exportfile = StringTake[var, {2, 11}] <> ".txt"; (* e.g. forum00123.txt *)
  text = Import[urlCall, "Plaintext"];
  Export[exportfile, text];
  Pause[2];
  , {x, Range[1, Length[Flatten[res]], 1]}];
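For comparison, the second loop (download each post, save it as text, pause) might be adapted to Python along these lines. Treat it as a sketch under stated assumptions rather than a tested equivalent: `html_to_text`, `fetch`, and `save_posts` are hypothetical names, the tag stripping is much cruder than Mathematica's "Plaintext" import, and the fetch function is passed in as a parameter so the loop can be tried without touching the network.

```python
import time
import urllib.request
from html.parser import HTMLParser
from pathlib import Path

# Same Wayback Machine snapshot prefix as the Mathematica code above.
PREFIX = ("http://web.archive.org/web/20001026021138/"
          "http://newmedia-forum.net/forum/")


class TextExtractor(HTMLParser):
    """Crude plain-text extraction: keep text nodes, drop the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)


def fetch(url):
    """Download one page and return its HTML as a string."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def save_posts(post_names, out_dir, fetch_page=fetch, pause_seconds=2):
    """Save each forum post to its own .txt file, pausing between requests."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in post_names:
        page = fetch_page(PREFIX + name)
        target = out_path / (Path(name).stem + ".txt")
        target.write_text(html_to_text(page), encoding="utf-8")
        saved.append(target)
        time.sleep(pause_seconds)  # stay unobtrusive, as in the loop above
    return saved
```

Here `save_posts(["forum00123.html"], "new-media-forum")` would write new-media-forum/forum00123.txt; the two-second default mirrors the Pause[2] calls in the Mathematica version.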
