Exploring the USENET Archive: Early Thoughts

USENET was a “worldwide distributed Internet discussion system,” as Wikipedia notes, and it deserves to be understood as an unparalleled and fascinating source of historical information. Beginning in 1980, users could post individual messages to an ever-increasing array of newsgroups, share pictures as binary files, and engage in wide-ranging and extensive international conversations. My first forays onto USENET took place in 1996 (when I was 11 years old – I wrote well, but man, was I a geek – and was using my dad Peter’s e-mail address), so I’m now curious about what I, as an adult historian, can do with this resource.

Think about it: we use letters to the editor, newspaper articles, cultural commentaries, etc., in our work. But here we have a centralized record of discussions among Canadians spanning decades. Sure – it’ll skew male, white, and middle-class (which is backed up by Canadian government reports on ‘net users) – but so do most sources.

This is just a first stab.

A USENET Sitemap c. 1981. Click through for the original post.

From DejaNews to Google Groups: Modern Ways to Search this Material

The first big archive of USENET postings appeared in 1995 as Deja News. Its functionality had died by 2001, and the archive was folded into the Google Groups search engine shortly thereafter. In Google Groups, you can search for historic terms and extract specific messages and threads from USENET going back to 1981.

Unfortunately, Google Groups isn’t perfect. Its default ‘new’ user interface is ghastly, lacking advanced search functionality or any means to navigate large bodies of information. You can still use the legacy interface (the ‘old Google Groups,’ which you’ll be continuously reminded will disappear), but it can often be hard to navigate to specific date ranges or topics, or to keep results ordered in the chronological fashion you might want (i.e. from oldest to newest, as opposed to from newest to oldest). But still, its Advanced Search is incredibly useful. Perhaps more significantly to many of my readers, it’s also online only and thus locked away from a lot of analytical tools we might want to use (that said, with the old interface and a specific question, I used Mathematica to scrape an entire array of messages relating to the Internet Archive).

David Wiseman with the 141 magtapes from Henry Spencer’s USENET collection. Click here for the full story.

A Massive Archive: The USENET Archive of UTZOO Tapes

Wanting to experiment with another form of Big Data, as opposed to specific keyword searching through Google Groups, I turned to the Internet Archive. Here we have the “Usenet Archive of UTZOO Tapes (December 11, 2001).” Henry Spencer, who had been in charge of the University of Toronto’s Zoology department’s computing systems (hence UTZOO), created an archive of some 2.1 million messages, stored on magnetic tapes. They’re now largely mirrored and available at the Internet Archive, where you can download individual compressed files of each tape or you can torrent the entire collection. It’s 2.1GB compressed, over 7GB uncompressed.

A lot of text.

Luckily a torrent can come to our rescue.

Files are arranged by tape (141 of them), and then in varying hierarchies (replicating the USENET hierarchy – i.e. tor/news for a message in the Toronto News group). It takes some monkeying around, but you can have it all on your system pretty quickly.

From Clump of Data to Something I Can Work With

My first step, right now, as I start to play with this material is to build a sub-set of USENET messages from specifically Canadian groups. Canadian groups don’t really appear until 1983, so you’re missing two years of the USENET archive, and even more critically, a ton of Canadians are obviously having conversations outside of these specific groups. Still, it’s a start and a good training set.

A list of Canadian USENET groups:

{"ab","acadia","acs","bc","bison","brocku","cabot","calgary",
"can","carleton","concordia","cs","dal","edm","eye","flora",
"govonca","hamilton","hfx","hookup","hum","inforamp","interlog",
"kingston","kw","laurentian","man","mcgill","mcmaster","mtl",
"mun","nanaimo","nb","ncf","nf","niagara","ns","nwt","ont",
"ott","pei","pnw","qc","queens","redarmy","rye","sfu","simcoe",
"sj","sk","socs","stmarys","sudbury","tor","torfree","trentu",
"tvontario","ualberta","ubc","ucalgary","udes","ulaval",
"umoncton","umontreal","unb","uqam","usask","ut","uvic","uw",
"uwindsor","uwo","van","vic","vifa","vikingis","wimsey","wlu",
"wpg","yk","york"};
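For anyone who wants to try something similar, here is a minimal Python sketch of how a list like this could be used to pull out the Canadian sub-set. It is not the code used for this post. It assumes each message path looks like tape/hierarchy/group/message (the tape directory names below are made up), and the function names are purely illustrative. Only a handful of the hierarchies are included to keep it short.

```python
from pathlib import PurePosixPath

# A small subset of the hierarchies listed above, for illustration only.
CANADIAN_HIERARCHIES = {"ab", "bc", "can", "ont", "ott", "tor", "ut", "van"}


def top_level_hierarchy(message_path, tape_depth=1):
    """Return the first path component after the tape directory(ies).

    tape_depth is an assumption about the layout: the number of leading
    directories that belong to the tape rather than the newsgroup.
    """
    parts = PurePosixPath(message_path).parts
    return parts[tape_depth] if len(parts) > tape_depth else None


def select_canadian(paths, hierarchies=CANADIAN_HIERARCHIES, tape_depth=1):
    """Keep only the paths whose top-level hierarchy is in the given set."""
    return [p for p in paths
            if top_level_hierarchy(p, tape_depth) in hierarchies]


paths = [
    "tape001/tor/news/12",      # hypothetical paths
    "tape001/net/unix/3",
    "tape002/can/general/7",
    "tape003/net/can/1",        # "can" is not the top level here
]
canadian = select_canadian(paths)
```

Matching only on the top-level component matters because short names such as "can", "cs", or "man" could easily show up deeper inside unrelated groups.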

As a historian, I really want to deal with dates and would like the option to group various areas together, with messages arranged by date. So I wrote a short program to open up each message, extract the Date: field that each has, and rename the file based on the exact time it was posted.
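The renaming step could look something like the Python sketch below. This is an illustration under assumptions rather than the program described above: it relies on the standard library's e-mail date parser, which expects RFC-style dates, so the oldest messages with nonstandard date formats may simply come back as None. The function names and the filename pattern are my own choices.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def posted_at(message_text):
    """Parse the Date: header; return a naive UTC datetime, or None."""
    raw = message_from_string(message_text)["Date"]
    if raw is None:
        return None
    try:
        dt = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Unparseable date (the exception type varies by Python version).
        return None
    if dt.tzinfo is not None:
        # Normalize to UTC so filenames sort in true chronological order.
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt


def dated_filename(message_text, original_name):
    """Build a sortable name such as 19850314-151530_123, or None."""
    dt = posted_at(message_text)
    if dt is None:
        return None
    return f"{dt:%Y%m%d-%H%M%S}_{original_name}"


sample = (
    "From: user@example.ca\n"
    "Newsgroups: tor.news\n"
    "Date: Thu, 14 Mar 1985 10:15:30 -0500\n"
    "\n"
    "Hello from Toronto.\n"
)
new_name = dated_filename(sample, "123")
```

Keeping the original name as a suffix avoids collisions when two messages carry the same timestamp.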

A few preliminary results:

Topic modelling is fruitful, but I will have to further refine my data to keep header data separate from the body of each message. Still, as the image demonstrates, there are some signals in the data: increasing concerns around taxes, some drop-off in computer topics, and spikes around the free trade debate, among others.
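Separating headers from bodies is fairly mechanical when a message follows the usual layout of header lines, a blank line, and then the text. A rough Python sketch, assuming that layout (very early messages may not follow it) and using a made-up function name:

```python
def split_message(message_text):
    """Split a message into (header_block, body) at the first blank line.

    If there is no blank line, the whole text is treated as body, which
    is a guess: it keeps the text available for topic modelling rather
    than silently dropping it.
    """
    normalized = message_text.replace("\r\n", "\n")
    header, separator, body = normalized.partition("\n\n")
    if not separator:
        return "", normalized
    return header, body


sample = (
    "From: user@example.ca\n"
    "Subject: Free trade\n"
    "\n"
    "Taxes are up again.\n"
)
header, body = split_message(sample)
```

Feeding only the body into the topic model should keep header vocabulary (paths, hostnames, newsgroup names) from swamping the topics.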

Keyword searching should be useful too, although I’m not sure what it will get me.

I’ll have to extract names and, probably most importantly, e-mail addresses to see (a) how many users we have and (b) whether we can reconstruct a network of the Canadian USENET.
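As a starting point, both questions could be approached along the lines of the Python sketch below. It makes several assumptions: that posters can be identified by the address in the From: header (older messages often use bang paths instead, which this does not handle specially), and that replies can be linked through the Message-ID: and References: headers, which not every message will carry. All function names are illustrative.

```python
from collections import Counter
from email import message_from_string
from email.utils import parseaddr


def _address(msg):
    """Return the lower-cased address from a parsed message's From: header."""
    raw = msg["From"]
    if raw is None:
        return None
    return parseaddr(raw)[1].lower() or None


def count_posters(messages):
    """Count messages per distinct From: address."""
    parsed = (message_from_string(m) for m in messages)
    return Counter(a for a in map(_address, parsed) if a)


def reply_edges(messages):
    """Count (replier, original poster) pairs using the References: header."""
    parsed = [message_from_string(m) for m in messages]
    # Map each Message-ID to the address that posted it.
    author_of = {}
    for msg in parsed:
        message_id, author = msg["Message-ID"], _address(msg)
        if message_id and author:
            author_of[message_id.strip()] = author
    edges = Counter()
    for msg in parsed:
        references = (msg["References"] or "").split()
        replier = _address(msg)
        if not references or replier is None:
            continue
        # Treat the last referenced ID as the message being replied to.
        parent = author_of.get(references[-1])
        if parent is not None:
            edges[(replier, parent)] += 1
    return edges


messages = [
    "From: jane@example.ca (Jane)\nMessage-ID: <1@utzoo>\n\nhi",
    "From: Bob <bob@example.ca>\nMessage-ID: <2@utcsrgv>\n"
    "References: <1@utzoo>\n\nreply",
    "From: jane@example.ca\nMessage-ID: <3@utzoo>\n"
    "References: <1@utzoo> <2@utcsrgv>\n\nre",
]
posters = count_posters(messages)
edges = reply_edges(messages)
```

The length of the poster count gives a rough number of users, and the edge counts could be loaded into a graph tool to look at who was talking to whom.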

All of this is to start thinking about how we can use this unique source as a social, cultural, and even political record of the past.

TOPIC MODELS: