Using Mathematica to Generate Plain Text Files of Mirrored Web Site Archives

I’ve been spending a bit of time playing with the GeoCities Torrent again (in anticipation of being able to compare it to some of the WARC files), with an eye to a few articles. One script in particular, derived from an old StackOverflow question I asked three years ago, has been extremely helpful. In short, given a directory of websites like so:

/users/me/webarchive/www.geocities.com/enchantedForest/1002/index.html
/users/me/webarchive/www.geocities.com/enchantedForest/1003/index.html
/users/me/webarchive/www.geocities.com/enchantedForest/1004/index.html

along with all of the other files (GIFs, HTML, subdirectories, subfolders nested inside subfolders), I often just want to play with some of the plain text! Luckily, Mathematica has a powerful Import command that can convert HTML into generally good human-readable plain text. For some of my clustering, topic modelling, etc., where I don’t want links or stylistic information, this is really useful.
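As a quick illustration of that conversion, here is a minimal sketch. The HTML fragment is made up for the example, and it uses ImportString rather than Import so that no file on disk is needed; the "Plaintext" element is the same one the script further down asks for.

```mathematica
(* A made-up HTML fragment standing in for a GeoCities page. *)
html = "<html><body><h1>Welcome</h1><p>Visit my <a href=\"links.html\">links page</a>.</p></body></html>";

(* Ask for the "Plaintext" element of the HTML: the markup is dropped and the text is kept. *)
text = ImportString[html, {"HTML", "Plaintext"}]
```

The result is an ordinary string, so it can go straight into StringSplit, WordCounts, or whatever text processing comes next.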

This isn’t for standard web archives, but more for the torrents, end-of-life dumps, and other things that I find myself working with when I’m playing in the wild.

It’s simple, and I use it a ton, so I figured it might help you too.

The modules:

(* Map each file under source to the same relative path under target. *)
mapFileNames[source_, filenames_, target_] := 
 Module[{depth = FileNameDepth[source]}, 
  FileNameJoin[{target, FileNameDrop[#, depth]}] & /@ filenames]

htmlTreeToPlainText[source_, target_] := 
 Module[{htmlFiles, textFiles, targetDirs}, 
  (* Find every .html file under source, at any depth. *)
  htmlFiles = FileNames["*.html", source, Infinity]; 
  (* Compute the matching .txt path under target for each one. *)
  textFiles = 
   StringReplace[mapFileNames[source, htmlFiles, target], 
    f__ ~~ ".html" ~~ EndOfString :> f ~~ ".txt"]; 
  targetDirs = DeleteDuplicates[FileNameDrop[#, -1] & /@ textFiles]; 
  (* Start from a clean slate: any existing target is deleted. *)
  If[FileExistsQ[target], 
   DeleteDirectory[target, DeleteContents -> True]]; 
  Scan[CreateDirectory[#, CreateIntermediateDirectories -> True] &, 
   targetDirs]; 
  (* Import each page as plain text and write it out. *)
  Scan[Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &, 
   Transpose[{htmlFiles, textFiles}]]]
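
To make the path handling concrete, here is a small sketch of what happens to a single file name. The source and target values are hypothetical relative paths chosen for illustration (they are not the paths used above), and nothing is read from or written to disk.

```mathematica
(* Hypothetical relative paths, mirroring the layout shown earlier. *)
source = FileNameJoin[{"webarchive", "www.geocities.com"}];
target = "plaintext";
htmlFile = FileNameJoin[{source, "enchantedForest", "1002", "index.html"}];

(* Drop the leading path elements that belong to source... *)
relative = FileNameDrop[htmlFile, FileNameDepth[source]];

(* ...re-root what is left under target, then swap the extension. *)
textFile = StringReplace[FileNameJoin[{target, relative}],
   f__ ~~ ".html" ~~ EndOfString :> f ~~ ".txt"]
```

On a Unix-like system this should give plaintext/enchantedForest/1002/index.txt, which is why the output tree ends up with the same shape as the input tree.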

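To run it, set two variables:

origin = the directory that holds the mirrored directory structure

and

target = the directory where you want the replicated plain-text directory structure to reside. Be careful here: if target already exists, the script deletes it and everything in it before writing.

Then call it: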

htmlTreeToPlainText[origin, target];
