Apache Tika: A New Addition to my Toolkit?

The GUI for Apache Tika.

I’m continuing to work in public, this time with some other workflow decisions I’m making. Hopefully people also find this useful when thinking about the various tools humanists might want to play with.

I’ve been dealing with a bunch of Internet Archive data, and want to be able to access it relatively quickly. One solution has been to use Apache Solr. Solr, however, likes to work with XML files (there are workarounds, but none seemed terribly satisfactory). Plus if I could generate XML files with good metadata….

Enter Apache Tika. Frankly, as a digital humanist, I was a bit surprised I hadn’t heard much about it through my usual networks as it’s right up our collective alley: “The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.”

Digital humanists love metadata, right? And structured data is fantastic! So let’s delve into this suite and see what we can find.

For me, I want a tool that’ll turn my arrays of text files and HTML into XML. Bonus features include language detection (I primarily work in English) and document type detection, which can be handy too. It might also replace some of my earlier use of textutil. So let’s set it up.

Setting Up Apache Tika

The ‘Getting Started’ instructions are generally quite good. I had some difficulty getting the provided files to work properly, but the source archive at http://archive.apache.org/dist/tika/apache-tika-1.3-src.zip worked perfectly. Here were my commands:

In my user directory:

mkdir tika

cd tika

wget http://archive.apache.org/dist/tika/apache-tika-1.3-src.zip

unzip -q apache-tika-1.3-src.zip

cd tika-1.3

mvn install

It should build relatively quickly. In the directory ~/tika/tika-1.3/tika-app/target is the Java app file, tika-app-1.3.jar. From that directory, launch the GUI to make sure it works:

java -jar tika-app-1.3.jar

A GUI should launch, and you can tinker around with stuff by dragging and dropping various files into it. But the real potential is in the command line.

Using It on Files

The GUI is nice, but not my cup of tea for a lot of work. We can do some automation on the command line, however.

Let’s take a text file that I’ve scraped from the Dictionary of Canadian Biography, with a nonsensical title (42133.html.txt).

BORDEN, Sir ROBERT LAIRD , lawyer and politician; b. 26 June 1854 in Grand Pré, N.S., first child of Andrew Borden and Eunice Jane Laird; m. 25 Sept. 1889 Laura Bond (d. 8 Sept. 1940) in Halifax; they had no children; d. 10 June 1937 in Ottawa.

Now let’s run the following command:

java -jar /users/ianmilligan1/solr/tika-1.3/tika-app/target/tika-app-1.3.jar -x -r 42133.html.txt > output.xml

We now have XML data:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<meta name="Content-Length" content="74491"/>
<meta name="Content-Encoding" content="UTF-8"/>
<meta name="Content-Type" content="text/plain; charset=UTF-8"/>
<meta name="resourceName" content="42133.html.txt"/>
<p>BORDEN, Sir ROBERT LAIRD , lawyer and politician; b. 26 June 1854 in Grand Pré, N.S., first child of Andrew Borden and Eunice Jane Laird; m. 25 Sept. 1889 Laura Bond (d. 8 Sept. 1940) in Halifax; they had no children; d. 10 June 1937 in Ottawa.</p>

It’s in English, too. Running:

java -jar /users/ianmilligan1/solr/tika-1.3/tika-app/target/tika-app-1.3.jar -l 42133.html.txt

reveals en. Could be handy.
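Here is a minimal sketch of how that language check might be used to pick out files in a given language. It assumes the -l flag prints just the language code (as it did above with en). The TIKA_JAR variable and the detect_language and list_by_language function names are my own hypothetical helpers, not part of Tika, and the default jar path is simply the one from my machine.

```shell
#!/bin/sh
# Assumption: path to the Tika app jar; override TIKA_JAR for your own setup.
TIKA_JAR="${TIKA_JAR:-/users/ianmilligan1/solr/tika-1.3/tika-app/target/tika-app-1.3.jar}"

# Hypothetical wrapper: print the language code Tika detects for one file.
detect_language() {
  java -jar "$TIKA_JAR" -l "$1"
}

# Hypothetical helper: list_by_language DIR LANG
# Prints every regular file in DIR whose detected language equals LANG.
list_by_language() {
  dir="$1"
  want="$2"
  for f in "$dir"/*; do
    [ -f "$f" ] || continue            # skip directories and empty globs
    lang="$(detect_language "$f")"
    if [ "$lang" = "$want" ]; then
      printf '%s\n' "$f"
    fi
  done
  return 0
}

# Usage (uncomment to run against a folder of scraped files):
# list_by_language ./scraped en
```

Starting a JVM once per file is slow on a big collection, so this is more of a spot-check tool than something to run across a whole crawl.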

We can also run this on regular HTML data, such as what I’ve scraped from DCB before turning it into text files. I won’t reprint content, but the metadata is similarly handy:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<link rel="stylesheet" href="CSS/Search.css"/>
<meta name="Content-Length" content="78834"/>
<meta name="Content-Encoding" content="windows-1252"/>
<meta name="Content-Type" content="text/html; charset=windows-1252"/>
<meta name="resourceName" content="42133.html"/>
<meta name="dc:title" content="Dictionary of Canadian Biography"/>
<title>Dictionary of Canadian Biography</title>
<tr> <td>
<tr> <td>

All of these commands can be built into scripts or other workflows, so I now have a handy way to create decent XML out of the material I’m pulling down from the Internet Archive, for example.
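As a rough sketch of what such a script could look like, the following loops over a folder of .txt files and writes one XML file per input, using the same -x -r flags as above. The TIKA_JAR variable, the run_tika and convert_dir function names, and the folder names in the usage line are hypothetical; adjust them for your own setup.

```shell
#!/bin/sh
# Assumption: path to the Tika app jar; override TIKA_JAR for your own setup.
TIKA_JAR="${TIKA_JAR:-/users/ianmilligan1/solr/tika-1.3/tika-app/target/tika-app-1.3.jar}"

# Hypothetical wrapper so the java call lives in one place.
run_tika() {
  java -jar "$TIKA_JAR" "$@"
}

# Hypothetical helper: convert_dir IN_DIR OUT_DIR
# For each IN_DIR/<name>.txt, write Tika's XML output to OUT_DIR/<name>.xml.
convert_dir() {
  in_dir="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.txt; do
    [ -e "$f" ] || continue            # skip when the glob matches nothing
    base="$(basename "$f" .txt)"
    run_tika -x -r "$f" > "$out_dir/$base.xml"
  done
  return 0
}

# Usage (uncomment to run):
# convert_dir ./scraped ./xml
```

With this naming, 42133.html.txt would become 42133.html.xml in the output folder, which keeps the link back to the original scrape obvious.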
