An Aside: Frequency Counting and Removing Stopwords with Bash

This is a pipeline cobbled together from William J. Turkel’s “Basic Text Analysis with Command Line Tools in Linux” and an adaptation of the StackOverflow discussion “Shell to Filter Prohibited Words on a File.” I have also been finding “Sort Files Like a Master with the Linux Sort Command (Bash)” extremely helpful.

If you haven’t checked out Turkel’s lessons, you should do so now. Consider this your advertisement!

The following command takes what Bill does in his class and bundles it all into a single pipeline, going a step further than the example he gives. It takes a text file, cleans up punctuation, normalizes it to lower case, and generates a word-frequency list:

tr -sc 'A-Za-z' '\12' < input.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr -d '\r' | sort | uniq -c | sort -nr > output_ngram.txt
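
If the one-liner is hard to read, here is the same pipeline spread out with a comment per stage. It should be functionally equivalent; '\n' is just the readable spelling of the octal '\12' above:

tr -sc 'A-Za-z' '\n' < input.txt |   # turn every run of non-letters into a newline, one word per line
tr -d '[:punct:]' |                  # strip any remaining punctuation
tr '[:upper:]' '[:lower:]' |         # lower-case everything
tr -d '\r' |                         # drop carriage returns from Windows line endings
sort | uniq -c | sort -nr > output_ngram.txt   # count each word, sort by frequency, highest first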

The next command runs that output through an adapted stopword-removal step. The file stopwords-flat.txt contains all of my English-language stopwords on a single line, separated only by spaces rather than newlines.
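
If your stopword list is in the more usual one-word-per-line format, tr will flatten it into that shape. (The name stopwords.txt below is just a placeholder for whatever your list is called.)

tr '\n' ' ' < stopwords.txt > stopwords-flat.txt   # stopwords.txt: your one-per-line list (placeholder name)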

awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($2 in w))' stopwords-flat.txt output_ngram.txt > results.txt
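
And if the awk one-liner looks like magic, here is the same program spread out with comments; it does exactly what the command above does:

awk '
  # First file (the stopword list): store every word as a key in array w
  FNR == NR { for (i = 1; i <= NF; i++) w[$i]; next }
  # Second file (the frequency list): uniq -c prints "count word",
  # so $2 is the word; keep the line only if the word is not a stopword
  !($2 in w)
' stopwords-flat.txt output_ngram.txt > results.txt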

The result is a word-frequency list with the stopwords stripped out!

Anyways, I spent a while this afternoon monkeying with this and figured that if somebody else is working on something similar, they might find it useful.
