Quick Gender Detection Using Wolfram|Alpha

I had a question last night while navigating a 1.6-million-word corpus of approximately 7,000 guestbook entries: what were the genders of the people leaving them? This was exploratory on my part, and I wanted a rough answer relatively quickly. In the past, this is the sort of question that might have taken months (or the laborious efforts of a Research Assistant) to figure out. This time, about five minutes of coding gave me a rough answer.

If you remember from my post on Wolfram|Alpha (as well as a later one about the API for you non-Mathematica users), it has name detection built in. Give it a name and it will tell you whether it’s female or male, along with a host of other statistics: its popularity, the number of births a year in the United States, the estimated total alive, its rank, and even the most common age!

In Mathematica, then, we can query a name from a list with the following command:

res=WolframAlpha["name kelsey",{{"Input",1},"Plaintext"}]

Which responds with:

"Kelsey (female given name)"

Now, I use this example intentionally to show one of the major downsides. Kelsey is predominantly a female name, but it’s also a male name: think Kelsey Grammer. So this is not going to be precise, but it’s going to be close enough for a quick pass.

After extracting the guestbooks into plain text, I scraped the two words following “Name” in my trigram database. In Mathematica, that’s as simple as Cases[trigrams,{"name",_,_}][[All,2]] for first names, and Cases[trigrams,{"name",_,_}][[All,3]] for last names.
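To make that extraction step concrete, here is a minimal sketch on a toy sentence. The sample text is made up, and building the trigrams with Partition over lowercased, whitespace-split words is an assumption about how a trigram database like this could be put together, not necessarily how mine was built.

```mathematica
(* Hypothetical sample text standing in for the guestbook plain text *)
text = "Name Jane Doe wrote hello Name John Smith wrote hi";

(* Assumption: lowercase everything and split on whitespace *)
words = StringSplit[ToLowerCase[text]];

(* Overlapping three-word windows: {w1,w2,w3}, {w2,w3,w4}, ... *)
trigrams = Partition[words, 3, 1];

(* Keep trigrams that begin with "name", then pull out words 2 and 3 *)
firstnames = Cases[trigrams, {"name", _, _}][[All, 2]]  (* {"jane", "john"} *)
lastnames = Cases[trigrams, {"name", _, _}][[All, 3]]   (* {"doe", "smith"} *)
```

Anything that follows the word "name" gets picked up this way, names or not, which is one reason usernames and other noise end up in the list.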

So then a quick loop with counters will suffice, feeding those firstnames through that code: instead of “Kelsey” we put in the host of other names that we’ve scraped from the guestbooks. This is a bit sloppy, but it works. The female counter increases every time it finds a female name, the male counter whenever it finds a male name, and if a result meets neither criterion it goes to nil.

female=0;
male=0;
nil=0;
Do[
  query="name "<>word;
  res=WolframAlpha[query,{{"Input",1},"Plaintext"}];
  (* count each name exactly once; non-string or unrecognized results go to nil *)
  Which[
    !StringQ[res],nil=nil+1,
    StringMatchQ[res,"*(female given name)*"],female=female+1,
    StringMatchQ[res,"*(male given name)*"],male=male+1,
    True,nil=nil+1];
  ,{word,firstnames}];
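Since plenty of guestbook signers share a first name, the loop sends the same query over and over. One possible variation, sketched here rather than what I actually ran, asks Wolfram|Alpha once per distinct name and then weights each answer by how often the name appears. The helper classify is a hypothetical name introduced for illustration; Tally, Thread and Merge are built in, and Merge needs Mathematica 10 or later.

```mathematica
(* Hypothetical helper: label one result; non-strings count as "nil" *)
classify[res_] := Which[
  !StringQ[res], "nil",
  StringMatchQ[res, "*(female given name)*"], "female",
  StringMatchQ[res, "*(male given name)*"], "male",
  True, "nil"];

(* {{name, count}, ...} for each distinct first name *)
tallied = Tally[firstnames];

(* One Wolfram|Alpha query per distinct name *)
labels = Map[
  classify[WolframAlpha["name " <> #, {{"Input", 1}, "Plaintext"}]] &,
  tallied[[All, 1]]];

(* Add up the occurrence counts under each label *)
totals = Merge[Thread[labels -> tallied[[All, 2]]], Total]
```

The helper can be checked without touching the network: classify["Kelsey (female given name)"] gives "female".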

And the results? 2,949 female and 1,161 male names. With roughly 7,000 entries in the corpus, that means there were a lot of names not captured, and when looking into why, we see a lot of avatars and usernames. But still, interesting at first pass. Were more females than males leaving comments? Or were males more disposed to use nicknames?
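For a sense of proportion, the split among the names that were classified is a one-liner to work out. This only restates the two counts above and says nothing about the entries that went unclassified.

```mathematica
female = 2949; male = 1161;

classified = female + male   (* 4110 names with a detected gender *)
N[female/classified]         (* roughly 0.72 of the classified names *)
N[female/male]               (* roughly 2.5 female names per male name *)
```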

Next I can scrape and see what the average age of people leaving comments in 1999 would have been, based on names. Again, loosey-goosey, but hopefully provocative.
