Tuesday, February 11, 2014

Finding Named Entities using R

Occasionally, I'll need to pick out names (first name, last name) from text. These days, the text I'm working with is usually tweets. Anyhow, I didn't see any solution out there (that worked for me) when I developed this, so hopefully it can be a starting point for somebody else with similar needs...

First, I start out with a list of names from the census bureau. I downloaded male first names, female first names, and last names and saved them as variables in R. I do take out some of the names as "exceptions" that screw up my process here, because they are also common words. Names like "In", "An", "Chi", "So", and so on.
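As a rough sketch of the exception step, assuming the name lists are already loaded as plain character vectors in the same capitalization as the text (the tiny vectors and the variable names below are made up for illustration, not the real census data):

```r
# Stand-in vectors; the real lists would be read from the census files.
first_names <- c("John", "Mary", "In", "An", "Chi", "So", "Toni")
last_names  <- c("Smith", "Preckwinkle", "So", "In")

# Names that are also everyday words and cause false matches.
exceptions <- c("In", "An", "Chi", "So")

# setdiff() keeps the elements of the first vector that are not in the second.
first_names <- setdiff(first_names, exceptions)
last_names  <- setdiff(last_names, exceptions)
```

The exception list is something you would grow by hand as you spot false positives in your own data.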

Then, I split my target text up into bigrams, that is, adjacent pairs of words in the original text...
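One way to build the bigrams in base R might look like the sketch below. The function name make_bigrams is hypothetical, and splitting on whitespace is an assumption; punctuation stays attached to the words, so real tweets may need some cleanup first.

```r
# Split a string into words and pair each word with the one that follows it.
make_bigrams <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[words != ""]              # drop the empty piece left by leading spaces
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))  # word i pasted to word i + 1
}

bigrams <- make_bigrams("Toni Preckwinkle spoke today")
```

For the example sentence, this gives "Toni Preckwinkle", "Preckwinkle spoke", and "spoke today".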

This returns every pair of words in the tweet. From here, I can look through each of these bigrams for names. To make the search for names a little easier, I throw out any bigram in which either word doesn't start with a capital letter.
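The capital-letter filter can be sketched with a single regular expression. The example vector and variable names here are illustrative only.

```r
bigrams <- c("Toni Preckwinkle", "Preckwinkle spoke", "spoke today", "In Chicago")

# Keep a bigram only if both of its words start with an uppercase letter.
both_capitalized <- grepl("^[[:upper:]]\\S* [[:upper:]]\\S*$", bigrams)
cap_bigrams <- bigrams[both_capitalized]
```

Note that "In Chicago" survives this step, which is exactly why words like "In" need to come out of the name lists.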

Now that I have a list of bigrams in which both words start with capital letters, I can compare the words to the name list to see if they are names. I start with the last name. If the second word in the bigram doesn't appear in the last name list, we can stop... there's no need to check the first name. If the second word is a last name, then we check the first word against the first names list. If both of these check out, we have ourselves a name. Here's the code for that...
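A minimal sketch of that check, assuming the name lists are character vectors stored in the same capitalization as the text. The function name is_name and the sample data are hypothetical, and this is not necessarily how the code in the repository is written.

```r
# TRUE if the second word is a known last name and the first is a known first name.
is_name <- function(bigram, first_names, last_names) {
  parts <- strsplit(bigram, " ", fixed = TRUE)[[1]]
  if (length(parts) != 2) return(FALSE)
  # && short-circuits: the first-name lookup only runs if the last name matched.
  parts[2] %in% last_names && parts[1] %in% first_names
}

first_names <- c("Toni", "John", "Mary")
last_names  <- c("Preckwinkle", "Smith")
cap_bigrams <- c("Toni Preckwinkle", "Cook County", "Mary Chicago")

hits <- vapply(cap_bigrams, is_name, logical(1),
               first_names = first_names, last_names = last_names,
               USE.NAMES = FALSE)
found <- cap_bigrams[hits]
```

Here only "Toni Preckwinkle" passes both lookups. With lists as long as the census ones, it may be worth checking the shorter list first, but the last-name-first order mirrors the description above.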

The full code for this can be found here... https://github.com/corynissen/cook-county-tweet-dashboard/blob/master/cctweets/findNames.R

I have tried the openNLP package for this and couldn't get it to work reliably and quickly, so I made my own. If you have any suggestions on how to do this better, let me know!

Follow me on Twitter... https://twitter.com/corynissen
