English-language entity recognition relies in part on recognizing proper nouns. There are a number of indicators that a word is a proper noun, such as its position in the sentence or the types of words around it, but capitalization is normally a very good clue.
We do the same thing when we read. In the sentence “I’ve rarely met a coffee hound like Tim Mohler,” our eyes instantly pick out the capitals. There are many places where capitalization breaks down, though, such as Twitter, or in this case, headlines.
Many news sources capitalize all nouns in a headline, or even all words, which brings us to our Fail of the Month:
“Orlando Bloom And Miranda Kerr Split”
Salience lists two entities:
- Orlando Bloom
- Miranda Kerr Split
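To see why a capitalization clue goes wrong here, consider a toy sketch. It is not how Salience works internally; the function name `capitalized_runs` and the small function-word list are made up for illustration. It simply groups consecutive capitalized words into candidate entities and breaks on common function words:

```python
# Toy illustration only: a naive capitalization-based entity guesser.
# FUNCTION_WORDS and capitalized_runs are hypothetical names, not part of Salience.
FUNCTION_WORDS = {"and", "or", "of", "the", "a", "an", "in", "on", "for", "to"}


def capitalized_runs(text):
    """Group consecutive capitalized words into candidate entities."""
    candidates, current = [], []
    for word in text.split():
        token = word.strip(".,;:!?\"'")
        if token and token[0].isupper() and token.lower() not in FUNCTION_WORDS:
            # Capitalized content word: extend the current candidate.
            current.append(token)
        else:
            # Lowercase word or function word: close off the current candidate.
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


headline = "Orlando Bloom And Miranda Kerr Split"
body = "actors Orlando Bloom and Miranda Kerr split this week"

print(capitalized_runs(headline))  # ['Orlando Bloom', 'Miranda Kerr Split']
print(capitalized_runs(body))      # ['Orlando Bloom', 'Miranda Kerr']
```

In ordinary sentence case the lowercase verb ends the name cleanly. In the headline, the capitalized “Split” looks just like another part of the name.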
Simply skipping the headline doesn’t work very well, since many articles put important information in the headline that is never repeated in the body. What to do? In this case, Salience provides a solution: when you know a piece of text may have wonky capitalization like this, you can set an option in your call to Salience that flattens all capitals, giving us:
“orlando bloom and miranda kerr split”
Which in turn gives us:
- orlando bloom
- miranda kerr
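Without the errant capitalization, Salience actually has an easier time identifying which of these are proper nouns, because it can fall back on the other indicators, such as sentence position and the surrounding words.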
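If you want to flatten capitals only where they are likely to be misleading, one option is a small pre-processing step before the text reaches your entity engine. The sketch below is an assumption-laden example, not a Salience feature: the helper names and the 0.8 threshold are hypothetical. It lowercases a string only when nearly every word in it is capitalized:

```python
# Hypothetical pre-processing helpers; names and threshold are illustrative
# assumptions, not part of the Salience API.
def looks_headline_cased(text, threshold=0.8):
    """Return True when most alphabetic words start with a capital letter."""
    words = [w for w in text.split() if w[0].isalpha()]
    if len(words) < 2:
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) >= threshold


def flatten_if_headline(text):
    """Lowercase text that looks headline-cased; leave other text untouched."""
    return text.lower() if looks_headline_cased(text) else text


print(flatten_if_headline("Orlando Bloom And Miranda Kerr Split"))
print(flatten_if_headline("I've rarely met a coffee hound like Tim Mohler"))
```

The first call returns the flattened headline, while the second leaves the ordinary sentence alone, so its capitals stay available as a clue. Flattening only helps if the engine downstream can find proper nouns without relying on capitalization, as Salience does here.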