Named Entity Extraction

Lexalytics' text analysis software supports five different methods for extracting named entities from your text. Named entities matter because the software goes on to calculate other metrics for each entity, such as the entity's sentiment analysis score or its associated themes.

Salience has been trained to automatically extract People, Places, Companies, Products, URLs, Phone Numbers, and more. You don't have to configure anything. However, there is a lot that you *can* configure if you want to improve on what the named entity recognition system does automatically.

 

[Figure: Named Entity Recognition]

The five methods for extracting named entities are as follows:

  • Lists: simple lists of named entities, such as a list of car manufacturers, a list of people, or a list of trees.
  • Patterns: part-of-speech patterns that let you treat noun phrases, particular verb phrases, or any other part-of-speech pattern that matters to you as named entities.
  • Regular Expressions: using the Boost Regex library, you can define things like phone numbers, date ranges, and email address types and have them treated as named entities.
  • CRF Model: Lexalytics provides a pre-trained Conditional Random Field model that already recognizes the following named entity types: Person, Date, Place, Company, Product, and Job Title.
  • MaxEnt Model: if you need to train a model to recognize things like "types of cancer", we provide the tools for you to do that with our Maximum Entropy-based model.
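To make the regular expression method concrete, here is a minimal sketch of a phone number pattern whose matches are treated as named entities. It uses Python's `re` module rather than Boost Regex, and the pattern, the `extract_phone_entities` helper, and the record fields are all assumptions for illustration; they are not the rules or data structures that ship with the software.

```python
import re

# Illustrative North American phone pattern (an assumption, not a shipped rule).
# The \b and \d constructs used here are Perl-style syntax.
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")


def extract_phone_entities(text):
    """Return each regex match as a hypothetical entity record."""
    return [
        {"type": "Phone Number", "text": m.group(0), "start": m.start()}
        for m in PHONE.finditer(text)
    ]


sample = "Call 617-555-0134 or 413.555.0199 for details."
for entity in extract_phone_entities(sample):
    print(entity)
```

The same idea extends to date ranges or email address types: one pattern per entity type, with every match reported as an entity of that type.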

Note that all of these extraction types operate simultaneously, with priority given to the simpler techniques. For example, if you put "Lexalytics" in a list as a company, and the model is only 80% certain that Lexalytics is a company, the text analysis software will be 100% certain it has picked out Lexalytics as a company, because you put it in a list.
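The priority rule can be sketched roughly as follows. The `Candidate` class, the `resolve` function, and the exact ranking of the five methods are hypothetical; the only thing taken from the description above is that a simpler technique, such as a list, wins over a model.

```python
from dataclasses import dataclass

# Assumed ranking (lower wins). The documentation only says that simpler
# techniques take priority; this specific order is a guess for illustration.
METHOD_RANK = {"list": 0, "pattern": 1, "regex": 2, "crf": 3, "maxent": 4}


@dataclass
class Candidate:
    text: str
    entity_type: str
    method: str
    certainty: float


def resolve(candidates):
    """Keep one candidate per surface text, preferring the simplest method."""
    best = {}
    for c in candidates:
        key = c.text.lower()
        current = best.get(key)
        if current is None or METHOD_RANK[c.method] < METHOD_RANK[current.method]:
            best[key] = c
    return list(best.values())


candidates = [
    Candidate("Lexalytics", "Company", "crf", 0.80),   # the model is 80% certain
    Candidate("Lexalytics", "Company", "list", 1.00),  # the user listed it
]
winner = resolve(candidates)[0]
print(winner.method, winner.certainty)
```

Here the list entry replaces the model's 80% guess, which mirrors the "Lexalytics" example.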

When the named entity object is returned, it has certain parameters associated with it, as below:

[Figure: Entities Returned Information]

Named entities are always returned in a normalized form. All of the different ways that a name is expressed (President Obama, Obama, Barack Obama, Barack Hussein Obama) get normalized to a single name: President Barack Obama (or whatever you choose for it).

Mentions is, well, the number of times the named entity is mentioned in the content.
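As a rough illustration of how normalization and mention counts fit together, the sketch below maps each variant to one normalized name and then counts mentions per normalized entity. The `ALIASES` table and both helper functions are hypothetical, and the software's own normalization is not limited to a lookup table like this one.

```python
from collections import Counter

# Hypothetical alias table: each variant maps to the normalized name you chose.
ALIASES = {
    "president obama": "President Barack Obama",
    "obama": "President Barack Obama",
    "barack obama": "President Barack Obama",
    "barack hussein obama": "President Barack Obama",
}


def normalize(mention):
    """Return the normalized name, or the mention itself if it has no alias."""
    cleaned = mention.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def count_mentions(mentions):
    """Count mentions per normalized entity name."""
    return Counter(normalize(m) for m in mentions)


found = ["President Obama", "Obama", "Barack Obama", "Barack Hussein Obama", "Aruba"]
print(count_mentions(found))
```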

Labels are user-assignable; consider a label to be a sub-category of the named entity type. For example, if you have an entity type of "company", you could have labels of "customer", "competitor", "partner", "supplier", or whatever you choose.

Sentiment is the tone for that specific named entity in that document. It is returned as a positive or negative magnitude that represents the amalgamation of all mentions of the named entity in the content. We do not decide the ranges for "positive", "negative", and "neutral"; that's up to you in your sentiment analysis tools. We just say whether the score is positive or negative. To be clearer, if the sentiment is +0.1, that's probably neutral. However, if it's +1.1, then it's certainly positive. Setting the boundaries of neutral is up to the individual implementation, though we recommend a boundary of -0.2 to +0.2.
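Turning the raw score into a positive, negative, or neutral label is left to your own code. A minimal sketch, assuming the recommended -0.2 to +0.2 neutral band and treating the boundary values themselves as neutral (that last choice is an assumption), might look like this; `classify_sentiment` is a hypothetical helper, not part of the product.

```python
def classify_sentiment(score, neutral_low=-0.2, neutral_high=0.2):
    """Map a sentiment magnitude to a label using a configurable neutral band."""
    if score > neutral_high:
        return "positive"
    if score < neutral_low:
        return "negative"
    return "neutral"


for score in (0.1, 1.1, -0.5):
    print(score, classify_sentiment(score))
```

Widening or narrowing the band is just a matter of passing different `neutral_low` and `neutral_high` values.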

Evidence is measured on a scale of 1 to 7. The higher the number, the more sentiment-bearing phrases we could associate with the named entity, and thus the greater our confidence in the sentiment analysis. Some of our customers disregard this number; others use it as a cutoff for considering sentiment. In other words, they throw out any sentiment score whose evidence score falls below a certain point.
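The cutoff approach can be sketched as a simple filter. The dictionary fields, the sample values, and the default threshold of 4 are all made up for illustration; pick whatever cutoff suits your data.

```python
def filter_by_evidence(entities, min_evidence=4):
    """Keep only entities whose evidence score meets the cutoff."""
    return [e for e in entities if e["evidence"] >= min_evidence]


# Hypothetical entity records; real results carry more fields than these.
entities = [
    {"name": "Lexalytics", "sentiment": 0.9, "evidence": 6},
    {"name": "Aruba", "sentiment": -0.4, "evidence": 2},
]
for entity in filter_by_evidence(entities):
    print(entity["name"], entity["sentiment"])
```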

Confidence is specific to named entities that have associated Boolean queries. It is, by default, 1. "Aruba" is both a place (with nice beaches!) and a wireless networking company (with nice APs!). To help the text mining software differentiate between the two, you could configure a confidence query such as (aruba AND (network OR wireless)), which would eliminate most articles about the place. Unless the article was talking about wireless networking in Aruba, but you get the point. It should be noted that the text mining software does a pretty good job of distinguishing between Aruba (beaches) and Aruba (wireless networking) all on its own, because our statistical model for named entity recognition and extraction has been well trained to recognize the difference between a place and a company.
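To show the logic of the query (aruba AND (network OR wireless)), here is a small sketch that evaluates it against plain text. It hard-codes this one query with exact word matching; the function names are hypothetical, and the real query engine has its own syntax and matching behavior.

```python
import re


def tokens(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_aruba_company(text):
    """Evaluate: aruba AND (network OR wireless), using exact word matches."""
    words = tokens(text)
    return "aruba" in words and ("network" in words or "wireless" in words)


documents = [
    "Aruba ships new wireless access points",
    "We spent a week on the beaches of Aruba",
    "Wireless coverage in Aruba is improving near the beaches",
]
for doc in documents:
    print(matches_aruba_company(doc), doc)
```

The third document is about the place but still matches, which is the edge case described above.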
