Named Entity Extraction

Overview:

Salience automatically extracts companies, people, places, products, dates, URLs, hashtags, @mentions, phone numbers, currency amounts, and more. To configure how Salience extracts entities, you can choose from several methods, ranging from simple lists to trained models.

Named entities matter because Salience goes on to calculate other metrics for each entity, such as the entity's sentiment score and its associated themes.

Details:

Salience combines five different methods to extract named entities in order to provide the highest-quality results:

Named Entity Recognition

Lists: simple lists of named entities, such as a list of car manufacturers, a list of people, or a list of trees.

Patterns: part-of-speech patterns that let you treat noun phrases, particular verb phrases, or any other part-of-speech pattern that matters to you as named entities.

Regular Expressions: using the Boost RegEx library, you can define things like phone numbers, date ranges, and email address types, and have them treated as named entities.

CRF Model: Lexalytics provides a pre-trained Conditional Random Field model that already recognizes the following named entity types: Person, Date, Place, Company, Product, Job Title.

MaxEnt Model: if you need to train a model to recognize things like "types of cancer", we provide the tools for you to do that with our Maximum Entropy-based model.
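To make the two simplest methods concrete, here is a minimal sketch of list-based and regular-expression-based extraction. It is not the Salience API: it uses Python's standard `re` module rather than Boost RegEx, and the names `CAR_MANUFACTURERS`, `PHONE_PATTERN`, and `extract_entities` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical list of car manufacturers (illustrative entries only).
CAR_MANUFACTURERS = ["Ford", "Toyota", "Honda"]

# Simplified pattern for phone numbers written like 555-123-4567.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_entities(text):
    """Return (entity_text, entity_type) pairs found by list lookup and regex."""
    found = []
    for name in CAR_MANUFACTURERS:
        # Whole-word match so "Ford" does not match inside "Fordham".
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, "Company"))
    for match in PHONE_PATTERN.finditer(text):
        found.append((match.group(0), "Phone"))
    return found


print(extract_entities("Call Ford at 555-123-4567 about the recall."))
```

A real deployment would define its lists and expressions in Salience's own configuration; the sketch only shows the kind of matching each method performs.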

Note that all of these extraction methods operate simultaneously, with priority given to the simpler techniques. For example, if you put "Lexalytics" in a list as a company, and the model is only 80% certain that Lexalytics is a company, Salience will still be 100% certain it has picked out Lexalytics as a company, because you put it in a list.
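The priority rule can be sketched as follows. This is an illustration under stated assumptions, not how Salience is implemented: the `Candidate` type, the `resolve` function, and the exact ordering in `METHOD_PRIORITY` are hypothetical. The source only says that simpler techniques take priority, so the relative order of the methods below is a guess.

```python
from dataclasses import dataclass

# Lower number means higher priority. The exact order is an assumption;
# the documentation only states that simpler techniques win.
METHOD_PRIORITY = {"list": 0, "pattern": 1, "regex": 2, "crf": 3, "maxent": 4}


@dataclass
class Candidate:
    text: str
    entity_type: str
    method: str        # one of the keys in METHOD_PRIORITY
    confidence: float  # 1.0 for user-supplied lists


def rank(candidate):
    """Sort key: simpler method first, then higher confidence."""
    return (METHOD_PRIORITY[candidate.method], -candidate.confidence)


def resolve(candidates):
    """Keep one candidate per entity text, preferring the simpler method."""
    best = {}
    for candidate in candidates:
        current = best.get(candidate.text)
        if current is None or rank(candidate) < rank(current):
            best[candidate.text] = candidate
    return best


winners = resolve([
    Candidate("Lexalytics", "Company", "crf", 0.8),
    Candidate("Lexalytics", "Company", "list", 1.0),
])
print(winners["Lexalytics"].method, winners["Lexalytics"].confidence)
```

Here the list entry wins over the model's 80% guess, which mirrors the "Lexalytics" example in the text.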

When a named entity object is returned, it has certain parameters associated with it, as described below:

Entities Returned Information

Named entities are always returned in normalized form. All of the different ways a name is expressed (President Obama, Obama, Barack Obama, Barack Hussein Obama) are normalized to a single name: President Barack Obama (or whatever you choose for it).

Mentions is the number of times the named entity is mentioned in the content.
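Normalization and mention counting can be pictured with a small sketch. The alias table and the `count_mentions` function are hypothetical and are not part of Salience; the sketch assumes the surface forms have already been extracted from the text.

```python
from collections import Counter

# Hypothetical alias table mapping each surface form to its normalized name.
ALIASES = {
    "President Obama": "President Barack Obama",
    "Obama": "President Barack Obama",
    "Barack Obama": "President Barack Obama",
    "Barack Hussein Obama": "President Barack Obama",
}


def count_mentions(surface_forms):
    """Count mentions per normalized entity name.

    Forms without an alias entry are kept unchanged.
    """
    return Counter(ALIASES.get(form, form) for form in surface_forms)


mentions = count_mentions(["Obama", "Barack Obama", "President Obama", "Lexalytics"])
print(mentions)
```

Every variant of the name contributes to a single normalized entity, so the mention count reflects all of them together.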

Labels are user-assignable; consider a label to be a sub-category of the named entity type. For example, if you have an entity type of "company", you could have labels of "customer", "competitor", "partner", "supplier", or whatever you choose.

Sentiment is the tone for that specific named entity in that document. It is returned as a positive or negative magnitude that represents the amalgamation of all mentions of the named entity in the content. We do not decide the ranges for "positive", "negative", and "neutral"; that is up to you in your sentiment analysis tools. We only say whether the score is positive or negative. For example, a sentiment of +0.1 is probably neutral, whereas +1.1 is certainly positive. Setting the boundaries of neutral is up to the individual implementation, though we recommend a boundary of -0.2 to +0.2.
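Because the neutral band is left to the implementation, a consuming application might bucket scores along these lines. The function name `classify_sentiment` is hypothetical; the defaults use the recommended -0.2 to +0.2 boundary, and treating the boundary values themselves as neutral is an assumption.

```python
def classify_sentiment(score, neutral_low=-0.2, neutral_high=0.2):
    """Map an entity sentiment score to a coarse label.

    Scores inside [neutral_low, neutral_high] are treated as neutral;
    including the endpoints is an assumption of this sketch.
    """
    if score < neutral_low:
        return "negative"
    if score > neutral_high:
        return "positive"
    return "neutral"


for score in (0.1, 1.1, -0.5):
    print(score, classify_sentiment(score))
```

Widening or narrowing the band only requires passing different `neutral_low` and `neutral_high` values.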