The Ultimate NLP Tuning FAQ: Sentiment, Topics and More

Lexalytics Support Engineer Sarah Williams answers our users’ most frequently asked questions about how to tune our natural language processing features.

Let’s dive in!

Sentiment tuning

Do added (user) sentiment phrases overwrite default scores?

Yes, user-defined sentiment scores will overwrite any default Semantria scores. You may want to add sentiment phrases to correct polarity that you see as ambiguous or incorrect. For example, “pretty” in the context of a “pretty dress” carries positive sentiment. However, “pretty” may also be used as a neutral intensifier, as in “it is a pretty ugly dress.” In that case, you may wish to create a custom sentiment phrase for “pretty” with a sentiment score of 0.0 to overwrite the default positive sentiment.
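To picture the override behavior, here is a minimal sketch. The default scores below are made up for illustration (Semantria does not expose its built-in dictionary), and `effective_scores` is a hypothetical helper, not part of any Lexalytics API.

```python
# Hypothetical phrase scores; the real built-in dictionary is not exposed,
# so these numbers are illustrative only.
DEFAULT_SCORES = {"pretty": 0.3, "ugly": -0.6}
USER_SCORES = {"pretty": 0.0}  # custom phrase: treat "pretty" as neutral


def effective_scores(defaults, overrides):
    """Return phrase scores with user-defined entries taking precedence."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


scores = effective_scores(DEFAULT_SCORES, USER_SCORES)
```

Phrases you do not override keep their default score; only the phrases you add are replaced.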

I don’t see any sentiment phrases in my configuration?

The built-in sentiment dictionary for Semantria does not list or show all its sentiment phrases. If you disagree with some of the scores or need to change a score to better suit your content and use case, then you can add a sentiment phrase.

What scale should I use to score my sentiment phrases?

We recommend that you stay between -1 and +1.

What is the scoring guide?

Tonality               Weight   Example
Completely positive     1       perfect
Mostly positive         0.6     great
Slightly positive       0.3     good
Slightly negative      -0.3     poor
Mostly negative        -0.6     bad
Completely negative    -1       terrible

Can I change the scores to fit my custom scale?

If you’re using Salience, Semantria API, or Semantria Excel then you may map our scores to a scale of your own.
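As a rough illustration, a linear mapping from the recommended -1 to +1 range onto a scale of your own could look like the sketch below. The 1-to-5 target scale, the clamping step, and the `rescale` function name are all assumptions for the example; this is post-processing you would do in your own application.

```python
def rescale(score, new_min=1.0, new_max=5.0, old_min=-1.0, old_max=1.0):
    """Linearly map a score from [old_min, old_max] to [new_min, new_max].

    Scores outside the source range are clamped first (an assumption made
    here because document scores are not capped).
    """
    clamped = max(old_min, min(old_max, score))
    fraction = (clamped - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)
```

With these defaults, a neutral score of 0 lands in the middle of the custom scale, and anything beyond +1 is pinned to the top value.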

Why am I seeing scores above or below *insert value here*?

Sentiment scores are not bound by a cap, either positive or negative. If a document has an excessive number of negative sentiment phrases, then the document’s sentiment score will be overwhelmingly negative to reflect this.

How do sentiment scores get calculated?

Sentiment is represented by a simple projection from a continuous value (-1 to 1) to a discrete set of values (very negative to very positive). The Salience engine identifies values by both the sum and the parts. In other words, emotive language within a document will be identified and scored individually. Then the scores will be combined to produce an overall sentiment score for the document. Take this document as an example: “We were served quickly, but the waiter was rude and the food was cold.” Components of the document bear positive sentiment, but the overall document sentiment is negative.
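The projection from a continuous score to a discrete label can be sketched as follows. The cut-off values (0.1 and 0.6) and the label names are assumptions chosen for illustration; the actual thresholds used by the Salience engine are not stated here.

```python
def polarity_label(score, neutral_band=0.1, strong_band=0.6):
    """Project a continuous sentiment score onto a discrete label.

    The thresholds are hypothetical, not the engine's real cut-offs.
    """
    if score >= strong_band:
        return "very positive"
    if score > neutral_band:
        return "positive"
    if score <= -strong_band:
        return "very negative"
    if score < -neutral_band:
        return "negative"
    return "neutral"
```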

Are sentiment phrases case sensitive out of the box?

No, sentiment phrases are not case sensitive out of the box.

How can I make sentiment scores case sensitive?

Adding ~ before a sentiment phrase makes it case sensitive.

Can you create query-defined sentiment phrases?

Yes. You can do so by navigating to “sentiment” in the Configurations dashboard and using Boolean phrases.

How does negation work?

If a sentiment phrase has a word such as “not,” “never” or something ending in “n’t” before it, the sentiment of the phrase will be reversed. A positive phrase will become negative, while a negative phrase will become positive.

Why is *word or phrase* not being negated?

If something should be negated but is not, the usual cause is that the distance between the negator and the sentiment phrase is too great for the negation pattern to apply.
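To see why distance matters, consider this toy model of negation. It is a sketch, not the engine's actual negation patterns: the three-token window, the negator list, and the function names are assumptions for illustration.

```python
NEGATORS = {"not", "never"}


def is_negator(token):
    """True for 'not', 'never', or a word ending in n't (straight or curly)."""
    token = token.lower()
    return token in NEGATORS or token.endswith("n't") or token.endswith("n’t")


def phrase_score(tokens, index, base_score, window=3):
    """Return base_score, reversed if a negator occurs within `window`
    tokens before the sentiment phrase at position `index`.

    The window size is a hypothetical value.
    """
    start = max(0, index - window)
    negated = any(is_negator(t) for t in tokens[start:index])
    return -base_score if negated else base_score
```

In “the food was not good,” the negator sits right before “good,” so the score is reversed. In “I would not say that the hotel staff were ever good,” the negator is seven tokens away, outside the window, so this sketch leaves the score unchanged.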

Query topics

Are queries case sensitive out of the box?

No.

How do I make queries case sensitive?

To make a keyword of a query case sensitive, add ~ in front of the keyword and add quotes around the keyword. For example, ~“Steve Jobs”

What is the difference between WITH and NEAR?

A WITH operator requires that the two terms occur within the same sentence. A NEAR operator is effectively an AND operator where you can control the distance between the words. Check out this resource on Boolean query syntax to learn more.
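The difference can be approximated in a few lines of code. This is a simplified sketch: the sentence splitter is naive, distance is counted as the difference in word positions, and NEAR is allowed to cross sentence boundaries. All three are assumptions made for the example, not documented engine behavior.

```python
import re


def sentences(text):
    """Naively split text into sentences on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def words(text):
    """Lowercase word tokens (queries are not case sensitive by default)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matches_with(text, a, b):
    """WITH: both terms must occur within the same sentence."""
    return any(a in words(s) and b in words(s) for s in sentences(text))


def matches_near(text, a, b, distance):
    """NEAR: both terms must occur within `distance` word positions."""
    tokens = words(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)
```

For the text “The gate was closed. We could not find it.”, `matches_with(text, "find", "gate")` is false because the terms sit in different sentences, while `matches_near` depends entirely on the distance you allow.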

What is the difference between NOTNEAR, NOTWITH and EXCLUDE?

NOTNEAR is effectively a NOT operator where you can control the distance between the words. A NOTWITH operator requires that the two terms cannot occur within the same sentence.

Two query terms of any type may be joined by an EXCLUDE operator.

Example: York EXCLUDE “New York”

The effect is different from that of the NOT operator. The query will return documents with the word “York,” excluding those that only contain occurrences of “New York.”
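One way to think about EXCLUDE is “match the term wherever it is not part of the excluded phrase.” The sketch below imitates that behavior with plain string handling; `matches_exclude` is a hypothetical helper, not how the query engine is implemented.

```python
import re


def matches_exclude(text, term, excluded_phrase):
    """True if `term` occurs at least once outside `excluded_phrase`.

    Matching is case-insensitive, mirroring the default query behavior.
    """
    lowered = text.lower()
    # Blank out every occurrence of the excluded phrase, then look for
    # any remaining standalone occurrence of the term.
    remainder = lowered.replace(excluded_phrase.lower(), " ")
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, remainder) is not None
```

A document that mentions only “New York” does not match, but a document that mentions both “York” and “New York” still does.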

How are parentheses used in queries?

When in doubt you should use parentheses in queries to clear up confusion. Take the query-defined topic for “airport wayfinding.” Parentheses may be used to structure the underlying Boolean phrases like this:
((find OR search OR “going” OR locate OR navigate OR orient) NEAR/5 (way OR gate OR terminal)) OR sign* OR lost OR placard* OR map OR navigate OR shuttle* OR airtrain* OR “air train*” OR bus

Categories and concept topics

What is the difference between Categories and Concept Topics?

There is no difference! Categories are referred to as Concept Topics in Semantria Excel.

What is the difference between Categories and Queries?

Unlike Queries, Categories are not based on phrase matching. Categories create their relevance scores based on the Concept Matrix. For example, if you include the term “food” in a query the Salience engine will only match content that includes the exact word ‘food.’ If you include the same term in a Category, you will get results for anything mentioning pasta, sandwiches, and soup — even if the document does not mention the term “food” explicitly.

What is the Concept Matrix?

The Concept Matrix utilizes the links between Wikipedia™ articles to develop a matrix of semantic associations between keywords and phrases, forming a comprehensive web of large concepts, the topics and entities that branch off of each concept, and the links between all of these articles.

Think of it this way: one Wikipedia article about a political party may contain dozens of links to other articles, all related to the concept of politics. Each of those articles, in turn, contains links to other pages, and so on. The closer an article is in this chain to the original topic, the more closely related it is to that topic, and the stronger the association is.
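The “closer in the chain” idea can be illustrated with a breadth-first search over a tiny, made-up link graph. The article names and links below are invented for the example, and the Concept Matrix itself is far richer than a hop count; this only shows how link distance can stand in for strength of association.

```python
from collections import deque

# Hypothetical link graph: article -> articles it links to.
LINKS = {
    "Political party": ["Election", "Parliament"],
    "Election": ["Ballot"],
    "Parliament": ["Legislation"],
    "Ballot": [],
    "Legislation": [],
}


def link_distance(graph, start, target):
    """Return the fewest link hops from start to target, or None if
    the target cannot be reached."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None
```

In this toy graph, “Election” is one hop from “Political party” and “Ballot” is two, so “Election” would be treated as the more closely related concept.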

Best practices for Categories?

Although you can create long, complicated concept queries, we do not recommend the practice. Besides being an issue for maintenance, results generally degrade with complexity. A better solution is often to break very broad topics into smaller subtopics, and map these subtopics together in a consuming application. That is, rather than having a single category for all of ‘Business,’ you may have better results pulling in ‘human resources,’ ‘regulatory compliance,’ etc. separately.

What is the difference between Categories and Auto-Categories?

Auto-Categories have subcategories, while user-defined Categories do not. Also, Auto-Categories are not part of a taxonomy, whereas Categories can be.

What is a Boost Value in the Configurations tab?

Boost value is a multiplier that is applied to the relevance score coming out of the Salience engine for Categories. This is helpful because sometimes, no matter what you do, a category always returns a low relevance score. Since we encourage our users to set a relevance threshold to weed out irrelevant topics, we developed a way to boost Categories. This helps customers keep a sane relevancy threshold while helping Categories that always score low make it over this threshold.
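In code terms, the idea is simply “multiply, then compare.” The relevance, boost, and threshold numbers below are hypothetical, and `passes_threshold` is an illustrative helper rather than an API call.

```python
def passes_threshold(relevance, boost=1.0, threshold=0.5):
    """Apply the boost multiplier to a Category's relevance score and
    check the result against the relevance threshold."""
    return relevance * boost >= threshold
```

A Category that consistently scores 0.3 would fall below a 0.5 threshold on its own, but clears it with a boost of 2.0, so the threshold can stay where it is for every other Category.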

What is the CONTEXT operator in Categories?

Whereas the NOT operator excludes certain ideas implied by a query, CONTEXT highlights those ideas. Imagine you are analyzing documents related to automobile manufacturing. The query “automobile, manufacturing” is likely to match many concepts relevant to automobile manufacturing. Nonetheless, this may also pull in articles about general manufacturing. The alternate query ‘automobile_manufacturing’ is highly specific, possibly too much so.

Creating the query “Automobile CONTEXT manufacturing” will result in a search for automobiles in general, with a focus on manufacturing. It will not return results about general manufacturing.

Specifically, the text to the left of CONTEXT supplies the general idea being searched for, and the text to the right supplies the ideas you want that topic to be discussed with.

How do I tune Categories?

Start by measuring two things:

Precision: number of correct matches ÷ (number of correct matches + number of incorrect matches)

Recall: number of correct matches ÷ (number of correct matches + number of matches missed)

An ad hoc analysis of results is sufficient during rapid development of Categories and concept queries.

However, if the aim is optimal performance and robust results, formal testing is encouraged. You may use sound statistical data to guide decisions. Consider annotating a selection of documents, specifying the categories with which you wish to match (for example: automobile manufacturing). You may then calculate precision and recall for each query.
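Given a set of annotated documents, precision and recall for one category can be computed as sketched below. The document IDs in the usage note are made up, and `precision_recall` is an illustrative helper you would write yourself.

```python
def precision_recall(expected, returned):
    """Compute precision and recall for one category.

    expected: IDs of documents annotated as belonging to the category.
    returned: IDs of documents the category actually matched.
    """
    expected, returned = set(expected), set(returned)
    correct = len(expected & returned)
    # correct + incorrect matches = everything returned
    precision = correct / len(returned) if returned else 0.0
    # correct matches + missed matches = everything expected
    recall = correct / len(expected) if expected else 0.0
    return precision, recall
```

For example, if four documents are annotated as “automobile manufacturing” and the category returns three documents, two of them correct, precision is 2 ÷ 3 and recall is 2 ÷ 4.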

With these annotations you can make small changes to category definitions and weightings and see whether your results improve. You may also identify any concepts with which the Concept Matrix has difficulty and address those queries separately. To do this, break the topic into specific subtopics or simply fall back to more traditional query-based topics.

Themes

Why can’t I tune Themes?

Themes are noun phrases based on non-proper nouns. Tuning themes requires the user to understand grammatical structures and PoS (part of speech) patterns. The average Semantria user will not benefit from manipulating these patterns. Further, there is no whitelisting of Themes; you have to be able to capture phrases you want with the Lexalytics PoS tagger.

How does Semantria decide a Theme exists?

By matching the PoS patterns defined in the rules.ptn file in the Themes directory in Salience. The rules.ptn file is not exposed to users in Semantria. There are five different patterns that will match a non-proper noun phrase as a theme.

Do Entity Themes depend on the existence of Themes?

Yes. If no Theme exists, there can be no Entity Theme.

An Entity Theme is simply a Theme that relates to an extracted Entity, as determined by the Salience engine.

What are sentiment scores of Themes? What do they indicate?

Like topics, entities, phrases, and Categories, Themes bear sentiment weight (positive, negative, neutral). When the Salience engine identifies a Theme in a text document, it will determine whether or not that Theme carries sentiment. This tells the user whether the Theme is being discussed in a positive, negative, or neutral way.

Named Entities

Are Entities case sensitive out of the box?

No, entities are not case sensitive out of the box.

What are Query Defined Entities?

Query Defined Entities are a way of involving the Lexalytics Filtering Engine in Entity discovery. Context can be important in identifying an Entity. Famous people share names. Products can be named using common words which show up in different contexts. In these situations simply matching the Entity text is not enough: you must also ensure the context is the expected one. While confidence flags can handle simple binary situations, Query Defined Entities provide the ability to create very finely tuned Entity rules.

Unlike regular entities, Query Defined Entities can overlap. This can be useful for entities that cross type boundaries (Boston Red Sox as a sports team and as a mention of a city), or for creating different rules for an Entity depending on context. For example, you might create two separate normalizations for Arnold Schwarzenegger. Using Query Defined Entities, you can divide the data between discussions of his political career and his movie career. An article that discusses both will match both contexts and return both forms of the Entity.

How do I define an Entity with a query?

Creating Query Defined Entities in Configurations is simple. In the “Entity” field of the object, simply start your query with a ‘+’ symbol (e.g. +Apple Inc.).

What is a Label?

A Label field returned in the API output will give more information about the Entity, such as a link to a Wikipedia page. This is useful if you wish to better understand how Semantria arrived at its decision.

What is a Type?

A Type field, usually used to differentiate types of entities from each other, might contain information such as Company, Person, Product and so on.

What does Normalized Value mean?

You can use Normalized Value to normalize different forms of an Entity to the same value. For instance, one Entity might be Coke, and another Coca-Cola. You may want to normalize both of these to “Coke”. If you enter the same normalized value for each Entity, then they will appear in the output as the same thing.
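The effect on your output is easy to picture with a small lookup table. This sketch imitates the result of normalization on the consuming side; the mapping and the `normalize` helper are illustrative assumptions, not the configuration format itself.

```python
from collections import Counter

# Hypothetical mapping from entity forms (lowercased) to a normalized value.
NORMALIZED = {"coke": "Coke", "coca-cola": "Coke"}


def normalize(entity):
    """Return the normalized value for an entity, or the entity unchanged
    if no normalization is defined for it."""
    return NORMALIZED.get(entity.lower(), entity)


mentions = ["Coke", "Coca-Cola", "Pepsi", "coke"]
counts = Counter(normalize(m) for m in mentions)
```

After normalization, the three Coke and Coca-Cola mentions are counted together under “Coke,” while “Pepsi” is left as it is.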

Taxonomies

What is the purpose of Taxonomies?

The purpose of Lexalytics Taxonomies is to allow for hierarchical categorization that isn’t quite available in any individual feature. Taxonomies allow the user to implement cross-feature hierarchies, as both Categories and Topic queries can be assigned to a node in a taxonomy.

What goes in a Taxonomy?

The user creates their own node hierarchy and assigns Topic queries and/or Categories to any node they want. These Topic queries and Categories are referred to as “leaves.” A Taxonomic node may look like this: [Hotel Staff] → [Concierge] → [Concierge Attitude] → (Concierge Attitude) where [] represents a node and () represents a leaf.
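The node-and-leaf structure above can be modeled as a small tree. The dictionary layout and the `leaf_paths` helper are assumptions for illustration, not the format Semantria uses to store taxonomies.

```python
# Each node has a name, the leaves (Topic queries or Categories) assigned
# to it, and its child nodes.
TAXONOMY = {
    "name": "Hotel Staff",
    "leaves": [],
    "children": [
        {
            "name": "Concierge",
            "leaves": [],
            "children": [
                {
                    "name": "Concierge Attitude",
                    "leaves": ["Concierge Attitude"],
                    "children": [],
                }
            ],
        }
    ],
}


def leaf_paths(node, prefix=()):
    """Yield (node path, leaf) pairs for every leaf in the taxonomy."""
    path = prefix + (node["name"],)
    for leaf in node["leaves"]:
        yield path, leaf
    for child in node["children"]:
        yield from leaf_paths(child, path)
```

Walking the tree this way makes it easy to roll matches on a leaf up to every node above it in a consuming application.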

What are some Taxonomy best practices?

Always keep in mind the concept of precision and recall when creating node definitions. It’s best to not make the node definitions too restrictive. If this happens, the node will scarcely get any matches, rendering it statistically insignificant. On the other hand, it’s also best to avoid making node definitions too wide, as this may bury any signal in noise.
