Curious about how our text analytics API measures up? We’ve written a series of articles to demonstrate some of the basic functionality provided by Salience. Today I'll be walking you through the basics of Themes and topic extraction.
Salience Themes are words or phrases that provide context for a body of text. Themes are often discussed in the context of “topic detection” or “topic extraction,” where you don’t have classifiers pre-configured and you want to figure out which concepts in the text are important.
From a technical perspective, themes are “context-scored noun phrases.” A longer document could potentially have hundreds of themes, but with the magic of Salience, you can tell which themes matter most for grasping the meaning of the document. You can think of them almost as hyper-summaries. For a more detailed explanation of how Salience handles themes, see here.
Put differently, Themes indicate common, well, themes, in all of your inputs. Hence the name. These commonalities are invaluable in guiding your decision-making. If “poor packaging” is a common theme, for example, you know to look into the processes used to package your products.
To give you an idea of Salience’s capabilities, I’ve grabbed the text from an NPR.org article titled “Rust Devastates Guatemala’s Prime Coffee Crop and Its Farmers.”
I have copy-pasted the text of the NPR.org article into the Salience/Semantria web demo and run the process. This is what I got back.
As you can see, the demo presents me with a list of themes it detects within the body of text, sorted by strength of association and each assigned a sentiment score.
Pretend for a moment that I haven’t seen the title of the article, and that I know nothing about the text going through the system. Seeing this chart with fresh eyes, I still know the topic and focus almost immediately: coffee production in one or more Central American countries is down by 15 percent, due to the spread of a plant fungus.
Let’s back up and break this down.
The theme with the strongest association is “causing crop losses”, so I know the subject of the input text.
The strongest sentiment Salience finds is with the “coffee output” theme, coming in with a whopping minus-4.32 sentiment. Now I know that coffee output is down due to coffee crop losses, and the strength of the sentiment associated with both themes suggests that output is down by a large amount. The “alarming rate” theme reinforces the severity of the problem.
The theme with the second-strongest sentiment clarifies the point: “15 percent drop” has a sentiment of minus-3.64, so I know exactly how bad this problem is.
From my very limited prior knowledge of coffee plants, I remember that they are primarily grown in tropical climates. Based on that knowledge and the “hot weather” and “shady volcanoes” themes, I estimate the location of this problem to be Central America, a place with both hot weather and volcanoes. (For more on extracting proper nouns, stay tuned for our Named Entity Extraction walkthrough.)
And finally, “plant fungus” conveniently explains the basic cause of the crop loss.
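If you were working with these results in code rather than reading them off the demo, much of the reasoning above comes down to sorting. Here is a minimal Python sketch of that idea. The Theme class and the by_sentiment_strength function are names I made up for illustration; they are not part of the Salience API. Only the two sentiment values quoted above are filled in, since the other scores are not given in this walkthrough.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Theme:
    """Hypothetical container for one theme; not a Salience type."""
    text: str
    sentiment: Optional[float] = None  # None where no value is quoted in the article


# Themes named in the walkthrough, in the order they are discussed.
themes = [
    Theme("causing crop losses"),
    Theme("coffee output", -4.32),
    Theme("alarming rate"),
    Theme("15 percent drop", -3.64),
    Theme("hot weather"),
    Theme("shady volcanoes"),
    Theme("plant fungus"),
]


def by_sentiment_strength(items):
    """Return themes with a known sentiment, largest magnitude first."""
    scored = [t for t in items if t.sentiment is not None]
    return sorted(scored, key=lambda t: abs(t.sentiment), reverse=True)


for theme in by_sentiment_strength(themes):
    print(f"{theme.text}: {theme.sentiment:+.2f}")
```

Sorting on the absolute value is the design choice worth noting: a strongly negative theme such as “coffee output” should rank above a mildly positive one, because it is the strength of the sentiment, not its sign, that tells you where to look first.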