Salience has three methods of categorizing text: query topics, concept topics, and document categories. I’ll be going through all three using a New York Times article called “China Says Former Security Chief Is Under Investigation”.
Query topics are queries built from standard Boolean-logic operators: Lexalytics provides AND, OR, NOT, and NEAR (qualified with a distance). Several sample query topics ship standard with the product and come preconfigured in the demo.
For example, the “Environment” topic included with the Salience 5.2 demo application incorporates the tags: environment OR “green research” OR “global warming” OR contamination OR pollution.
In Salience, each relevant query topic is presented with the number of hits it accrues and the overall sentiment associated with the topic.
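To make the idea of an OR query and its hit count concrete, here is a minimal Python sketch. It is not the Salience query engine: the function names `count_hits` and `near` are hypothetical, the matching is plain case-insensitive whole-word search, and `near` only handles single words. The term list is taken from the “Environment” query above.

```python
import re

# Terms from the "Environment" query quoted above; phrases match as whole units.
ENVIRONMENT_TERMS = [
    "environment",
    "green research",
    "global warming",
    "contamination",
    "pollution",
]

def count_hits(text, terms):
    """Count case-insensitive whole-word occurrences of any term (an OR query)."""
    total = 0
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total

def near(text, a, b, distance):
    """True if single words a and b occur within `distance` tokens of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

sample = "Pollution and global warming dominate the environment debate."
print(count_hits(sample, ENVIRONMENT_TERMS))     # three terms match once each
print(near(sample, "pollution", "warming", 3))   # the words are three tokens apart
```

Even in this toy form, the trade-off is visible: the query is exact and cheap to evaluate, but it only finds what the term list anticipates.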
Query topics function well as agents of surgical precision: if you have a limited set of keywords, and you know exactly what those keywords are, they’re quick to set up and easy to use.
That said, query topics have one major inherent flaw: unless you’re looking for a small set of very specific keywords, the time required to customize your query topic can be considerable.
Imagine you’re searching for items related to the topic of “Food”. How many individual keywords would you need to add to ensure you’re gathering all relevant hits for food? Hundreds? Thousands? Millions?
Humans have a lot of words to define and describe food.
Enter the broad strokes of concept topics.
Concept topics utilize the Lexalytics concept matrix, generated from all of the content on Wikipedia, to categorize content.
Concept topics work well for high-level categories like Aviation, Food, and Business, each of which has a definition made up of words that associate with the topic.
Concept topics are, by definition, very broad. This means that they’re quick to set up and function as very effective first-line categorization tools.
For each concept topic associated with your input text, Salience presents an association score and a sentiment score. In addition, the “explain” API call shows which keyphrases matched the document and why.
Here you can see that the news article is associated with two concept topics, Politics and Law. Within the Politics topic, which has an association score of 0.583, you can see how the text stacks up with each of the keywords inherent in the Politics concept topic.
Concept topics work because the Concept Matrix understands how words from your input text associate with each concept topic category.
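The real Concept Matrix and its scoring are internal to Salience, so the following is only a toy illustration of the general idea: words in the text carry some association weight toward each topic, and those weights combine into a per-topic score with an explanation of what matched. The weights in `TOY_MATRIX`, the averaging rule, and the function name `association_scores` are all assumptions made for this sketch.

```python
import re

# Hypothetical, hand-written weights. The real Concept Matrix is generated
# from Wikipedia and is far larger; these numbers are invented.
TOY_MATRIX = {
    "Politics": {"party": 0.9, "government": 0.8, "official": 0.7, "investigation": 0.3},
    "Law": {"court": 0.9, "investigation": 0.8, "corruption": 0.7, "official": 0.2},
}

def association_scores(text, matrix):
    """Return {topic: (score, matched)}.

    score is the mean weight of the topic keywords found in the text
    (0.0 if none are found); matched maps each found keyword to its weight,
    loosely mimicking an "explain"-style breakdown.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    results = {}
    for topic, weights in matrix.items():
        matched = {word: w for word, w in weights.items() if word in tokens}
        score = sum(matched.values()) / len(matched) if matched else 0.0
        results[topic] = (score, matched)
    return results

text = "The party announced a corruption investigation into a former official."
for topic, (score, matched) in association_scores(text, TOY_MATRIX).items():
    print(topic, round(score, 3), sorted(matched))
```

The point of the sketch is that no keyword list has to be written per document: the association weights do the matching, which is why a concept topic can work with little setup.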
Concept topics require minimal time and effort to create and function immediately. That said, many of our customers find great success with small amounts of customization.
Document categories classify the entire input document into a set of 4,000 categories that are then rolled into 125 higher-level categories.
That’s right, Salience comes boxed with over four thousand pre-configured document categories.
Document categories are much like concept topics; in fact, they’re concept topics that are auto-generated based on the Wikipedia taxonomy. You have full control over them with respect to manipulating the taxonomy and adding/editing/deleting nodes.
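As a rough picture of how fine-grained categories can roll up into higher-level ones, here is a small Python sketch. The category names, the `PARENT` mapping, and the choice to sum scores are all hypothetical; the shipped taxonomy and its roll-up logic are not described in this article.

```python
# Hypothetical child -> parent links, invented for illustration.
PARENT = {
    "Chinese politics": "Politics",
    "Political corruption": "Politics",
    "Criminal law": "Law",
}

def roll_up(fine_scores, parent):
    """Sum fine-grained category scores into their higher-level parents."""
    rolled = {}
    for category, score in fine_scores.items():
        top = parent.get(category, category)  # unmapped categories stay as-is
        rolled[top] = rolled.get(top, 0.0) + score
    return rolled

fine = {"Chinese politics": 0.5, "Political corruption": 0.25, "Criminal law": 0.25}
print(roll_up(fine, PARENT))
```

Editing the taxonomy in this picture amounts to editing the mapping: moving a node means pointing it at a different parent, and adding or deleting a node means adding or removing an entry.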
Automatic document categories are a great way to get a fast-pass look at the content.
Think of document categories as a third, even higher-level categorization system above query topics and concept topics:
- Query topics require precise customization, and serve as highly targeted search and category functions
- Concept topics require less precise customization, but still benefit from some tailoring in order to see the best results, and serve as broad categories
- Document categories are wide-ranging and make a good place to start, and they still support hierarchical customization
A pharmaceutical company, for instance, may gather documents that fit into any “healthcare” or “pharmaceutical” document category, focus on the remaining documents that match concept topics such as “drugs” or “medicine”, and then use query topics to hunt down specific items, such as mentions of a specific drug or study.
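The pharmaceutical workflow above is a funnel that runs from the coarsest filter to the finest. The Python sketch below assumes each document already carries its labels in hypothetical `categories`, `concepts`, and `text` fields; the `triage` function and the sample documents are invented for illustration and do not call Salience.

```python
def triage(docs, in_category, matches_concept, matches_query):
    """Apply three filters from coarsest to finest.

    Returns the documents surviving each stage, so the narrowing is visible.
    """
    broad = [d for d in docs if in_category(d)]
    narrower = [d for d in broad if matches_concept(d)]
    targeted = [d for d in narrower if matches_query(d)]
    return broad, narrower, targeted

# Hypothetical documents with labels already attached by earlier analysis.
docs = [
    {"id": 1, "categories": {"Healthcare"}, "concepts": {"medicine"},
     "text": "New aspirin study published."},
    {"id": 2, "categories": {"Healthcare"}, "concepts": {"insurance"},
     "text": "Premiums rise again."},
    {"id": 3, "categories": {"Sports"}, "concepts": {"medicine"},
     "text": "Team doctor cleared the striker."},
]

broad, narrower, targeted = triage(
    docs,
    in_category=lambda d: bool(d["categories"] & {"Healthcare", "Pharmaceutical"}),
    matches_concept=lambda d: bool(d["concepts"] & {"drugs", "medicine"}),
    matches_query=lambda d: "aspirin" in d["text"].lower(),
)
print([d["id"] for d in broad], [d["id"] for d in narrower], [d["id"] for d in targeted])
```

Running the cheap, broad filter first keeps the precise query stage small, which mirrors the order suggested in the example.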