Tagging, Taxonomies, Categorization with Salience


The world is your oyster... And if your world is data, Salience is your pearl. One of the things that makes this engine so great is its ability to categorize content automatically. It can tag high volumes of unstructured text, making it super easy to filter and sort through these documents. Here's a simple "how to" in 3 steps.

Step 1: Discovery

There are a few different discovery tools in Salience that will reveal what the content is about in your text.

One of these features is called Themes. Whenever a noun phrase appears in a piece of text, Salience extracts it. For example, “awesome people”, “rude staff”, “tasty food”, “fresh lobster”; you can read through all of these themes and get an idea of what your data is about.

Another way to see what your data is about is to run a collection analysis. You will find a list of facets, which are keywords that occur frequently throughout the entire data set. For example, the facet “staff” appeared 9,876 times in 100,000 tweets, the word “service” appeared 11,387 times, etc.

We can take this one step further and look at the attributes, or words that describe the facets. For instance, when we look at the facet “staff”, we see the following attributes, “helpful”, “awesome”, “rude”, “unpleasant”, etc. All of these attributes are directly associated with the word “staff”. If you were a manager looking at this data, you may be concerned that the words “rude” & “unpleasant” are associated with your employees.
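To make the facet and attribute idea concrete, here is a minimal Python sketch. It is not the Salience API; the function names, the tiny sample documents, and the hand-picked FACETS and DESCRIPTORS sets are all assumptions made for illustration. It counts how often each facet appears and which descriptive word sits directly in front of it.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def facet_counts(documents, facets):
    """Count how often each facet keyword appears across all documents."""
    counts = Counter()
    for doc in documents:
        for token in tokenize(doc):
            if token in facets:
                counts[token] += 1
    return counts


def facet_attributes(documents, facets, descriptors):
    """Collect descriptive words that appear directly before a facet."""
    attrs = defaultdict(Counter)
    for doc in documents:
        tokens = tokenize(doc)
        for prev, cur in zip(tokens, tokens[1:]):
            if cur in facets and prev in descriptors:
                attrs[cur][prev] += 1
    return attrs


# Hypothetical sample data, chosen only for this illustration.
docs = [
    "The rude staff ignored us, but the food was tasty.",
    "Helpful staff and quick service!",
    "Unpleasant staff, slow service.",
]
FACETS = {"staff", "service"}
DESCRIPTORS = {"rude", "helpful", "unpleasant", "quick", "slow"}

counts = facet_counts(docs, FACETS)
attributes = facet_attributes(docs, FACETS, DESCRIPTORS)
print(counts)
print(dict(attributes["staff"]))
```

A real engine discovers facets and attributes on its own from the grammar of each sentence; the sketch only shows the shape of the result a manager would look at.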

You might think, “What other negative words are being said about my staff?” The easy answer to this question is to build a concept topic.

Step 2: Building a Concept Topic

We will cover two different ways to build categories in Salience. The first is called concept topics, which automatically tag content based on an ontology built from Wikipedia’s semantic knowledge. All you have to do is give it a few category samples, which are keywords related to your concept topic. For example, if you wanted to create a concept topic called “cats”, you would give the category samples “cat”, “feline”, “lion”, & “tiger”.

Salience will use the relationship between the category samples to tag your data. So every time the word “lion” pops up in your data, that entry will be categorized as “cats”. Every time the word “cheetah” appears, Salience will know that this animal belongs to the cat family, and will tag the document as “cats”. This method of categorization is awesome because you do not need to list every single member of the cat family to create this category.
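The idea can be sketched in a few lines of Python. This is a toy illustration only: the RELATED_TERMS dictionary is a hypothetical, hand-written stand-in for the knowledge Salience derives from Wikipedia, and the function names are invented for this example rather than taken from the Salience API.

```python
import re

# Hypothetical stand-in for a Wikipedia-derived ontology:
# each term maps to a handful of terms it is related to.
RELATED_TERMS = {
    "cat": {"feline", "kitten", "cheetah", "leopard"},
    "feline": {"cat", "lion", "tiger", "cheetah"},
    "lion": {"feline", "tiger", "leopard"},
    "tiger": {"feline", "lion", "jaguar"},
}


def expand_samples(samples):
    """Grow the category samples with every term related to them."""
    expanded = set(samples)
    for sample in samples:
        expanded |= RELATED_TERMS.get(sample, set())
    return expanded


def tag_concept(text, topic_name, samples):
    """Return topic_name if the text mentions a sample or a related term."""
    vocabulary = expand_samples(samples)
    for token in re.findall(r"[a-z]+", text.lower()):
        # Crude plural handling: also try the token without a final "s".
        singular = token[:-1] if token.endswith("s") else token
        if token in vocabulary or singular in vocabulary:
            return topic_name
    return None


samples = ["cat", "feline", "lion", "tiger"]
print(tag_concept("We saw two cheetahs on safari", "cats", samples))
print(tag_concept("The dog barked all night", "cats", samples))
```

Note that “cheetah” is never listed as a sample; it is picked up only because the toy ontology relates it to “cat” and “feline”, which is the behavior described above.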

Step 3: Building a Query Topic

Building a query topic is the second way to classify data with Salience. This method is based on Boolean logic that’s very similar to Structured Query Language (SQL). It is more accurate than concept topics, but requires a bit more work. Let’s say we wanted to design a query topic for “cats”. We would need to know all of the cat-related terms that might appear in the data set. Here is an example of a “cats” query:

(cat* OR feline* OR meow* OR kitten* OR kitty OR puss* OR cheetah* OR leopard* OR tiger* OR lion* OR lynx* OR cougar* OR panther* OR jaguar* OR ocelot* OR puma*)

*If you are unfamiliar with this Boolean logic, check out our technical info on query topics.
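If it helps to see the matching logic spelled out, here is a rough Python sketch of how an OR query with wildcards could be evaluated. It is an approximation, not Salience's implementation: it assumes that `*` matches any run of trailing characters within a single word, and the names CAT_TERMS and matches_any are made up for this example.

```python
import re
from fnmatch import fnmatchcase

# The terms from the "cats" query above, one pattern per OR branch.
CAT_TERMS = [
    "cat*", "feline*", "meow*", "kitten*", "kitty", "puss*",
    "cheetah*", "leopard*", "tiger*", "lion*", "lynx*", "cougar*",
    "panther*", "jaguar*", "ocelot*", "puma*",
]


def matches_any(text, patterns):
    """Return True if any word in the text matches any wildcard pattern."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(fnmatchcase(token, pattern)
               for token in tokens
               for pattern in patterns)


print(matches_any("My kitty purrs all day", CAT_TERMS))
print(matches_any("The dog barked all night", CAT_TERMS))
print(matches_any("Please check the catalog", CAT_TERMS))
```

The last call is a reminder to use wildcards with care: in this sketch “cat*” also matches “catalog”, so broad patterns can pull in documents that have nothing to do with cats.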

Here are a couple of best practices for writing queries. First, check out the Wikipedia page for the name of your query topic. You just may find some words associated with your query topic that you haven't thought of yet. For example, before writing the "cats" query, I typed the word "cats" into Wikipedia and found the word “meow” in the description. Had I not done this, I would have left out a word that is strongly associated with cats.

Second, punch the name of the query topic into a thesaurus and you will find other words that are synonymous with it. For example, when I entered the word "cats" into thesaurus.com, I realized that I had left out the word "pussy", which is a synonym for a cat.

So why would we build a query instead of simply using concept topics? The best way to answer that question is to give another example. Let’s say that we wanted to tag all data about albino cats. We don’t care about regular cats. We want the white, albino ones. Notice how I used the exact same "cats" query topic and simply added to it:

((cat* OR feline* OR meow* OR kitten* OR kitty OR puss* OR cheetah* OR leopard* OR tiger* OR lion* OR lynx* OR cougar* OR panther* OR jaguar* OR ocelot* OR puma*) NEAR3 (white OR albino))

Now Salience will only tag the entries that mention the first set of words (cat, feline, meow, etc.) if they are within 3 words of the second set of words (white or albino). So the sentences "Look, albino cheetahs!" and "Those cats were white!?" would both get tagged by our "albino cats" query topic.
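The proximity check can be sketched in Python as well. Again, this is an illustration under stated assumptions rather than Salience's own code: “within 3 words” is interpreted here as a difference of at most 3 in word position, the term list is abbreviated, and the names CATS, COLORS, positions, and near are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# Abbreviated version of the first set of terms, plus the second set.
CATS = ["cat*", "feline*", "kitten*", "cheetah*", "tiger*", "lion*"]
COLORS = ["white", "albino"]


def positions(tokens, patterns):
    """Return the indexes of the tokens that match any wildcard pattern."""
    return [index for index, token in enumerate(tokens)
            if any(fnmatchcase(token, pattern) for pattern in patterns)]


def near(text, left, right, distance=3):
    """True if a 'left' match and a 'right' match are at most
    `distance` word positions apart (assumed meaning of NEAR3)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    left_hits = positions(tokens, left)
    right_hits = positions(tokens, right)
    return any(abs(i - j) <= distance
               for i in left_hits
               for j in right_hits)


print(near("Look, albino cheetahs!", CATS, COLORS))
print(near("Those cats were white!?", CATS, COLORS))
print(near("The cat slept all day on the old white sofa", CATS, COLORS))
```

In the third sentence both sets of words appear, but “cat” and “white” are seven positions apart, so a plain AND would tag it while the proximity check does not. That gap is the reason to reach for NEAR over a simple AND.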

I know this example may seem a little far from any typical use case, so let's put it back into context. If you are in charge of the customer service department, you could build a “poor service” query topic that tags content mentioning your staff only when they are described as “rude” or “unprofessional”.

Bonus Step: Take action!

Now that your data set is structured, it is easy to find insight and share it with your organization in the form of aesthetically pleasing charts or graphs. Everyone will see precisely what your organization is doing right or wrong. Keep doing the good things, correct the bad things. You get the picture.

For more info on the categorization features of Salience, check out the documentation page.
