Categories/Tags/Topics/Classification

Overview:

Categorizing content is a common practice on popular, highly trafficked websites. Sites like Google News, The Wall Street Journal, and TripAdvisor all use various categorization techniques to segment their content.

For example, given an op-ed article titled "Who I Think Will Win the California Mid-Term Congressional Elections", you will want to classify or associate it with a topic called "Politics".

There are many different ways of sorting content into categories, and the method that works best depends on your content and your goals. Lexalytics has therefore equipped its opinion mining and text analytics software with three methods from which you can choose: two that we expose for general use, and one that you will need to contact us in order to use.

On this page you can find descriptions of each method, as well as a discussion of the pros and cons of each compared to the others.

Details:

Query Topics

Query topics are simple to understand: they are Boolean queries that, if matched, will associate a piece of content with a particular topic. To correctly classify the above op-ed title, you might have a query topic that looks like this:

Politics -- elect* OR congress* OR (president NOT ceo) OR senate* OR representative*

In this case, the query topic is "Politics", and the tags you are searching for are "elect", "congress", "president", "senate", and "representative".

Query topics are clear and simple to use. They are surgical and completely transparent, but work only on strings and not on "what you meant" -- in the above example, "president NOT ceo" is included in an attempt to prevent any mentions of business news from contaminating political topics. Without the clarifier, the computer wouldn't know if "president" referred to the governmental position or an executive of a company.
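The behavior described above can be sketched in a few lines of Python. This is only an illustration, not the Salience query engine: the helper names (tokens, has, politics) are hypothetical, the query is translated by hand rather than parsed, a trailing * is assumed to mean a prefix match, and "president NOT ceo" is assumed to mean "president" appears and "ceo" does not appear anywhere in the text.

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def has(words, pattern):
    """True if any token matches; a trailing * is treated as a prefix match (assumption)."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return any(w.startswith(prefix) for w in words)
    return pattern in words

def politics(text):
    """Hand-translated form of the Politics query topic shown above."""
    w = tokens(text)
    return (
        has(w, "elect*")
        or has(w, "congress*")
        or (has(w, "president") and not has(w, "ceo"))
        or has(w, "senate*")
        or has(w, "representative*")
    )

title = "Who I think Will Win the California Mid-Term Congressional Elections"
print(politics(title))                                             # True
print(politics("Our president and CEO announced record profits"))  # False
print(politics("Electric cars are selling well"))                  # True
```

The last call shows the "strings, not meaning" limitation: under the prefix-match assumption, "Electric" satisfies elect* even though the sentence has nothing to do with politics.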

Lexalytics provides the following Boolean operators for query topics: AND, OR, NOT, and NEAR. NEAR must be qualified with a distance: for example, "Aruba NEAR25 wireless" would only count as a hit if "Aruba" was within 25 words of the word "wireless".
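As a rough sketch of what a distance-qualified NEAR check involves, the Python below compares word positions. The function name near is hypothetical, and whether the engine counts the distance inclusively, or ignores case, is an assumption made here for illustration.

```python
import re

def near(text, first, second, distance):
    """True if `first` and `second` occur within `distance` words of each other."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    # Compare every pair of occurrences; the bound is inclusive (assumption).
    return any(
        abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )

sentence = "Aruba announced a new wireless controller"
print(near(sentence, "Aruba", "wireless", 25))  # True: the words are 4 positions apart
print(near(sentence, "Aruba", "wireless", 3))   # False: 4 is more than 3
```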

Query topics work well if you know exactly the set of words that you want to look for. For example, if you were looking for all occurrences of the word “iPhone”, a query topic would be the method of choice. However, sometimes the words you are looking for may have different meanings in different contexts, or it just isn't clear what words you want.

Query topics ship with every version of the Salience engine.

Model-Based Classifiers

Model-based classification works by collecting a significant set of training content for each of your categories, and then allowing the classification model to pick the words that are most statistically significant for that particular "bucket". This is a fairly simple process for the user, but there can be varying levels of customization involved. 

Model-based classification works best on medium-length content (around a page in length), and when you are trying to classify the content into relatively wide, well-separated buckets.

Say you have gathered a body of content about diseases and wish to classify it. Each individual disease appears in only a few documents, so the model relies on other words; it turns out that the word "with" is a strong indicator of content about diseases (as in being diagnosed "with" something). A human putting together a query topic for diseases would not have made this connection; this counter-intuitive result shows the strength of models.
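To make the idea of "statistically significant words" concrete, here is a minimal sketch that scores each word by a smoothed log-odds ratio between a target category and everything else. It is not the Lexalytics model builder; the function name word_scores, the add-one smoothing, and the six toy documents are all made up for illustration.

```python
import math
from collections import Counter

def word_scores(target_docs, other_docs):
    """Smoothed log-odds of each word occurring in target docs vs. other docs."""
    target = Counter(w for d in target_docs for w in d.lower().split())
    other = Counter(w for d in other_docs for w in d.lower().split())
    vocab = set(target) | set(other)
    # Add-one smoothing so unseen words do not produce log(0).
    target_total = sum(target.values()) + len(vocab)
    other_total = sum(other.values()) + len(vocab)
    return {
        w: math.log((target[w] + 1) / target_total)
        - math.log((other[w] + 1) / other_total)
        for w in vocab
    }

# Made-up toy documents, for illustration only.
disease_docs = [
    "she was diagnosed with asthma",
    "he was diagnosed with diabetes",
    "patients with lupus reported fatigue",
]
other_docs = [
    "the team won the game",
    "the market closed higher today",
    "she bought a new car",
]

scores = word_scores(disease_docs, other_docs)
for word in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(word, round(scores[word], 2))
```

In this toy data each disease name appears once, while "with" appears in every disease document, so "with" receives the highest score. A real classifier would need far more training content per category.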

Models can work well with enough tweaking and in the right situations, but they are not always appropriate. Because of this, Lexalytics does not expose our model-building capabilities by default. If you wish to build statistical classifiers, please contact us directly; we can advise you on whether or not they are going to work, and help you to avoid wasting your time.

Concept Topics

Different kinds of text-mining categorizers take large amounts of work to set up and can be inherently brittle.

For a query-based categorizer with 100 "buckets", it can take 50 hours or more just to get it up and running -- and once established, you will need to continually tweak your category definitions as pieces of rogue content get misclassified.

Lexalytics Concept Topics are designed to reduce the burden of this content analysis configuration through the use of our new Concept Matrix, generated from all of the content in Wikipedia™. 

The Salience Five release package ships with a number of example Concept Topics; below are two examples. The words listed next to "Agriculture" and "Food" are the entire definition of each Concept Topic.

  • Agriculture -- farming, agriculture, farmer
  • Food -- food, meals, vegetable, meat, fruit

The following sentences show how well different pieces of text match each of these two concept topics:

 

Sentence                  Food        Agriculture
I like chicken.           0.58        No match
I like chickens.          No match    0.71
I like to eat chicken.    0.59        0.51

Here are some other examples from the Salience Five release:

  • Aviation -- aviation, airplane, flying   
  • Banking -- banking, bank, mortgage, checking, savings   
  • Beverages -- beverage, alcohol, soda   
  • Biotechnology -- biotech, biotechnology, applied_biology, gene_therapy, genetic_engineering  
  • Business -- business, management, executive, company, shareholder, mba   
  • Crime -- crime, murder, arrested, theft, burglary, criminal, arraignment   
  • Disasters -- disaster, tornado, earthquake, volcano, meteor, apocalypse, explosion, devastation   
  • Economics -- economics, economist, GDP, game_theory, demand_curve

So, the sentence "American Airlines had to announce a gate change." correctly categorizes to Aviation, even though the word "airline" doesn't occur anywhere in the Aviation definition.

Concept Topics are revolutionizing categorization and the semantic analysis process. 

Concept Topics vs. Query-Based Categorizers

To see how Concept Topics work and why they are so important, take a look at a very simple example. We will build a simple categorizer for a single bucket (travel), then try to correctly classify the following review from TripAdvisor:

We were on the ship by 11:45. About 10 minutes after my VIP parents. Went to Lido deck for lunch. It was hard to find a table for ten or two table near each other so we went to the back of the ship near the pizza. The ship was full and you could tell. Very crowded. We got to our room at 1:30 met Edguardo our room steward. Loved him got everyone's name but had a hard time with mine and called me misses all week. The room was the same as on the Splendor which we did two years ago. 4 of us had plenty of room and loved the balcony. We had early seating in Washington Dining room 3rd floor in the middle so no view for us. I wanted to do anytime dining but with ten it wouldn't of worked. The boys and us went to club sign up. My 14 year old was going to be 15 in September and I wanted him moved to club O2 they said to write it on the form. He was able to switch no problem. Club started at 9:00pm. They had a great time all week and came home 12:30-1:00 every night.

 

                  Query-based method                                                Concept Topics
Definition        Motel OR hotel OR show OR resort OR pool OR travel OR vacation    Travel, Tourism
Time to define    ~ 5 minutes                                                       ~ 10 seconds
Succeed / Fail    Fail                                                              Succeed

Of course it would be easy to modify the basic categorization query to make it hit for the sample review -- however, you would have to test and refine this definition for quite a while before the query for travel was broad enough to achieve more hits than misses.
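The failure of the query-based definition is easy to reproduce. The sketch below checks the seven terms of the travel query against a few sentences quoted from the review; the names QUERY_TERMS, REVIEW_EXCERPT, and matches are hypothetical, and whole-word, case-insensitive matching is an assumption rather than a description of how Salience evaluates queries.

```python
import re

# Terms from the query-based travel definition.
QUERY_TERMS = ["motel", "hotel", "show", "resort", "pool", "travel", "vacation"]

# A few sentences quoted from the TripAdvisor review.
REVIEW_EXCERPT = (
    "We were on the ship by 11:45. Went to Lido deck for lunch. "
    "The ship was full and you could tell. "
    "We got to our room at 1:30 met Edguardo our room steward. "
    "4 of us had plenty of room and loved the balcony."
)

def matches(text, terms):
    """Return the query terms that appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & set(terms))

print(matches(REVIEW_EXCERPT, QUERY_TERMS))             # [] : no term hits
print(matches(REVIEW_EXCERPT, QUERY_TERMS + ["ship"]))  # ['ship']
```

Adding "ship" fixes this one review, but the same term would also hit content about freight or package delivery, which is the kind of test-and-refine cycle a query-based definition requires.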

The concept topic, on the other hand, required minimal time and effort to create and succeeded the first time. It worked because the Concept Matrix understands that words like cruise, food, and entertainment are related to travel, so it is reasonable to place the review into that category.

Concept Topics vs. Model-Based Classifiers

Considering the same review text, it is perfectly reasonable to believe that one could build a model-based classifier that would do a great job of finding tourism-related content. The problem is, it would take way more time than just configuring a concept topic. And what about when you want to change a definition a little bit? Concept topics are excellent because you don't have to retrain a whole concept topic; you just add and subtract from the definition.

Where Don't Concept Topics Work Well?

Given the ease of defining concept topics and the high degree of accuracy in broad topic areas like Food, Travel, and Business, it is easy to assume that concept topics are a magic solution for any and all categorization problems. 

However, while they represent a huge advance in categorization technology, they are not a cure-all solution. If, for example, you were conducting a very detailed, low-level categorization of drug interaction reviews and wanted to classify content by drugs for disease categories (e.g. stomach cancer drugs), then concept topics would not work.

In these detailed cases a query-based or training-based model will be more effective and will require less effort than they would when trying to define a broad/general category.

This is also new technology for us, with profound implications for how we approach categorization of content. We are still working to understand all the limitations and already have some improvements in mind.

While not perfect for every situation, concept topics are a significant enhancement for general-purpose categorization because they solve the very difficult problem of categorizing into generic, high-level categories: a historically difficult task due to the breadth of things that fall into them.