Classification is a key value proposition for text analytics - it allows users to quickly drill into articles of interest and look at trends over time. Setting up a classification scheme, however, can be a lot of work.
The common techniques are:
- Using queries to bucket documents
- Using a machine learning model based on tagged document sets
Model-based categorization starts with humans marking each piece of content with all of the categories it satisfies. Something like "Buying kimonos while on holiday" should fit into both Travel and Fashion. Generally, you need hundreds of examples per category. Then you set a machine-learning algorithm loose on the data set. The machine analyzes all of the text in each marked document and constructs some kind of signature for each category. This looks easy on the surface - deciding whether something should go in one category or another is a relatively easy decision for a person to make. A big issue with this approach, though, is "overfitting", where the algorithm performs really well on the exact content you fed it but poorly elsewhere. Changing its behavior is not easy, nor is it obvious why something got categorized the way it did.
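To make the "signature" idea concrete, here is a minimal sketch in Python. It is a deliberately crude stand-in for a real learning algorithm, not how any particular product works: the signature is just the bag of words seen in the tagged documents, and the names `train`, `classify` and the `threshold` parameter are made up for illustration.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(tagged):
    """tagged: list of (text, set of categories). Returns {category: Counter of words}."""
    signatures = {}
    for text, categories in tagged:
        for category in categories:
            signatures.setdefault(category, Counter()).update(tokens(text))
    return signatures

def classify(text, signatures, threshold=0.5):
    """Assign every category whose signature contains at least `threshold` of the text's words."""
    words = tokens(text)
    assigned = set()
    for category, signature in signatures.items():
        known = sum(1 for w in words if w in signature)
        if words and known / len(words) >= threshold:
            assigned.add(category)
    return assigned

# Hypothetical tagged examples; a real model needs hundreds per category.
tagged = [
    ("Buying kimonos while on holiday", {"Travel", "Fashion"}),
    ("Cheap flights for your summer holiday", {"Travel"}),
    ("Runway looks from fashion week", {"Fashion"}),
]
signatures = train(tagged)

seen = classify("Buying kimonos while on holiday", signatures)
unseen = classify("Best hiking boots for a mountain trip", signatures)
```

With so little training data, `seen` comes back with both categories while `unseen` comes back empty, even though a person would file it under Travel. That is overfitting in miniature, and nothing in the signature tells you which words to change to fix it.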
Query-based categorization starts with humans trying out search terms that should define the category and seeing how well they retrieve. So, in a taxonomy with categories such as Sports, Travel and Fashion, some human gets to decide the keywords appropriate for each one. Often this is done by people looking at articles and trying to figure out which words they should choose for their queries - they might pick "kimono" and "japan", for instance, in my example above. This is difficult work - it requires skilled labor plus a lot of time. However, queries are very transparent - you can see immediately why something matched - and they are easy to change. When you realize that "kimono" is not very predictive of fashion articles in general, you can delete that word and put in something else.
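As a rough illustration of why queries are transparent, the sketch below treats each category as a plain OR of keywords and reports which keywords fired. The keyword lists and the `categorize` helper are invented for this example; a real query language would also handle stemming, phrases and operators.

```python
import re

# Hypothetical hand-written queries: each category is an OR of keywords.
# Note that "kimono" and "kimonos" are listed separately because this sketch does no stemming.
CATEGORY_QUERIES = {
    "Travel": {"holiday", "flight", "hotel"},
    "Fashion": {"kimono", "kimonos", "dress", "runway"},
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def categorize(text, queries=CATEGORY_QUERIES):
    """Return {category: sorted list of keywords that matched} for every query that hits."""
    found = set(tokenize(text))
    hits = {}
    for category, keywords in queries.items():
        matched = sorted(keywords & found)
        if matched:
            hits[category] = matched
    return hits

result = categorize("Buying kimonos while on holiday")
```

The return value doubles as the explanation: each assigned category comes with the exact terms that caused the match, and changing the behavior is a one-word edit to `CATEGORY_QUERIES`.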
There are pros and cons to each. Building queries requires a fair amount of thought and a lot of iteration, but queries are easy to change if the data changes, and the results are transparent. On the other hand, models just require users to tag documents and the machine does the heavy lifting, but the results are opaque and adjusting a model for data changes requires a user to re-tag content - potentially quite a lot of content.
Salience supports query-based classification as well as concept queries, which are model-based, even though they use keywords. As you would expect, query-based categorization takes effort to create the proper queries, while concept queries are less predictable in what they return.
Recently we worked with a customer who wanted to classify content via queries, but found the amount of time required to create the queries unacceptable. The taxonomy was not particularly large, but the pieces of content were short, and too much time was being spent finding query terms that returned only a single document, or just a few. It wasn't obvious to the query creators that these terms were "onesies" just from looking at the content. We tried using concept queries, but the content had a lot of specialized terms that were not in the concept query model.
We decided that in order to keep the query features the customer liked (transparency, ability to quickly modify) and reduce the time and effort, we would try generating query terms automatically - in a way, building a model whose interior was exposed to the user. We started by having users tag a set of documents for each node in the taxonomy, and we used Salience to retrieve n-grams from the tagged text. Next, we looked at the predictive power of each n-gram - if we just used that n-gram as a query, what would the precision and recall (P/R) for that term be? Then we looked at what the P/R numbers were when we combined terms with OR, AND, NEAR (essentially a subset of AND) and NOT. Since we had the P/R numbers for each n-gram individually, we could set a desired precision or recall threshold and only add terms that passed that threshold on their own, or that passed in combination with another n-gram.
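The general shape of that process can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation we ran: n-grams come from a simple regex tokenizer rather than Salience, only OR and pairwise AND are tried (NEAR and NOT are left out), the clauses are picked greedily, and `build_query` and `min_precision` are hypothetical names.

```python
import re
from itertools import combinations

def ngrams(text, max_n=2):
    """Set of word n-grams (n = 1..max_n) in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

def precision_recall(matched, positives):
    """Precision and recall of a set of matched doc ids against the tagged doc ids."""
    true_pos = len(matched & positives)
    precision = true_pos / len(matched) if matched else 0.0
    recall = true_pos / len(positives) if positives else 0.0
    return precision, recall

def build_query(docs, positives, min_precision=0.8):
    """docs: {doc_id: text}; positives: ids tagged with the category.

    Returns (list of clauses to OR together, set of doc ids the query matches).
    """
    # Inverted index: n-gram -> ids of the documents that contain it.
    index = {}
    for doc_id, text in docs.items():
        for gram in ngrams(text):
            index.setdefault(gram, set()).add(doc_id)

    # Only n-grams that occur in tagged documents are candidates.
    candidates = {g: ids for g, ids in index.items() if ids & positives}
    clauses = {g: ids for g, ids in candidates.items()
               if precision_recall(ids, positives)[0] >= min_precision}

    # Terms that fail alone may still pass in combination: try AND on failing pairs.
    failing = sorted(g for g in candidates if g not in clauses)
    for a, b in combinations(failing, 2):
        ids = candidates[a] & candidates[b]
        if ids and precision_recall(ids, positives)[0] >= min_precision:
            clauses[f"({a} AND {b})"] = ids

    # Greedily OR in the clause that adds the most tagged docs not yet matched;
    # ties go to the shorter clause, then alphabetically, to stay deterministic.
    query, matched = [], set()

    def gain(clause):
        return len((clauses[clause] & positives) - matched)

    while clauses:
        best = max(clauses, key=lambda c: (gain(c), -len(c), c))
        if gain(best) == 0:
            break
        query.append(best)
        matched |= clauses.pop(best)
    return query, matched

# Hypothetical mini corpus; docs 1-3 are tagged Fashion.
docs = {
    1: "silk kimono sale",
    2: "kimono fashion week",
    3: "runway fashion trends",
    4: "cheap flights to japan",
    5: "japan kimono museum tour",
    6: "hotel deals this week",
}
fashion = {1, 2, 3}
query, matched = build_query(docs, fashion)
precision, recall = precision_recall(matched, fashion)
```

On this toy corpus "kimono" is rejected on its own because it also matches the museum tour, which is exactly the kind of judgment that is hard to make by eye. The output is still an ordinary list of terms, so a user can read it, delete a clause or add one by hand. The pairwise AND step is quadratic in the number of failing terms, so a real run would need to prune candidates first.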
The initial results from this were promising - instead of having to create queries by hand, which is hard to do, users only had to tag documents, and they received back a query that met their P/R thresholds and was user-refinable.
Their next question was - how do I know how many documents to tag? What we did was use an increasingly large subset of the tagged documents to generate queries, and look at the maximum P/R numbers we could derive from just that subset. We then plotted those to illustrate how much was gained from each additional step up in the number of documents. These numbers varied from category to category, and they also revealed which categories were going to be problematic. If doubling the number of tagged documents doesn't move the needle much, there's probably something inherently problematic with the category. An example of this was a category which was "all other issues." It's difficult to describe what something isn't - what isn't a chair? An elephant? Comfy socks?
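A sketch of that measurement, again in Python and again with assumptions spelled out: `make_or_query` is a simplified stand-in for the query generator (single words only, precision threshold only), and each query is scored against the full tagged set, which is one reasonable reading of the procedure rather than a record of exactly what we did.

```python
import re

def words(text):
    """Set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def make_or_query(sample, min_precision=0.8):
    """Stand-in generator: every word whose precision on the sample passes the threshold."""
    hits, true_hits = {}, {}
    for text, in_category in sample:
        for w in words(text):
            hits[w] = hits.get(w, 0) + 1
            true_hits[w] = true_hits.get(w, 0) + int(in_category)
    return {w for w in hits if true_hits[w] / hits[w] >= min_precision}

def evaluate(query, tagged):
    """Precision and recall of an OR query over a list of (text, in_category) pairs."""
    matched = [in_category for text, in_category in tagged if query & words(text)]
    positives = sum(in_category for _, in_category in tagged)
    true_pos = sum(matched)
    precision = true_pos / len(matched) if matched else 0.0
    recall = true_pos / positives if positives else 0.0
    return precision, recall

def learning_curve(tagged, sizes):
    """For each size k, build a query from the first k tagged docs and score it on all of them."""
    return [(k, *evaluate(make_or_query(tagged[:k]), tagged)) for k in sizes]

# Hypothetical tagged set for one category (True = belongs to Fashion).
tagged = [
    ("silk kimono sale", True),
    ("cheap flights to japan", False),
    ("runway fashion trends", True),
    ("hotel deals this week", False),
    ("kimono fashion week", True),
    ("japan kimono museum tour", False),
]
curve = learning_curve(tagged, sizes=[2, 4, 6])
for k, precision, recall in curve:
    print(f"{k} docs: precision={precision:.2f} recall={recall:.2f}")
```

Plotting `curve` per category gives the picture described above: a category whose numbers keep climbing is worth more tagging, while one that stays flat as the sample doubles is probably ill-defined.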
Obviously, these results are content-specific. Would this technique work as well for longer-form content, or for a more generalized taxonomy? We haven't tried that yet - a content set where single n-grams are not predictive of categories would make this difficult.
In sum, generating queries from a tagged data set combines the easier task of tagging documents with the maintainability and transparency of queries.