I often hesitate to offer benchmarks and general metrics for sentiment accuracy, because it depends on so many factors: the type of data you're dealing with, the people who hand-tagged the sentiment library you're working from, how much sleep they each got the night before, the complexity of the language that your industry uses (financial and medical data is particularly arcane)... the list goes on.
That said, it's natural to want to know how something will perform. And we often hear from prospective customers who'd like to hear a baseline number. So, in this article I'll try to break things down.
Baseline sentiment accuracy
When it comes to evaluating the sentiment (positive, negative, neutral) of a given text document, research shows that human analysts tend to agree about 80-85% of the time.
This is the baseline we (usually) try to meet or beat when we're training a sentiment scoring system.
But this also means you'll always find some text documents that two human analysts, despite their wealth of experience and knowledge, simply can't agree on.
So, how about automated sentiment analysis through natural language processing? How accurate can we get, and how can we ensure the best-possible sentiment accuracy?
Running a quick test of sentiment accuracy
For a quick test of baseline sentiment accuracy, I built a new sentiment scoring model. As recommended on an old Yahoo text analytics mailing list, I used this Movie Review Data put together by Pang and Lee for their various sentiment papers.
This data consisted of 2000 documents (1000 positive, 1000 negative). I further divided it into a training set consisting of 1800 documents (900 positive and 900 negative), and a test set of the remaining 200.
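A split like that is easy to reproduce. Here's a minimal sketch in Python (the original test used a PHP script; the function name and random seed here are my own, and the data is assumed to be two lists of review texts, one per class):

```python
import random

def train_test_split(pos_docs, neg_docs, train_per_class=900, seed=0):
    """Split balanced positive/negative reviews into train and test sets.

    Returns (train, test), each a list of (document, label) pairs.
    With 1000 docs per class and train_per_class=900, that's an
    1800-document training set and a 200-document test set.
    """
    rng = random.Random(seed)
    pos = [(doc, "pos") for doc in pos_docs]
    neg = [(doc, "neg") for doc in neg_docs]
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = pos[:train_per_class] + neg[:train_per_class]
    test = pos[train_per_class:] + neg[train_per_class:]
    rng.shuffle(train)  # don't train on all of one class first
    return train, test
```

Shuffling before the split matters: the Pang and Lee data is stored by class, so a naive head/tail split would give you a skewed test set.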
It took me about 45 seconds to train a sentiment scoring model using the training set. Then I used a quick PHP script to run it against the test set.
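The article doesn't specify what kind of model this was, but a classic baseline that trains in seconds on 1800 documents is a bag-of-words Naive Bayes classifier. The sketch below is my own illustration of that technique, not the actual system described here:

```python
import math
from collections import Counter

def tokenize(text):
    # Crude tokenizer for illustration; real systems do much more.
    return text.lower().split()

def train_nb(train_docs):
    """train_docs: list of (text, label) pairs with labels 'pos'/'neg'."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in train_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab_size = len(set(word_counts["pos"]) | set(word_counts["neg"]))
    totals = {lab: sum(c.values()) for lab, c in word_counts.items()}
    return word_counts, totals, vocab_size, doc_counts

def classify(model, text):
    word_counts, totals, vocab_size, doc_counts = model
    n_docs = sum(doc_counts.values())
    best_label, best_logprob = None, float("-inf")
    for label in ("pos", "neg"):
        # Class prior plus Laplace-smoothed word likelihoods.
        logprob = math.log(doc_counts[label] / n_docs)
        for tok in tokenize(text):
            logprob += math.log(
                (word_counts[label][tok] + 1) / (totals[label] + vocab_size)
            )
        if logprob > best_logprob:
            best_label, best_logprob = label, logprob
    return best_label
```

Training is a single pass over the corpus, which is why wall-clock times in the tens of seconds are plausible even on modest hardware.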
Now, remember that I built this sentiment model for speed as much as accuracy. Even so, I was pleasantly surprised at the results (ok, more than pleasantly surprised).
Of the 200-document test set, the model correctly identified 81 of the positive documents and 82 of the negative ones. That's a sentiment accuracy score of 81.5% (163 of 200).
Right off the bat, our basic sentiment scoring model already matched human agreement levels.
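For completeness, the arithmetic behind that score (variable names are mine):

```python
correct_positive = 81   # positive documents labelled correctly
correct_negative = 82   # negative documents labelled correctly
test_size = 200

accuracy = (correct_positive + correct_negative) / test_size
print(f"{accuracy:.1%}")  # prints 81.5%
```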
Next, I ran the same 200 test set documents against our phrase-based sentiment system. To be honest, I expected a far lower score. But again I was surprised.
Our simplest sentiment scoring models, trained on very general sentiment libraries, performed admirably, reaching 70.5% accuracy. With a domain-specific dictionary, I’m sure we could reach 80% accuracy or more.
What does this tell us?
So, what can we learn from this quick sentiment accuracy test?
Well, for one thing, it shows how even a quickly trained automated sentiment scoring model can reach the 80-85% human agreement baseline.
Of course, the best results will always come from analyzing domain-specific content with a sentiment scoring model trained on similar content.
For example, if you analyze a data set of financial content using a model trained on movie reviews, the results won't be nearly so good. But the same data set, when analyzed by a system that's built to understand financial language, can achieve very high sentiment accuracy rates.
That said, this test shows how phrase-based sentiment scoring can produce good results, even in its most basic state.
Further reading on sentiment accuracy
Documentation: Lexalytics NLP Glossary