Sentiment Accuracy: Explaining the Baseline and How to Test It

I often hesitate to offer benchmarks and general metrics for sentiment accuracy. Why? Because sentiment accuracy depends on so many factors: the type of data you’re dealing with, the people who hand-tagged your sentiment library, how much sleep they each got the night before, the complexity of the language that your industry uses (financial and medical data is particularly arcane)… the list goes on.

That said, it’s natural to want to know how something will perform. And I often talk with prospects and customers who’d like to hear a baseline number. So, in this article I’ll try to break things down.

Setting a baseline sentiment accuracy rate

When evaluating the sentiment (positive, negative, neutral) of a given text document, research shows that human analysts tend to agree around 80-85% of the time. This is the baseline we (usually) try to meet or beat when we’re training a sentiment scoring system. But it also means you’ll always find some text documents that even two experienced, knowledgeable human analysts can’t agree on.

But when you’re running automated sentiment analysis through natural language processing, you want to be certain that the results are reliable. So, how accurate can we get, and how can we ensure the best-possible sentiment accuracy?

How to test sentiment accuracy (an example)

For a quick test of baseline sentiment accuracy, I built a new sentiment scoring model. As recommended on an old Yahoo text analytics mailing list, I used the Movie Review Data set (link since removed) put together by Pang and Lee for their various sentiment papers.

This data consisted of 2000 documents (1000 positive, 1000 negative). I further divided it into a training set consisting of 1800 documents (900 positive and 900 negative), and a test set of the remaining 200.
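
If you want to reproduce a split like this yourself, here’s a minimal Python sketch. It assumes the Pang and Lee data has been unpacked into two folders of plain-text reviews, reviews/pos and reviews/neg; those folder names (and the fixed random seed) are illustrative assumptions, not part of the original test setup.

```python
# Minimal sketch of the 1,800/200 split described above.
# Assumes the Pang & Lee polarity data sits in two folders of
# plain-text files: reviews/pos/ and reviews/neg/ (assumed paths).
import random
from pathlib import Path

def load_docs(folder):
    # Read every review file in the folder as one string.
    return [p.read_text(encoding="utf-8", errors="ignore")
            for p in sorted(Path(folder).glob("*.txt"))]

pos_docs = load_docs("reviews/pos")   # 1,000 positive reviews
neg_docs = load_docs("reviews/neg")   # 1,000 negative reviews

random.seed(42)                       # arbitrary seed, for repeatability
random.shuffle(pos_docs)
random.shuffle(neg_docs)

# 900 of each class for training, the remaining 100 of each for testing.
train_texts  = pos_docs[:900] + neg_docs[:900]
train_labels = ["pos"] * 900 + ["neg"] * 900
test_texts   = pos_docs[900:] + neg_docs[900:]
test_labels  = ["pos"] * 100 + ["neg"] * 100
```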

It took me about 45 seconds to train a sentiment scoring model using the training set. Then I used a quick PHP script to run it against the test set.
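
I can’t share our internal tooling or the PHP script here, but if you want a rough feel for the train-then-score workflow, the sketch below (continuing from the split above) substitutes an off-the-shelf bag-of-words Naive Bayes model from scikit-learn. Treat it as an illustrative stand-in, not the sentiment scoring model described in this article, and expect your exact numbers to vary.

```python
# Illustrative stand-in for the "train, then score the test set" step,
# using scikit-learn instead of the system described in the article.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=2),  # unigrams + bigrams
    MultinomialNB(),
)
model.fit(train_texts, train_labels)       # train on the 1,800 documents

predictions = model.predict(test_texts)    # score the 200 held-out documents
print(f"accuracy: {accuracy_score(test_labels, predictions):.1%}")
```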

The results

Now, remember that I built this sentiment model for speed as much as for accuracy. Even so, the results surprised me (pleasantly).

Of the 200-document test set, the model correctly identified 81 of the positive documents and 82 of the negative ones. That’s 163 of 200 correct, for a sentiment accuracy score of 81.5%. In other words, right off the bat, our basic sentiment scoring model already matched human agreement levels.

Next, I ran the same 200-document test set through our phrase-based sentiment system. To be honest, I expected a far lower score. But I was pleasantly surprised.

Our simplest phrase-based sentiment scoring, built on very general sentiment libraries, performed admirably, reaching 70.5% accuracy. With a domain-specific dictionary, I’m confident we could reach 80% accuracy or more.
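
To make “phrase-based” concrete: the idea is to look up sentiment-bearing phrases from a hand-built library and combine their weights. The toy sketch below shows that general mechanism only; the phrases, weights, and threshold are invented for illustration and bear no resemblance to a real production sentiment library.

```python
# Toy illustration of phrase-based (lexicon-based) scoring -- not our
# actual system. A real sentiment library holds many thousands of entries.
SENTIMENT_PHRASES = {
    "brilliant": 0.8, "wonderful": 0.7, "highly recommend": 0.9,
    "waste of time": -0.9, "boring": -0.6, "terrible": -0.8,
}

def phrase_score(text):
    # Sum the weights of every library phrase found in the document.
    text = text.lower()
    return sum(weight for phrase, weight in SENTIMENT_PHRASES.items()
               if phrase in text)

def phrase_label(text, threshold=0.0):
    # Positive total -> "pos", otherwise "neg" (no neutral band in this toy).
    return "pos" if phrase_score(text) > threshold else "neg"

print(phrase_label("A brilliant, wonderful film. Highly recommend it."))  # pos
print(phrase_label("A boring, terrible waste of time."))                  # neg
```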

What does this tell us?

So, what can we learn from this quick sentiment accuracy test?

Well, for one thing, it shows that automated sentiment scoring can reach or even exceed the 80-85% human agreement baseline.

Of course, the best results will always come from analyzing domain-specific content with a sentiment scoring model trained on similar content.

For example, if you analyze a data set of financial content using a model trained on movie reviews, the results won’t be nearly as good. But analyze the same data set with a system configured to understand financial language, and you’ll find you can achieve very high sentiment accuracy without much extra effort.

That said, this test shows how phrase-based sentiment scoring can produce good results, even in its most basic state.

Further reading on sentiment accuracy

Research paper: SentiBench – a benchmark comparison of state-of-the-practice sentiment analysis methods

Explainer: What is Sentiment Analysis, How Does it Work, and How is it Used?

Documentation: Lexalytics (an InMoment company) NLP Glossary