About a year ago, our CTO Mike Marshall did some accuracy testing on sentiment using our software. The point wasn’t so much to showcase Lexalytics’ capabilities as to show that automated sentiment analysis, done correctly, can be accurate enough to help in a business process.
One thing we do know for sure is that computers don’t change their minds about the sentiment of a given piece of text. Run the same text through the software 100 times and it will come back with the same result every time. Humans, on the other hand, can change their minds about the same piece of text, and can disagree with each other about it. But that’s okay. At Lexalytics we’ve never suggested you take human analysis out of the equation when it comes to analyzing unstructured content. In fact, our hope has always been to help the humans be more productive. The goal is to filter out the neutral content so the focus can be on the extremes: the really positive or the really negative.
I was surprised by a recent statement from Forrester Principal Analyst Suresh Vital that “in talking to clients who have deployed some form of sentiment analysis, accuracy rests at about 50 percent.” If that were true of our client base, we’d sadly be out of business. I hope that as more and more companies enter the sentiment analysis arena, they continue to test and retest their models. Below is Mike’s analysis from earlier in 2009:
Experience has also shown us that human analysts tend to agree with each other about 80% of the time, which means you are always going to find documents where you disagree with the machine. Having said all that, customers still like to be given a baseline number. It’s human nature, after all, to want to know how something will perform. So I thought I would do a little test, running the new model-based system on a known set of data.
As recommended on the Text Analytics mailing list, I used the Movie Review Data put together by Pang and Lee for their various sentiment papers. This data consists of 2000 documents (1000 positive, 1000 negative). I sliced it into a training set of 1800 documents (900 positive, 900 negative) and a test set of the remaining 200 (100 positive, 100 negative). It took about 45 seconds to train the model, and then I ran the test set against it using a quick PHP script.
Bearing in mind that this is still experimental and that we plan to make more tweaks to the model, I was pleasantly surprised (ok, more than pleasantly surprised) at the results. Our overall accuracy was 81.5%, with 81 of the 100 positive documents and 82 of the 100 negative documents correctly identified. This is right in the magic space for human agreement.
For fun, I then ran the same 200 test documents against our phrase-based sentiment system, expecting a far lower score. Again we performed better than I thought, scoring 70.5% accuracy. With a domain-specific dictionary, I’m sure that score could be pushed up towards 80% as well.
So what does all that tell us? It tells us that for a specific domain you can get very high accuracy levels, though if you ran, say, financial content against the movie-trained model, the results would be far different. It also tells us that the phrase-based sentiment technique produces good results even in its base state against a wide range of content sources (we are normally processing news-related data, after all).
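For readers who want to check the arithmetic or repeat this kind of test on their own data, here is a minimal Python sketch of the evaluation described above. It is not the PHP script Mike used and it does not call any Lexalytics API. The function names (`stratified_split`, `accuracy`) and the placeholder document IDs are assumptions made purely for illustration. Only the counts (2000 documents, 1800/200 split, 81 and 82 correct) come from the analysis.

```python
import random


def stratified_split(docs_by_label, test_per_label, seed=0):
    """Hold out `test_per_label` documents from each class (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label, docs in docs_by_label.items():
        shuffled = list(docs)
        rng.shuffle(shuffled)
        test += [(doc, label) for doc in shuffled[:test_per_label]]
        train += [(doc, label) for doc in shuffled[test_per_label:]]
    return train, test


def accuracy(correct_by_label, total_by_label):
    """Overall accuracy: correctly classified documents / all test documents."""
    return sum(correct_by_label.values()) / sum(total_by_label.values())


# Placeholder IDs standing in for the 1000 positive and 1000 negative reviews.
corpus = {
    "pos": [f"pos_{i}" for i in range(1000)],
    "neg": [f"neg_{i}" for i in range(1000)],
}
train, test = stratified_split(corpus, test_per_label=100)

# Counts reported for the model-based system: 81 positive and 82 negative correct.
model_based = accuracy({"pos": 81, "neg": 82}, {"pos": 100, "neg": 100})

# The phrase-based score of 70.5% on 200 documents corresponds to 141 correct.
phrase_based_correct = round(0.705 * 200)

print(len(train), len(test))
print(f"model-based accuracy: {model_based:.1%}")
print(f"phrase-based correct documents: {phrase_based_correct}")
```

Holding out the same number of documents from each class keeps the test set balanced, so a classifier that always guesses one label would score 50% rather than something misleadingly higher.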
So, would you agree?