Sentiment and Accuracy


In my last post on our new sentiment features, I talked about perceived accuracy for sentiment techniques and rather fudged the question of concrete numbers. That's because I'm always wary of quoting accuracy figures in a general sense: they depend on so many factors outside your control, such as the type of data, the hand-scorers' point of view, and the domains covered. Experience has also shown us that human analysts tend to agree only about 80% of the time, which means you will always find documents where you disagree with the machine. Having said all that, customers still like to be given a baseline number; it's human nature, after all, to want to know how something will perform. So I thought I would run a little test of the new model-based system on a known data set.

As recommended on the Text Analytics mailing list, I used the Movie Review Data put together by Pang and Lee for their various sentiment papers. This data consists of 2000 documents (1000 positive, 1000 negative), which I sliced into a training set of 1800 documents (900 positive and 900 negative) and a test set of the remaining 200. It took about 45 seconds to train the model, and I then ran the test set against it using a quick PHP script.

Bearing in mind that this is still experimental and that we plan to make more tweaks to the model, I was pleasantly surprised (OK, more than pleasantly surprised) by the results. Our overall accuracy was 81.5%, with 81 of the 100 positive documents correctly identified and 82 of the 100 negative ones. That is right in the magic zone for human agreement. For fun, I then ran the same 200 test documents against our phrase-based sentiment system, expecting a far lower score, but again it performed better than I thought, scoring 70.5% accuracy. With a domain-specific dictionary I'm sure that score could be pushed up towards 80% as well. So what does all that tell us?
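For the curious, the bookkeeping in that experiment can be sketched in a few lines. This is a hypothetical Python reconstruction (the original run used a quick PHP script, and the actual sentiment model is not shown here): just the per-class 1800/200 split and the accuracy arithmetic behind the reported figures.

```python
import random

def split_corpus(positive, negative, train_per_class=900, seed=42):
    """Slice 1000 pos + 1000 neg docs into 1800 train / 200 test.

    Hypothetical helper: the post doesn't say how the slice was made,
    so we shuffle each class and take the first 900 for training.
    """
    rng = random.Random(seed)
    pos, neg = positive[:], negative[:]
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = pos[:train_per_class] + neg[:train_per_class]
    test = pos[train_per_class:] + neg[train_per_class:]
    return train, test

def accuracy(correct_pos, correct_neg, test_size=200):
    """Overall accuracy from per-class correct counts."""
    return (correct_pos + correct_neg) / test_size

# The figures reported above: 81 positives and 82 negatives correct.
print(accuracy(81, 82))  # 163 / 200 = 0.815, i.e. 81.5%
```

With 100 test documents per class, 81 + 82 correct works out to exactly the 81.5% quoted.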
Well, it tells us that for specific domain sets you can achieve very high accuracy levels, though if you ran, say, financial content against the movie-trained model, the results would be very different. It also tells us that the phrase-based sentiment technique produces good results even in its base state across a wide range of content sources (we are normally processing news data, after all). So the next stage, I guess, is to come up with some sort of hybrid to give us the best of both worlds. Now, where did I put that compiler again?
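The post doesn't describe any actual hybrid design, but one simple possibility is a weighted blend of the two systems' scores, leaning towards the model-based system since it measured more accurately. Everything in this sketch (the function, the score convention, the weight) is hypothetical:

```python
def hybrid_sentiment(model_score, phrase_score, model_weight=0.7):
    """Blend two signed sentiment scores, each assumed to lie in [-1.0, 1.0].

    model_score / phrase_score stand in for the model-based and
    phrase-based systems' outputs; model_weight > 0.5 reflects the
    model's higher measured accuracy (81.5% vs 70.5% in the test above).
    """
    combined = model_weight * model_score + (1 - model_weight) * phrase_score
    return "positive" if combined >= 0.0 else "negative"

# A confident model verdict outvotes a mildly negative phrase score:
# 0.7 * 0.6 + 0.3 * (-0.2) = 0.36 -> "positive"
print(hybrid_sentiment(0.6, -0.2))
```

A refinement would be to fall back to the phrase-based score only when the model's confidence is low, but that needs the model to expose a confidence value in the first place.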

Categories: Sentiment Analysis, Technology