Recently I had one of those unfortunate circumstances where a customer of ours was unhappy with the sentiment scores produced by Salience and contacted us to help tune the engine. The document set was a couple hundred articles mentioning a particular company. About 25% of the mentions Salience scored as negative were rated by human reviewers as mostly neutral in tone.
As I dug into the articles, it became clear that most of the disagreements occurred in articles where the company in question was co-mentioned with lots of disappointing economic news - "credit crisis", "in a poor economic climate" and the like were frequent phrases. In addition, most of those articles contained only one or two mentions of the company. Although there was often little grammatical connection between the negative phrases and the company, without any positive elements to offset them, the company's score wound up weakly negative.
It seemed clear that this company was the victim of guilt by association. The result was "correct" from an engine perspective, but not from a human one. What we decided to do was to introduce a scoring step that took into account the number of mentions and the number of scoring phrases - an output from Salience called "evidence." Essentially, the client decided to score as neutral any article that contained only a single mention of the company and only a single scoring phrase, regardless of what Salience returned as the score.
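That override step can be sketched as a small post-processing function. This is a minimal illustration, not the actual Salience API; the function name, parameters, and the neutral value of 0.0 are my own assumptions for the sketch.

```python
def adjust_score(salience_score, mention_count, evidence_count):
    """Post-process an entity sentiment score from the engine.

    Hypothetical sketch: if the article contains only a single mention
    of the company AND only a single scoring phrase ("evidence"), there
    is too little signal to trust the polarity, so force a neutral 0.0.
    Otherwise, pass the engine's score through unchanged.
    """
    if mention_count <= 1 and evidence_count <= 1:
        return 0.0  # passing mention: too little evidence, call it neutral
    return salience_score


# A weakly negative score from a single passing mention gets neutralized,
# while a well-evidenced score is left alone.
print(adjust_score(-0.3, mention_count=1, evidence_count=1))  # neutral override
print(adjust_score(-0.3, mention_count=4, evidence_count=3))  # kept as-is
```

The key design point is that the rule fires only when both counts are at their minimum; an article with one mention but several scoring phrases still has enough evidence to keep its polarity.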
The idea is that, at bottom, machine scoring of tone is a statistical guess, and when there is little to base the guess on, you're likely to be wrong. Humans react poorly to something scored as negative in these cases but are more accepting of a neutral score for these "passing mentions," even if they themselves might score it as positive or negative. Another way of looking at it is that humans accept that sentiment is a continuum, and what they really don't like is for the machine to land on the polar opposite.
After we did that (and a few other steps similar in nature) we managed to get the document set into quite close agreement with the human raters. What about other documents, though? Had we succumbed to the dreaded "overfitting" problem? To find out, we assembled another group of several hundred documents picked at random from the rest of the set, had the same group of humans score them, and ran the new algorithm over them as well. We found that agreement with the humans went up by more than 10% compared to the earlier scoring algorithm. Just as important was the near elimination of "polar opposite" scores.
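The held-out evaluation boils down to two simple measurements over paired machine and human labels. A sketch of how one might compute them, with my own function names and a three-way positive/neutral/negative labeling assumed:

```python
def agreement_rate(machine_labels, human_labels):
    """Fraction of documents where the machine label matches the human label."""
    assert len(machine_labels) == len(human_labels)
    matches = sum(m == h for m, h in zip(machine_labels, human_labels))
    return matches / len(machine_labels)


def polar_opposite_count(machine_labels, human_labels):
    """Count documents where one rater said positive and the other negative -
    the disagreements humans tolerate least."""
    return sum({m, h} == {"positive", "negative"}
               for m, h in zip(machine_labels, human_labels))


# Toy example: compare old and new algorithm output against the same
# human ratings on a held-out set.
human = ["neutral", "negative", "positive", "neutral"]
old   = ["negative", "negative", "negative", "negative"]
new   = ["neutral", "negative", "positive", "negative"]

print(agreement_rate(old, human), polar_opposite_count(old, human))
print(agreement_rate(new, human), polar_opposite_count(new, human))
```

Running both metrics on a random held-out sample, rather than the tuning set, is what lets you check that the new rule generalizes instead of overfitting to the original couple hundred articles.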
But what about Twitter, you might ask, or other kinds of short documents where you will probably only ever get a single mention? Should we score all of those as neutral regardless? We weren't scoring Twitter and the like in this instance, but it's a valid criticism. My instinct is that with Twitter-style content the original problem would rarely occur. In other words, with only 140 characters, you won't have a lot of extraneous sentiment from poor economic news cluttering up the message.