Chinese Sentiment Analysis – Preliminary Accuracy Results for Highly Toned Content

Here at Lexalytics, we’re excited to be in beta with Salience Text Analysis for Chinese.

There are many features in our toolkit – sentiment, topic detection, summarization, theme extraction – but sentiment is what we’ve been best known for. With our beta release of text analytics for Chinese content, we decided to measure our document-level sentiment results against human annotations of sentiment, and to compare them with another recently released public engine that also provides automated sentiment analysis of Chinese.

What started as a basic measurement of precision and recall turned into a deeper effort to quantitatively determine how closely our sentiment analysis matches the sentiment judgment of multiple humans.


We gathered 109 pieces of Chinese content from Weibo and blog discussion forums, each of which we annotated as positive or negative. The content was filtered to items that were clearly positive or negative to a human. Our intent was not to measure the ability to detect subtle cases of sentiment.

Even though we marked content as only positive or negative, Salience, being phrase-based, can return a neutral result if it is unable to detect any sentiment at all. We think this is a valid approach – sometimes there really isn’t any sentiment.

The other sentiment engine that was tested, Chatterbox, was not observed to return a neutral result; it appears that all content is categorized as positive or negative, with an associated strength score. It could be argued that a polar result with a small strength score could be considered “neutral”.
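If one wanted to test that argument, a polar result could be folded into a three-way label with a simple cutoff. The sketch below is hypothetical: the function name, the label strings, and the 0.2 cutoff are illustrative assumptions of ours, not part of the Chatterbox API and not something we applied in this test.

```python
def to_three_way(polarity: str, strength: float, neutral_below: float = 0.2) -> str:
    """Map a polar result to "pos", "neg", or "neu".

    polarity      -- "pos" or "neg", as reported by a polar-only engine
    strength      -- magnitude of the sentiment score (sign is ignored)
    neutral_below -- hypothetical cutoff; weaker results are treated as neutral
    """
    if polarity not in ("pos", "neg"):
        raise ValueError(f"unexpected polarity: {polarity!r}")
    return "neu" if abs(strength) < neutral_below else polarity


# A weak positive is treated as neutral; a strong negative stays negative.
print(to_three_way("pos", 0.05))
print(to_three_way("neg", 0.90))
```

Any such cutoff would need to be tuned against annotated content, which is why we report the Chatterbox results as returned.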

Additionally, the phrase-based approach that Salience uses to assess sentiment in Chinese was developed from longer-form text, but this test needed shorter text to fit a content length constraint imposed by the Chatterbox API.

Precision, Recall, and F1

The table below gives the precision, recall, and their harmonic mean (F1) for positive and negative sentiment within the set. The F1 score for positive sentiment and the F1 score for negative sentiment are combined to calculate an overall accuracy measure.

[Table: precision, recall, and accuracy comparison for Chatterbox and Salience]
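For readers who want to reproduce this kind of measurement on their own annotations, here is a minimal sketch of per-label precision, recall, and F1. The label strings ("pos", "neg", "neu") and the toy data are assumptions for illustration only; they are not our test set, and this is not the exact script we used.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[str], predicted: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one label.

    A neutral prediction counts as a miss for whichever label the
    human annotator assigned, since the gold labels are only pos/neg.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: human annotations versus engine output.
gold = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neu", "neg", "pos", "neg"]

for label in ("pos", "neg"):
    p, r, f = precision_recall_f1(gold, predicted, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```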

These scores are quite good for both engines, which can be attributed in part to the polarity of the test content selected. Salience performs comparably to Chatterbox on positive sentiment, with slightly better performance on the detection and identification of negative sentiment.

As we have developed support for non-English languages, we have done so at a core level of deconstructing the language and developing the support needed to handle the language natively. We feel this is a much better approach to NLP of non-English languages than using machine translation to apply techniques developed for English to translated text. To test this, we ran the Chinese content through the Google Translate API and then analyzed the translated text with Salience’s standard distribution for English.

[Table: precision and recall comparison with machine translation]

As you can see from the table above, a machine translation approach suffers from any translation issues, particularly when working with phrase-based detection of sentiment, where other linguistic modifiers such as negations and intensifiers are taken into account. A model-based sentiment approach may be less affected by these issues, but it will be less flexible to use across varied content domains and will require more technical tuning effort.

Inter-rater agreement

To me, this is the most interesting part of the experiment. Automated sentiment analysis is often compared to human sentiment analysis through precision and recall tests. But that assumes that across humans there would be 100% agreement. In reality, there are discrepancies in the sentiment annotation across multiple humans, and in many cases the same human can mark the same set of documents with slight differences from one day to the next. So we want to measure the consistency of multiple human annotations of the same content, and calculate the consistency of the two automated sentiment approaches to human judgment. After all, if you can’t agree with another human, why expect the machine to agree with you?

The same content was also annotated by an external contractor, a native Mandarin speaker located in China. Two inter-rater agreement measures were calculated, Krippendorff’s alpha and Cohen’s kappa. These two measures were also featured in an analysis of inter-rater agreement presented by Maritz Research at the Text Analytics World seminar in the fall of 2012.

For our dataset, both Krippendorff’s alpha and Cohen’s kappa indicated an agreement of 94% across the two humans, showing that even for a relatively small set of very polar content there is not absolute agreement between two humans.

In one particular example, we marked an article that was very prejudicial against Japan and pro-Chinese as negative, because of the emphasis of the prejudice. The contractor based in China, however, considered this to be positive content. Judging sentiment can be tricky, even for humans.

So how did the computers do?

Calculating Cohen’s kappa requires the results to be fully annotated, so for cases in which Salience returns an inconclusive result we considered what the agreement would be if that result were taken to be positive or negative. In other words, “0=neg” means that if Salience returned a result of 0 (one that we would normally consider to be neutral), we consider this result to be “negative”.
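The forced-label idea can be sketched as follows. This is a minimal illustration of Cohen’s kappa under the two mappings, not the script we ran; the helper names, the label strings, and the toy data are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same, non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def resolve_neutral(labels: Sequence[str], neutral_as: str) -> List[str]:
    """Force every neutral ("neu") result into the given polar label."""
    return [neutral_as if label == "neu" else label for label in labels]


# Toy data: one human rater versus an engine that sometimes returns neutral.
human = ["pos", "pos", "neg", "neg", "pos", "neg"]
engine = ["pos", "neu", "neg", "neg", "neg", "neg"]

kappa_zero_as_neg = cohens_kappa(human, resolve_neutral(engine, "neg"))
kappa_zero_as_pos = cohens_kappa(human, resolve_neutral(engine, "pos"))
print(f"0=neg: {kappa_zero_as_neg:.3f}")
print(f"0=pos: {kappa_zero_as_pos:.3f}")
```

Running both mappings shows how sensitive the score is to the way inconclusive results are resolved, which is why we report both.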

[Figure: Cohen’s kappa scores]

Cohen’s kappa also only allows us to calculate agreement between two raters. The conclusion we can draw from this calculation is that the two humans reach 94% agreement on their ratings of the content, while Salience reaches between 74% and 84% agreement with the humans, slightly better with one human than the other and slightly better when inconclusive results are considered negative (or not-positive). Chatterbox fares worse, with about 62% to 64% agreement with humans.

The calculation of Krippendorff’s alpha is more flexible. It allows for gaps in the annotations, which accommodates cases in which Salience did not detect sentiment, and it allows agreement to be measured across a group of more than two raters.

[Figure: Krippendorff’s alpha inter-rater agreement]
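To make the handling of gaps concrete, here is a minimal sketch of Krippendorff’s alpha for nominal labels, built from a coincidence matrix. It is an illustration under our own assumptions (function name, label strings, `None` for a missing rating, toy data), not the tool we used for the numbers in this post.

```python
from collections import Counter
from itertools import permutations
from typing import Optional, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Optional[str]]]) -> float:
    """Krippendorff's alpha for nominal data.

    units -- one inner sequence per item, holding each rater's label;
             None marks a missing rating (for example, no sentiment detected).
    """
    coincidences: Counter = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an item rated by fewer than two raters carries no information
        # Every ordered pair of ratings from different raters is one coincidence,
        # weighted so that each item contributes m values in total.
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    disagreement = sum(w for (a, b), w in coincidences.items() if a != b)
    chance = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if chance == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1 - (n - 1) * disagreement / chance


# Toy data: columns are human 1, human 2, engine; None means no sentiment returned.
ratings = [
    ["pos", "pos", "pos"],
    ["pos", "pos", None],
    ["neg", "neg", "neg"],
    ["neg", "pos", "neg"],
    ["neg", "neg", None],
    ["pos", "pos", "neg"],
]
alpha = krippendorff_alpha_nominal(ratings)
print(f"alpha = {alpha:.3f}")
```

Because missing ratings are simply left out of the pairings, nothing has to be forced into positive or negative, and adding a fourth rater only means adding another column.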

The most interesting chart is below: for multiple raters (more than two), what are the best combinations? Because we’re not forcing a “0” from Salience into either positive or negative, the agreement numbers end up better. We’re at about 75% agreement across humans + Salience + Chatterbox. (Which says good things about the state of sentiment analysis!) We’re, of course, happiest with the results between two humans and Salience, where we’re pushing 90%.

[Chart: Krippendorff’s alpha scores for groups of more than two raters]

These results show that Salience’s analysis of sentiment in Chinese content correlates well with human judgment. Perhaps not quite well enough to satisfy a Turing condition of generating results which are indistinguishable from those of a human, but certainly close enough to serve as a good starting point from which further phrase-based sentiment tuning can refine results.  


We think these results validate our approach of native, phrase-based NLP sentiment analysis for Chinese over a machine translation approach or a classification model, and show that on highly toned content, Salience is a compelling option. At present, our attention is focused on including named entity recognition for our general release of support for Chinese. We’re pleased with the results of this assessment of our initial document sentiment analysis, and we’re looking forward to bringing the full product to market.
