As I said in Part 1 of this series, the introduction of our latest Language Pack for our Salience text analysis engine has us interested in issues surrounding the Chinese language and its analysis. Last time I spoke about the controversies surrounding the Windows 8 ads released in Asia and the extreme dialectal differences in spoken Chinese. This time, I'll be discussing written Chinese and the unique challenges we've faced while developing Salience for Chinese.
Somewhat counterintuitively, the dialectal differences within the Chinese language family aren't the biggest obstacle to written text analysis. Although there are many phonetic differences between topolects, the enforcement of a written standard means that they all share the same written characters, which is incredibly helpful for text analytics. Even so, slang and vocabulary do change from place to place, but that is a common feature of most widespread languages, including English. Carl Lambrecht has already discussed how Salience easily overcomes these differences in other languages with contextual part-of-speech tagging and the versatile customization of the software's lexicon.
The biggest difficulties with Chinese text and sentiment analysis stem from the unique way a language such as Chinese is constructed. We’ll take you through the three biggest challenges and how we’re working to solve them:
- Simplified vs. Traditional Characters: While Simplified characters are the mainland standard, Traditional characters are still widely used in places such as Hong Kong and Taiwan. This problem was one of the easiest to deal with: a basic one-to-one character mapping accounts for the presence of both.
- Named Entity Extraction: In languages that we’ve worked with before, capitalization has functioned as a useful way of identifying entities. Capitalization isn’t present in Chinese writing, making it harder for the software to recognize entities. While our named entity extraction for Chinese is still in development, we’re looking at an approach that uses machine learning algorithms alongside rules to get the most accurate results.
- Word Segmentation: Chinese text is composed of distinct characters, each of which may represent a word by itself or in conjunction with other characters. To tackle this, we’ve used statistical machine learning algorithms in a process called "tokenization". An interesting side effect of this is that we're able to deal with multi-word hashtags in other languages (e.g. #ilovelexalytics).
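To illustrate the first point, the one-to-one character mapping can be sketched in a few lines of Python. The three character pairs below are genuine Simplified/Traditional correspondences, but they are only an illustration; a production table covers thousands of characters, and this is not Salience's actual implementation.

```python
# A tiny illustrative slice of a Simplified -> Traditional character table.
# A real table covers thousands of characters; these three pairs are examples.
SIMP_TO_TRAD = {
    "国": "國",  # country
    "爱": "愛",  # love
    "学": "學",  # study
}

def to_traditional(text: str) -> str:
    """Map Simplified characters to Traditional, leaving everything else as-is."""
    return "".join(SIMP_TO_TRAD.get(ch, ch) for ch in text)

print(to_traditional("我爱中国"))  # -> 我愛中國
```

Because the mapping is (mostly) one character to one character, the same dictionary run in reverse normalizes Traditional input to Simplified before analysis.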
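As for word segmentation, a classical baseline that shows the shape of the problem is greedy forward maximum matching against a lexicon. To be clear, Salience uses statistical machine learning rather than this simple technique; the sketch below, with made-up toy lexicons, just demonstrates why the same machinery that splits Chinese characters into words can also split a multi-word hashtag.

```python
def max_match(text: str, lexicon: set, max_word_len: int = 8) -> list:
    """Greedy forward maximum matching: at each position, take the longest
    lexicon entry that matches; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one character.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:  # single-character fallback
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Toy lexicons for illustration only.
print(max_match("我爱北京", {"我", "爱", "北京"}))                    # -> ['我', '爱', '北京']
print(max_match("ilovelexalytics", {"i", "love", "lexalytics"}, 12))  # -> ['i', 'love', 'lexalytics']
```

Maximum matching stumbles on ambiguous character sequences, where several segmentations are dictionary-valid, which is exactly why statistical models tend to win out in practice.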