Elizabeth Baran discusses the development of the Chinese Native Language Pack for Salience, the beta version of which was released earlier this year. She covers the unique and interesting challenges of developing Salience for Chinese as well as what Salience can currently do with Chinese, and what it will be able to do in the future.
Baran starts out with a discussion of some of the challenges specific to Chinese. Of particular interest are word formation and the lack of word boundaries in written text. Written Chinese is composed of characters that each carry their own meaning, and a word can consist of one or several characters. With no marking to indicate where a word begins and ends, tokenization, the process of breaking text up into meaningful elements, can be difficult.
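To make the tokenization problem concrete, here is a minimal Python sketch of forward maximum matching, a simple dictionary-based way to split unspaced text into words. This is not a description of how Salience tokenizes Chinese, which the talk summary does not specify. The `segment` function, the `max_len` parameter, and the tiny `LEXICON` are all hypothetical and chosen only for illustration.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word that starts there, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (an assumption for this example, not a real dictionary).
LEXICON = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}

# "I like natural language processing", written without spaces.
print(segment("我喜欢自然语言处理", LEXICON))
```

Because the matcher is greedy, it prefers the four-character entry 自然语言 over the shorter 自然 and 语言. A real system has to cope with ambiguous splits and with words missing from the dictionary, which is a large part of why Chinese tokenization is hard.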
Currently, the beta version of Salience for Chinese supports concept topics, document details, n-grams, and sentiment analysis. The one major function unavailable in the beta is Named Entity Extraction, which will be included in the full release, due sometime this month. Salience can interpret Mandarin written in both simplified and traditional characters. It does not support Cantonese.
Elizabeth demonstrates Salience’s capabilities using two examples: a Chinese technical article and a Chinese financial article. For both articles, Salience is able to provide major themes, sentiment analysis, and character-level n-grams up to 4-grams.
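For readers unfamiliar with character-level n-grams, the following Python sketch counts every run of one to four consecutive characters in a string. It only illustrates the general idea and is not Salience's implementation. The function name `char_ngrams` and the choice to drop whitespace are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, max_n=4):
    """Count character n-grams for n = 1..max_n, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n characters across the text.
        for i in range(len(chars) - n + 1):
            counts["".join(chars[i:i + n])] += 1
    return counts


# 股票 means "stock"; the toy input simply repeats it.
counts = char_ngrams("股票股票")
for gram, freq in counts.most_common():
    print(gram, freq)
```

Working at the character level avoids the word-boundary problem entirely, because no segmentation is needed before counting. The trade-off is that some n-grams, such as 票股 here, straddle a word boundary and are not meaningful words.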
Elizabeth Baran is Lexalytics’ Chinese Language Expert. She is fluent in English, French, and Mandarin, and lived in China for eight months while participating in a language immersion program. She holds a major in Chinese Language and a double minor in Linguistics and French from Georgetown University. Additionally, Elizabeth has published two papers on Chinese NLP, which were funded by the National Science Foundation under a Chinese-English Machine Translation grant. She was a member of the Chinese Natural Language Processing Group at Brandeis University (2010-2012) and presented her research on automatically predicting noun number in Chinese at Machine Translation Summit XIII in Xiamen, China.