The Evolution of Multi-Lingual Support at Lexalytics

At Lexalytics, we know it’s not only a global marketplace, but a multi-lingual global marketplace. It’s this understanding that has driven us to extend the capabilities of Salience beyond the analysis of English.

Salience was originally developed by Lexalytics to analyze English text, traditionally from online news articles and blogs. As the product has matured, we have added and improved increasingly complex analysis of online content, including Twitter.

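| Language | Core NLP | Entities | Themes | Sentiment    | Query Topics | Other Features |
|----------|----------|----------|--------|--------------|--------------|----------------|
| EN       | Y        | Y        | Y      | Phrase/Model | Y            | Y              |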

The first non-English language we tackled was French. Our first release of support for French provided almost all of the features found in our analysis of English: part-of-speech tagging, novel named entity extraction, theme extraction, summarization, and model-based sentiment analysis. The only feature in our English language support that was not available in this initial release was phrase-based sentiment analysis. Compared with model-based sentiment analysis, phrase-based sentiment is much more tunable and transparent: you know exactly which phrases influenced the results as well as how, and can tune the phrases to your needs accordingly. An update followed shortly which added improved handling of French social media content.
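
To make the "tunable and transparent" point concrete, here is a minimal sketch of phrase-based scoring in Python. It is not Salience's implementation or API: the phrase list, the weights, and the function name `score_phrases` are all hypothetical, chosen only to illustrate the idea that every contribution to the score can be traced back to a specific phrase.

```python
import re

# Hypothetical phrase dictionary (phrase -> sentiment weight).
# These entries and weights are illustrative assumptions, not Salience data.
PHRASE_WEIGHTS = {
    "excellent": 0.8,
    "easy to use": 0.6,
    "slow": -0.4,
    "waste of money": -0.9,
}


def score_phrases(text, weights=PHRASE_WEIGHTS):
    """Return (total score, matched phrases) for a piece of text.

    Each match is reported as a (phrase, weight) pair, so the caller can
    see exactly which phrases influenced the total and by how much.
    """
    lowered = text.lower()
    hits = []
    for phrase, weight in weights.items():
        # Whole-word matching, so "slow" does not fire on "slowly".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for _ in re.finditer(pattern, lowered):
            hits.append((phrase, weight))
    total = sum(weight for _, weight in hits)
    return total, hits


review = "Excellent camera and easy to use, but the app is slow."

# Transparent: the list of hits explains the score.
total, hits = score_phrases(review)

# Tunable: override one weight for a domain where slowness matters more.
tuned_weights = dict(PHRASE_WEIGHTS, slow=-0.8)
tuned_total, _ = score_phrases(review, tuned_weights)
```

A model-based approach would instead return a score from a trained classifier, which is typically harder to inspect and can only be adjusted by retraining. In the sketch above, changing a single dictionary entry changes the result in a predictable way.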

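| Language  | Core NLP | Entities | Themes | Sentiment    | Query Topics | Other Features |
|-----------|----------|----------|--------|--------------|--------------|----------------|
| EN        | Y        | Y        | Y      | Phrase/Model | Y            | Y              |
| FR (2010) | Y        | Y        | Y      | Model        | Y            | Y              |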

The next languages we tackled were Spanish and Portuguese, again providing model-based sentiment analysis. Neither has yet received an update that specifically addresses the social media complications unique to it, but both benefit from the techniques built into Salience itself for handling short content.

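| Language  | Core NLP | Entities | Themes | Sentiment    | Query Topics | Other Features |
|-----------|----------|----------|--------|--------------|--------------|----------------|
| EN        | Y        | Y        | Y      | Phrase/Model | Y            | Y              |
| FR (2010) | Y        | Y        | Y      | Model        | Y            | Y              |
| ES (2011) | Y        | Y        | Y      | Model        | Y            | Y              |
| PT (2011) | Y        | Y        | Y      | Model        | Y            | Y              |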

With Salience 5.0, Lexalytics introduced the Concept Matrix™ as a core feature: a model of semantic relationships derived from Wikipedia. At the same time, we were developing support for German. Additional research and development allowed us to provide phrase-based sentiment analysis for German, making it our first non-English language pack to include this capability.

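| Language  | Core NLP | Entities | Themes | Sentiment    | Query Topics | Concepts | Other Features |
|-----------|----------|----------|--------|--------------|--------------|----------|----------------|
| EN        | Y        | Y        | Y      | Phrase/Model | Y            | Y        | Y              |
| FR (2010) | Y        | Y        | Y      | Model        | Y            | N        | Y              |
| ES (2011) | Y        | Y        | Y      | Model        | Y            | N        | Y              |
| PT (2011) | Y        | Y        | Y      | Model        | Y            | N        | Y              |
| DE (2012) | Y        | Y        | Y      | Phrase/Model | Y            | Y        | Y              |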

Over the summer of 2012, we developed updates to bring all of our existing language packs up to the same feature level. With the updates released for French (June 2012), Spanish (September 2012), and Portuguese (October 2012), every currently supported language now offers the same functionality.

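| Language  | Core NLP | Entities | Themes | Sentiment    | Query Topics | Concepts | Other Features |
|-----------|----------|----------|--------|--------------|--------------|----------|----------------|
| EN        | Y        | Y        | Y      | Phrase/Model | Y            | Y        | Y              |
| FR (2012) | Y        | Y        | Y      | Phrase/Model | Y            | Y        | Y              |
| ES (2012) | Y        | Y        | Y      | Phrase/Model | Y            | Y        | Y              |
| PT (2012) | Y        | Y        | Y      | Phrase/Model | Y            | Y        | Y              |
| DE (2012) | Y        | Y        | Y      | Phrase/Model | Y            | Y        | Y              |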

While our current language packs will see ongoing improvements (better novel entity extraction, better quotation extraction), we have our sights firmly set beyond the Romance and Germanic languages. We are hard at work applying our experience and techniques to Chinese, and aim to release our initial support in early 2013.