We’re happy to announce yet another Native Language Pack for Salience, and I’m happy to announce that I learned a new word today. The language is Korean, and the word is “agglutinative”. Let me explain:
Korean stands apart from Chinese and Japanese because its script, Hangul, is phonetic rather than pictographic. Korean characters represent sounds, whereas Chinese characters (and the kanji that Japanese borrows from them) represent a complete idea or word. As such, Korean presented a whole host of different difficulties for text analysis.
The wonderful Carl Lambrecht took some time out of his busy schedule to give me the scoop on developing Korean language support. According to him, Korean is not only phonetic, but also agglutinative. Agglutinative, my new favorite word, describes a language where you can mash pieces together to make new words. Instead of having a modifier or grammatical marker be a separate word, for example, it gets smooshed onto the word it belongs to. Germanic languages like, oh say, German, do something similar with their famously long compound words.
Sometimes when the words are combined, they bleed into each other, making it very difficult for a text mining engine to separate and identify them. Handling this required adding to our tag set so Salience could identify that a word is, say, a noun with some sort of suffix attached to it.
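To make that concrete, here is a toy Python sketch of what tagging a word as "a noun with a suffix attached" could look like. It is not Salience's tag set or algorithm: the particle list, the tag names `NOUN` and `NOUN+SUFFIX`, and the functions `tag_token` and `tag_sentence` are all made up for illustration, and it naively assumes every token is a noun.

```python
# Toy illustration only; NOT Salience's actual tag set or algorithm.
# A tiny, hand-picked list of common Korean particles (suffixes).
PARTICLES = ["에서", "은", "는", "이", "가", "을", "를", "에", "도"]


def tag_token(token):
    """Split a token into (stem, suffix, tag) by longest-suffix matching.

    Assumption for this sketch: every token is a noun, optionally
    followed by one particle from PARTICLES.
    """
    # Try longer particles first so "에서" wins over "에".
    for particle in sorted(PARTICLES, key=len, reverse=True):
        if token.endswith(particle) and len(token) > len(particle):
            stem = token[: -len(particle)]
            return stem, particle, "NOUN+SUFFIX"
    # No known particle found: treat the whole token as a bare noun.
    return token, "", "NOUN"


def tag_sentence(sentence):
    """Tag each whitespace-separated token in the sentence."""
    return [tag_token(token) for token in sentence.split()]


if __name__ == "__main__":
    for stem, suffix, tag in tag_sentence("학교에서 친구가 학교"):
        print(stem, suffix, tag)
```

A real engine has far more to deal with. This sketch would wrongly split a noun that just happens to end in one of these syllables, and it does nothing about the cases where the pieces bleed into each other.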
Each new language presents its own unique challenges, and our engineers are constantly finding new and creative ways to conquer those challenges. Korean is no different. As always, our latest language addition uses a native language approach: we enlist native speakers to develop our files and parse in the original language, so nothing is ever lost in translation.