What is Tokenization in NLP?

Tokenization is a core step in text analytics and NLP. A “token,” in natural language terms, is “an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.” Like the roots and branches of a tree, the whole of human language is a mess of natural outgrowths—split, decaying, vibrant, and blooming. Tokenization is part of the methodology we use when teaching machines about words, the foundational aspect of our most important invention.

How Tokenization in NLP Works

Imagine you want to process a sentence. One approach is to split that sentence into words and punctuation (i.e., tokens). Identifying word boundaries is relatively easy in English, since we use spaces. English punctuation can be a little more ambiguous. A period, for example, can denote the end of a sentence, but not always: consider the abbreviations Mr., Ms., or Dr.
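To make that concrete, here is a minimal sketch of a rule-based tokenizer. The regular expression and the tiny abbreviation list are illustrative assumptions, not a production-ready design; real tokenizers handle many more edge cases.

```python
import re

# A minimal, illustrative rule-based English tokenizer.
# The abbreviation list and regex are assumptions for this sketch.
ABBREVIATIONS = {"mr.", "ms.", "dr."}

def tokenize(sentence):
    # Grab word+period, plain words, or single punctuation marks.
    candidates = re.findall(r"\w+\.|\w+|[^\w\s]", sentence)
    tokens = []
    for piece in candidates:
        if piece.lower() in ABBREVIATIONS:
            tokens.append(piece)          # "Dr." stays one token
        elif piece.endswith("."):
            tokens.append(piece[:-1])     # split "trash." into "trash" and "."
            tokens.append(".")
        else:
            tokens.append(piece)
    return tokens

print(tokenize("Dr. Smith threw away the trash."))
# ['Dr.', 'Smith', 'threw', 'away', 'the', 'trash', '.']
```

Keeping a list of known abbreviations is one simple way to stop a sentence-final period rule from splitting “Dr.” into two tokens.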

Other languages, such as Mandarin, don’t use spaces to mark the separation between words, so they require a different approach to identifying what constitutes a word. Still other languages, like German, deal with verbs in a distinctive way. For example, if a verb has a separable prefix (like “throw away”), German grammar pushes it to the end of the clause. Thus, “I will throw away the trash” becomes, literally, “I will the trash away-throw.”
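As a quick illustration of the whitespace problem, splitting on spaces works reasonably for English but leaves a Mandarin sentence as one undivided string. The Chinese sentence below (roughly “I love natural language processing”) is chosen purely for illustration.

```python
# Whitespace splitting works for English but not for Mandarin,
# which has no spaces between words.
english = "I love natural language processing"
mandarin = "我爱自然语言处理"

print(english.split())   # ['I', 'love', 'natural', 'language', 'processing']
print(mandarin.split())  # ['我爱自然语言处理'] -- a single undivided "token"

# Segmenting Mandarin requires a dictionary- or model-based approach,
# e.g. a segmentation library such as jieba: list(jieba.cut(mandarin)).
```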

Understanding Lemmatization in NLP

While the position of verbs is less complicated in English, machines still need to contend with a wide array of inflectional endings on words like “caresses,” “saw,” “government’s,” and “deriving.” Each of these examples maps back to a base form, known as a lemma. To understand what a lemma is, all you need to do is imagine how the word is listed in the dictionary. “Caresses” will be listed as “caress,” “saw” as “see,” and so on. Lemmatizing the words in a dataset helps a machine increase recall. This process is an essential companion to tokenization.
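Here is a minimal sketch using NLTK’s WordNet lemmatizer, one common off-the-shelf option (spaCy and other libraries ship lemmatizers too). Note that it needs a part-of-speech hint to get “saw” right, and the outputs in the comments are what the WordNet data typically returns.

```python
# Minimal lemmatization sketch with NLTK's WordNet lemmatizer.
# Requires: pip install nltk, plus the WordNet data download below.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

# The lemmatizer needs a part-of-speech hint ("n" = noun, "v" = verb).
print(lemmatizer.lemmatize("caresses", pos="n"))  # caress
print(lemmatizer.lemmatize("saw", pos="v"))       # see
print(lemmatizer.lemmatize("deriving", pos="v"))  # derive

# Possessives like "government's" are not handled here; stripping the
# trailing "'s" is usually a separate normalization step.
print(lemmatizer.lemmatize("government's", pos="n"))  # unchanged
```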

Of course, like all text analytics, lemmatization is still a game of numbers; it simply doesn’t work every time. Even the most sophisticated lemmatization can sometimes reduce precision. Stanford points out that “operating” and “system” might be lemmatized to “operate” and “system.” In reality, sentences containing the words “operate” and “system” might bear no relation to sentences containing the phrase “operating system.”
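As a small, hedged illustration of that precision loss (reusing the WordNet lemmatizer from above, with an invented unrelated sentence), a query for “operating system” and a sentence about surgery end up sharing the same bag of lemmas:

```python
# Illustrative sketch of lemmatization hurting precision: after
# lemmatizing, a query and an unrelated sentence share the same lemmas,
# so a naive bag-of-lemmas matcher treats them as related.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def verb_lemmas(text):
    # Crude bag of lemmas: lowercase, split on spaces, lemmatize as verbs.
    return {lemmatizer.lemmatize(word.lower(), pos="v") for word in text.split()}

query = "operating system"
unrelated = "surgeons operate on the nervous system"

print(verb_lemmas(query))                            # {'operate', 'system'}
print(verb_lemmas(query) <= verb_lemmas(unrelated))  # True: a spurious match
```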

A Game of Numbers

Tokenization and its related processes play a role in making text analytics better than crude heuristics. Nonetheless, teaching a machine any language is complicated; teaching it natural, human language is bananas. When it comes to features that depend on tokenization, like named entity extraction, the importance of tuning can’t be emphasized enough. It’s one of the big reasons folks choose to buy text analytics software rather than build it themselves. This extra human element might one day be supplemented or replaced by AI, but for now it’s a vital part of tokenization and text analytics as a whole.