Tokenization in Two Minutes


Tokenization is an interesting part of text analytics. A “token” in natural language terms is “an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.” Like the roots and branches of a tree, the whole of human language is a mess of natural outgrowths—split, decaying, vibrant, and blooming. Tokenization is part of the methodology we use when teaching machines about words, the foundational part of our most important invention.

How Tokenization Works

Imagine you want to process a sentence. One approach is to split that sentence into words and punctuation (i.e., tokens). Identifying words is relatively easy in English, since we use spaces. English punctuation can be more ambiguous. A period, for example, can denote the end of a sentence, but not always: consider the abbreviations Mr., Ms., or Dr.
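The approach above can be sketched in a few lines of Python. This is a minimal, illustrative tokenizer, not a production one: it assumes a tiny hand-written abbreviation list (`ABBREVIATIONS`) so that the period in "Dr." stays attached to its token, while sentence-final punctuation is split off.

```python
import re

# Abbreviations whose trailing period is part of the token
# (illustrative only; a real tokenizer would use a much longer list).
ABBREVIATIONS = {"Mr.", "Ms.", "Dr."}

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    tokens = []
    for chunk in sentence.split():
        if chunk in ABBREVIATIONS:
            # Keep the period attached: "Dr." is one token.
            tokens.append(chunk)
        else:
            # Peel trailing punctuation (., !, ?, ,) off into its own token.
            match = re.match(r"^(.*?)([.!?,]*)$", chunk)
            word, punct = match.group(1), match.group(2)
            if word:
                tokens.append(word)
            tokens.extend(punct)
    return tokens

print(tokenize("Dr. Smith arrived late."))
# ['Dr.', 'Smith', 'arrived', 'late', '.']
```

Even this toy version shows why the period is tricky: the tokenizer can only treat "Dr." correctly because it was told about it in advance.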

Other languages, such as Mandarin, don’t use spaces to mark the separation between words. They require a different approach to identifying what constitutes a word. Still other languages, like German, deal with verbs in a wholly unique way. For example, if a verb has a separable prefix (like “throw away”), German grammar dictates that the prefix moves to the end of the sentence. Thus, “I will throw away the trash” becomes, literally, “I will the trash away-throw.”

Understanding Lemmatization

While the position of verbs is less complicated in English, machines still need to contend with a wide array of inflectional endings on words like “caresses,” “saw,” “government’s,” and “deriving.” Each of these examples contains a root word, known as a lemma. To understand what a lemma is, all you need to do is imagine how the word would be listed in the dictionary. “Caresses” would be listed as “caress,” “saw” as “see,” and so on. For a machine to increase recall, the words in a dataset need to be lemmatized. This process is an important part of tokenization.
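A lemmatizer can be thought of as a lookup from a token to its dictionary headword. The sketch below uses a toy hand-written table (`LEMMAS`) covering just the examples above; real systems such as NLTK's WordNetLemmatizer rely on full morphological dictionaries instead.

```python
# Toy lemma lookup table; a real lemmatizer uses a morphological dictionary.
LEMMAS = {
    "caresses": "caress",
    "saw": "see",
    "government's": "government",
    "deriving": "derive",
}

def lemmatize(token):
    """Return the dictionary headword for a token, or the token itself."""
    return LEMMAS.get(token.lower(), token.lower())

print([lemmatize(t) for t in ["Caresses", "saw", "deriving"]])
# ['caress', 'see', 'derive']
```

Note that “saw” can’t be handled by stripping endings at all: it maps to “see” only because the table says so, which is exactly why lemmatization needs dictionary knowledge rather than simple suffix rules.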

Of course, like all of text analytics, lemmatization is still a game of numbers; it simply doesn’t work every time. Even the most sophisticated lemmatization can sometimes reduce precision. Stanford points out how operating and system might be lemmatized to operate and system. In reality, sentences containing the words operate and system might bear no relation to sentences containing the words operating system.
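The Stanford example can be made concrete with a small sketch. Here, a query for documents about the verb “operate” is matched against lemmatized text; the two example documents and the tiny lemma table are invented for illustration.

```python
# Toy lemma table for this example only.
LEMMAS = {"operating": "operate", "operated": "operate", "systems": "system"}

def lemmatize_words(text):
    """Lowercase, split on whitespace, and map each word to its lemma."""
    return [LEMMAS.get(w, w) for w in text.lower().split()]

docs = [
    "surgeons operated on the patient",
    "install the operating system first",
]

# After lemmatization, a query for "operate" matches BOTH documents,
# even though only the first is about operating on something.
matches = [d for d in docs if "operate" in lemmatize_words(d)]
print(matches)
# ['surgeons operated on the patient', 'install the operating system first']
```

Recall goes up (the query now finds “operated”), but precision drops, because “operating” in “operating system” was collapsed into the same lemma.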

A Game of Numbers

Tokenization and its attendant processes play a role in making text analytics better than crude heuristics. Nonetheless, teaching a machine human language is complicated; teaching a machine natural language is bananas. When it comes to features that depend on tokenization, like named entity extraction, the importance of tuning can’t be emphasized enough. It’s one of the big reasons folks choose to buy text analytics software over building it themselves. This extra human element might one day be supplemented or replaced by AI, but for now it’s a vital part of tokenization and text analytics as a whole.
