From entity extraction to document summarization, text analytics is a combination of machine learning and natural language processing. Different companies mix the two in different proportions: some are all machine learning, and some are all “NLP” and rules. These functions are often cast as arcana, understood only by that ever-sexy order of wizards, the data scientists. But this doesn’t have to be the case. In this quick, high-level overview, we’re going to drill a level deeper into text analysis. At Lexalytics, we believe it’s important to have transparency into the inner workings of text analytics.
In seven minutes you’ll have a much better understanding of what text analytics is and how it works!
If you want to see these functions at work, with no obligations, check out our free web demo!
Let’s say you dump a load of Tweets, online reviews, and forum comments into a text analytics engine. The first thing that needs to happen is that this unstructured text gets broken down before any analysis can occur. This is much like what we were taught to do as kids in language arts class.
1. Language Identification

First, you need to know what language the text is in, as each language has its own peccadilloes. Spanish? Singlish? Arabic? Lexalytics supports over 22 languages (shameless plug) spanning dozens of alphabets, abjads and logographies. So, basic as it might seem, this subfunction sets the course for the rest of the analysis, and it’s very important to get it right.
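To make this concrete, here’s a toy sketch of one classic approach to language identification: scoring text against small per-language lists of common function words. The `guess_language` function, the language set, and the stopword lists are all invented for illustration; a production detector would use something like character n-gram models trained on large corpora.

```python
import re

# Tiny, illustrative stopword lists -- real systems use far richer models.
STOPWORDS = {
    "english": {"the", "is", "and", "of", "to", "in", "it"},
    "spanish": {"el", "la", "es", "y", "de", "en", "que"},
    "french":  {"le", "la", "est", "et", "de", "un", "que"},
}

def guess_language(text: str) -> str:
    # Count how many tokens appear in each language's function-word list,
    # and pick the language with the highest score.
    words = re.findall(r"[a-záéíóúñàèùç]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in stops)
        for lang, stops in STOPWORDS.items()
    }
    return max(scores, key=scores.get)

print(guess_language("The cat is in the garden."))       # english
print(guess_language("El gato es muy bonito y feliz."))  # spanish
```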
2. Tokenization

Now that we know what language the text is in, we can break it up into tokens. Tokenization is a necessary step in processing text, as it breaks the text apart into units that a machine can understand. We use the word “tokens” and not “words” because, as well as words, tokens can also be things like:
- punctuation (exclamation points affect sentiment, for instance)
- links (http://…)
- possessive markers
Now, as I said, tokenization is language-specific. For most alphabetic languages, tokenization is straightforward: use white space and punctuation to mark token boundaries. English is easy, because spaces.
Moving east, logographies (a fancy word for character-based writing systems), such as simplified Chinese, have no space breaks between words, and tokenizing those languages requires machine learning. Each language has its own tokenization requirements.
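For an alphabetic language like English, a whitespace-and-punctuation tokenizer can be sketched with a single regular expression. The pattern and the `tokenize` name below are my simplified illustration, not Lexalytics’ actual implementation; note how it keeps links whole, splits off possessive markers, and emits punctuation as its own token, matching the token types listed above.

```python
import re

TOKEN_PATTERN = re.compile(
    r"https?://\S+"      # links stay as one token
    r"|\w+(?='s\b)"      # a word directly before a possessive marker
    r"|'s\b"             # the possessive marker itself
    r"|\w+"              # ordinary words
    r"|[^\w\s]"          # any punctuation mark as its own token
)

def tokenize(text: str):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Visit http://example.com for Bob's demo!"))
# ['Visit', 'http://example.com', 'for', 'Bob', "'s", 'demo', '!']
```

Alternation order matters here: the URL pattern is tried first so that `http://example.com` isn’t shredded into words and punctuation.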
3. Sentence Breaking
Once you’ve identified the tokens, you can tell where the sentences end. (See, look at that period right there, you knew exactly where the sentence ended, didn’t you, Dr. Smart?) Now, do you see what I did there? Did the sentence end at that period at the end of “Dr.?” Now check out the punctuation in that sentence – there’s a period and a question mark right at the end of it. Will the madness never end? The point is this – you have to tell where the boundaries are on the sentences before you can figure out things like syntax. Certain communication forms <cough> Twitter <cough> are less friendly than this post, and we have ways of making them work, but we’ll leave that aside for another time.
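Here’s a rough sketch of an abbreviation-aware sentence breaker in the spirit of the “Dr.” example above. The abbreviation list and the `split_sentences` logic are illustrative stand-ins; real systems learn boundary rules per language.

```python
import re

# A stand-in abbreviation list; real engines maintain these per language.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc"}

def split_sentences(text: str):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+", text):
        # Look at the word just before the punctuation mark.
        preceding = text[start:match.start()].rsplit(None, 1)
        last_word = preceding[-1].lower().rstrip(".") if preceding else ""
        if last_word in ABBREVIATIONS:
            continue  # "Dr." is a title, not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. Was he late?"))
# ['Dr. Smith arrived.', 'Was he late?']
```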
4. PoS Tagging
Now that we’ve identified the language, tokenized the text, and performed sentence breaking, it’s time to PoS tag it. I have to admit that I giggle every time I type “PoS,” but that’s my inner adolescent speaking. Part-of-speech tagging (PoS tagging) determines the part of speech for every token in a sentence. In other words: is a given token a proper or common noun? Is it a verb or an adjective? At Lexalytics, we support 93 separate PoS tags.
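A minimal way to see the input/output shape of PoS tagging is a lexicon lookup. The tiny `LEXICON`, the tag names, and the noun fallback below are all toy assumptions; real taggers are statistical and use surrounding context to disambiguate.

```python
# Toy lexicon mapping each known word to a single part of speech.
LEXICON = {
    "the": "DET", "tall": "ADJ", "man": "NOUN", "is": "VERB",
    "going": "VERB", "to": "PART", "quickly": "ADV", "walk": "VERB",
    "under": "ADP", "ladder": "NOUN",
}

def pos_tag(tokens):
    # Unknown tokens default to NOUN, a common fallback heuristic.
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

print(pos_tag(["The", "tall", "man", "is", "going", "to",
               "quickly", "walk", "under", "the", "ladder"]))
```

The obvious limitation: a pure lookup can’t tell whether “walk” is a verb or a noun in a given sentence, which is exactly why real taggers look at context.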
5. Chunking

Okay, relatively painless so far, right? Let’s plow on to the function known as chunking, or light parsing as it’s sometimes called. Chunking refers to a range of sentence-breaking systems that splinter a sentence into its component phrases (noun phrases, verb phrases, etc.).
I want to draw a quick distinction between chunking and PoS tagging before we go forward, because there aren’t many clear answers to this question on the internet. My definition goes like this:
- PoS Tagging: Assigning parts of speech to tokens
- Chunking: Assigning PoS tagged tokens to phrases
Here’s what it looks like when it works in practice. Take the sentence:
The tall man is going to quickly walk under the ladder.
The chunking process will return: [the tall man]_np [is going to quickly walk]_vp [under the ladder]_pp
Where np stands for “noun phrase,” vp stands for “verb phrase,” and pp stands for “prepositional phrase.”
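The bracketing above can be reproduced with a hand-written chunker over (token, tag) pairs. This is a sketch under toy assumptions (the tag names are mine, and real chunkers are trained rather than rule-coded), but it shows how PoS-tagged tokens get grouped into phrases:

```python
def chunk(tagged):
    chunks, i = [], 0
    while i < len(tagged):
        tok, tag = tagged[i]
        if tag == "ADP":                       # a preposition opens a pp
            j = i + 1
            while j < len(tagged) and tagged[j][1] in ("DET", "ADJ", "NOUN"):
                j += 1
            chunks.append(("pp", [t for t, _ in tagged[i:j]])); i = j
        elif tag in ("DET", "ADJ", "NOUN"):    # noun phrase
            j = i
            while j < len(tagged) and tagged[j][1] in ("DET", "ADJ", "NOUN"):
                j += 1
            chunks.append(("np", [t for t, _ in tagged[i:j]])); i = j
        elif tag in ("VERB", "PART", "ADV"):   # verb group
            j = i
            while j < len(tagged) and tagged[j][1] in ("VERB", "PART", "ADV"):
                j += 1
            chunks.append(("vp", [t for t, _ in tagged[i:j]])); i = j
        else:                                  # anything else stands alone
            chunks.append(("o", [tok])); i += 1
    return chunks

tagged = [("the", "DET"), ("tall", "ADJ"), ("man", "NOUN"),
          ("is", "VERB"), ("going", "VERB"), ("to", "PART"),
          ("quickly", "ADV"), ("walk", "VERB"),
          ("under", "ADP"), ("the", "DET"), ("ladder", "NOUN")]
for label, words in chunk(tagged):
    print(f"[{' '.join(words)}]_{label}", end=" ")
# [the tall man]_np [is going to quickly walk]_vp [under the ladder]_pp
```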
6. Syntax Parsing
We’re moving on to one of the most important steps in this process. Now, I don’t want to give you people flashbacks to high school, but the truth is that syntax parsing is just fancy talk for sentence diagramming. In short, this subfunction determines the structure of each sentence. It is a seriously critical step if we intend to run sentiment analysis on the text. This becomes clear in the following examples:
- Apple was doing poorly until Steve Jobs
- Because Apple was doing poorly, Steve Jobs
- Apple was doing poorly because Steve Jobs
In the first example, Apple is negative, whereas Steve Jobs is positive. In the second, Apple is still negative, but Steve Jobs is now neutral. In the final example, both Apple and Steve Jobs are negative. Syntax parsing is one of the most computationally intensive steps, but we’ve developed special unsupervised machine learning, trained on billions of words of input and using matrix factorization, to help us, well, cheat. Better put: to help us understand the syntax just like a human would. (Again, the details are beyond the scope of this article.)
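To see why structure matters, here’s a deliberately tiny rule set that assigns polarity based on which connective joins the two clauses. These three hand-written patterns cover only the example sentences above; an actual engine derives this from a full syntactic parse, not from string matching.

```python
import re

def clause_sentiment(sentence, subject, other):
    # Each rule keys off the connective and the clause order.
    # Entity names are matched as plain strings for simplicity.
    if re.search(rf"{subject} was doing poorly until {other}", sentence, re.I):
        return {subject: "negative", other: "positive"}
    if re.search(rf"because {subject} was doing poorly, {other}", sentence, re.I):
        return {subject: "negative", other: "neutral"}
    if re.search(rf"{subject} was doing poorly because {other}", sentence, re.I):
        return {subject: "negative", other: "negative"}
    return {}

print(clause_sentiment("Apple was doing poorly until Steve Jobs",
                       "Apple", "Steve Jobs"))
# {'Apple': 'negative', 'Steve Jobs': 'positive'}
```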
Okay, onward and upward!
7. Sentence Chaining
Now that you know everything you can about a single sentence, you need to relate sentences to one another, so that you can do things like assert sentiment from one sentence onto another.
Lexalytics uses a sentence-chaining technique called “lexical chaining.” Lexical chaining links individual sentences to each other by each sentence’s strength of association with an overall topic. Even if sentences appear many paragraphs apart in a document, the chain flows through the document and allows the overall “feel” to be detected and quantified by the machine.
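A crude sketch of the chaining idea: a sentence joins an existing chain when it shares enough content words with that chain’s running vocabulary, even across gaps. Real lexical chaining uses word relatedness (synonyms, hypernyms, thesauri), not raw overlap; the `lexical_chains` function and its threshold here are illustrative only.

```python
import re

def lexical_chains(sentences, min_overlap=1):
    chains = []  # each chain: {"vocab": set of words, "members": [indices]}
    for idx, sent in enumerate(sentences):
        # Keep only longer words as a cheap proxy for "content words."
        words = {w for w in re.findall(r"[a-z]+", sent.lower()) if len(w) > 3}
        for chain in chains:
            if len(words & chain["vocab"]) >= min_overlap:
                chain["vocab"] |= words        # grow the chain's vocabulary
                chain["members"].append(idx)
                break
        else:
            chains.append({"vocab": words, "members": [idx]})
    return chains

docs = ["The battery life is great.",
        "The screen is too dim.",
        "I charge the battery twice a day."]
for c in lexical_chains(docs):
    print(c["members"])
# [0, 2]   <- the two battery sentences chain together across the gap
# [1]
```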