Machine learning for natural language processing and text analytics involves using machine learning algorithms and “narrow” artificial intelligence (AI) to understand the meaning of text documents. These documents can be just about anything that contains text: social media comments, online reviews, survey responses, even financial, medical, legal and regulatory documents. In essence, the role of machine learning and AI in natural language processing (NLP) and text analytics is to improve, accelerate and automate the underlying text analytics functions and NLP features that turn this unstructured text into useable data and insights.
For those who don’t know me, I’m the Chief Scientist at Lexalytics. We sell text analytics and NLP solutions, but at our core we’re a machine learning company. We maintain hundreds of supervised and unsupervised machine learning models that augment and improve our systems. And we’ve spent more than a decade gathering data sets and experimenting with new algorithms.
In this article, I’ll start by exploring some machine learning approaches for natural language processing. Then I’ll discuss how to apply machine learning to solve problems in natural language processing and text analytics.
- Background: Machine Learning in the Context of Natural Language Processing
- Supervised Machine Learning for NLP
- Unsupervised Machine Learning for NLP
- Background: What is Natural Language Processing?
- ML vs NLP and Using Machine Learning on Natural Language Sentences
- Hybrid Machine Learning Systems for NLP
- Summary & Further Reading
Machine Learning for Natural Language Processing
Before we dive deep into how to apply machine learning and AI for NLP and text analytics, let’s clarify some basic ideas.
Most importantly, “machine learning” really means “machine teaching.” We know what the machine needs to learn, so our task is to create a learning framework and provide properly-formatted, relevant, clean data for the machine to learn from.
When we talk about a “model,” we’re talking about a mathematical representation. Input is key. A machine learning model is the sum of the learning that has been acquired from its training data. The model changes as more learning is acquired.
Unlike algorithmic programming, a machine learning model is able to generalize and deal with novel cases. If a case resembles something the model has seen before, the model can use this prior “learning” to evaluate the case. The goal is to create a system where the model continuously improves at the task you’ve set it.
Machine learning for NLP and text analytics involves a set of statistical techniques for identifying parts of speech, entities, sentiment, and other aspects of text. The techniques can be expressed as a model that is trained on annotated examples and then applied to other text; this is known as supervised machine learning. Alternatively, they can be a set of algorithms that work across large sets of data to extract meaning, which is known as unsupervised machine learning. It’s important to understand the difference between supervised and unsupervised learning, and how you can get the best of both in one system.
Text data requires a special approach to machine learning. This is because text data can have hundreds of thousands of dimensions (words and phrases) but tends to be very sparse. For example, the English language has around 100,000 words in common use. But any given tweet only contains a few dozen of them. This differs from something like video content, where you have very high dimensionality but also oodles and oodles of data to work with, so it’s not quite as sparse.
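To make the sparsity point concrete, here is a minimal Python sketch of a bag-of-words vector. The three documents and the helper name `bag_of_words` are invented for illustration; a real vocabulary would have tens of thousands of entries rather than seven.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return a count vector for `text` with one slot per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Toy corpus (an assumption for this sketch, not real data).
documents = [
    "the service was great",
    "the food was cold",
    "great food and great service",
]

# One dimension per distinct word seen in the corpus.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# A short text only fills a handful of those dimensions.
vector = bag_of_words("great food", vocabulary)
nonzero = sum(1 for count in vector if count > 0)
sparsity = 1 - nonzero / len(vocabulary)

print(vocabulary)
print(vector)
print(f"{sparsity:.0%} of the dimensions are zero")
```

Even in this tiny case most of the vector is zeros. With a realistic vocabulary and a tweet-length text, almost all of it would be, which is why text models usually store only the non-zero entries.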
Supervised Machine Learning for Natural Language Processing and Text Analytics
In supervised machine learning, a batch of text documents is tagged or annotated with examples of what the machine should look for and how it should interpret that aspect. These documents are used to “train” a statistical model, which is then given un-tagged text to analyze.
Later, you can use larger or better datasets to retrain the model as it learns more about the documents it analyzes. For example, you can use supervised learning to train a model to analyze movie reviews, and then later train it to factor in the reviewer’s star rating.
The most popular supervised NLP machine learning algorithms are:
- Support Vector Machines
- Bayesian Networks
- Maximum Entropy
- Conditional Random Field
- Neural Networks/Deep Learning
All you really need to know if you come across these terms is that they represent a set of machine learning algorithms that are guided along in some way by a human data scientist.
Lexalytics uses supervised machine learning to build and improve our core text analytics functions and NLP features.
Tokenization
Tokenization involves breaking a text document into pieces that a machine can understand, such as words. Now, you’re probably pretty good at figuring out what’s a word and what’s gibberish. English is especially easy. See all this white space between the words and paragraphs? That makes it really easy to tokenize. So, NLP rules are sufficient for English tokenization.
But how do you teach a machine learning algorithm what a word looks like? And what if you’re not working with English-language documents? Logographic languages like Mandarin Chinese have no whitespace.
This is where we use machine learning for tokenization. Chinese follows rules and patterns just like English, and we can train a machine learning model to identify and understand them.
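As a rough illustration of why rules are enough for English but not for Chinese, here is a small rule-based tokenizer in Python. The regular expression and the function name `tokenize` are my own choices for this sketch, not the rules used in any particular product.

```python
import re

def tokenize(text):
    """Split text into word tokens (keeping simple contractions together)
    and standalone punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

# English: whitespace and punctuation give the rule everything it needs.
english_tokens = tokenize("Don't panic, it's easy.")
print(english_tokens)

# Chinese: no whitespace, so the same rule returns the whole sentence
# as a single "word". This is where a trained model is needed.
chinese_sentence = "我喜欢自然语言处理"
chinese_tokens = tokenize(chinese_sentence)
print(chinese_tokens)
```

The same pattern that splits the English sentence cleanly gives back the Chinese sentence as one undivided token, which is the gap a learned segmentation model has to fill.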
Part of Speech Tagging
Part of Speech Tagging (PoS tagging) means identifying each token’s part of speech (noun, adverb, adjective, etc.) and then tagging it as such. PoS tagging forms the basis of a number of important Natural Language Processing tasks. We need to correctly identify Parts of Speech in order to recognize entities, extract themes, and to process sentiment. Lexalytics has a highly-robust model that can PoS tag with >90% accuracy, even for short, gnarly social media posts.
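For a feel of how a supervised PoS tagger learns from tagged text, here is a deliberately simple baseline in Python: it memorizes the most frequent tag for each word in a hand-tagged sample. The training sentences, tag names, and the fallback to NOUN are assumptions for this sketch; production taggers use far richer models and also look at the surrounding words.

```python
from collections import Counter, defaultdict

def train_tagger(tagged_sentences):
    """Learn each word's most frequent tag from hand-tagged sentences."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    return {word: counts.most_common(1)[0][0]
            for word, counts in tag_counts.items()}

def tag(tokens, model, default="NOUN"):
    """Tag each token with its learned tag; unseen words get `default`."""
    return [(token, model.get(token.lower(), default)) for token in tokens]

# Tiny hand-tagged training set (invented for illustration).
training_data = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("quick", "ADJ"), ("cat", "NOUN"), ("runs", "VERB")],
]

pos_model = train_tagger(training_data)
tagged = tag(["The", "quick", "dog", "sleeps"], pos_model)
print(tagged)
```

A baseline like this has no way to handle words whose part of speech changes with context, which is one reason real taggers are trained on much larger annotated corpora with context features.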
Named Entity Recognition
At their simplest, named entities are people, places, and things (products) mentioned in a text document. Unfortunately, entities can also be hashtags, emails, mailing addresses, phone numbers, and Twitter handles. In fact, just about anything can be an entity if you look at it the right way. And don’t get us started on tangential references.
At Lexalytics, we’ve trained supervised machine learning models on large amounts of pre-tagged entities. This approach helps us to optimize for accuracy and flexibility. We’ve also trained NLP algorithms to recognize non-standard entities (like species of tree, or types of cancer).
It’s also important to note that Named Entity Recognition models rely on accurate output from the PoS tagging models described above.
Sentiment Analysis
Sentiment analysis is the process of determining whether a piece of writing is positive, negative or neutral, and then assigning a weighted sentiment score to each entity, theme, topic, and category within the document. This is an incredibly complex task that varies wildly with context. For example, take the phrase “sick burn.” In the context of video games, this might actually be a positive statement.
Creating a set of NLP rules to account for every possible sentiment score for every possible word in every possible context would be impossible. But if you train a machine learning model on pre-scored data, it can learn to understand what “sick burn” means in the context of video gaming, versus in the context of healthcare. Unsurprisingly, each language requires its own sentiment classification model.
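Here is a minimal Python sketch of that idea: word scores are learned separately for each context from pre-scored examples, so the same phrase can come out positive in one domain and negative in another. The domains, phrases, and scores are invented for illustration, and averaging per-word scores is a stand-in for a real sentiment model.

```python
from collections import defaultdict

def train_sentiment(scored_examples):
    """Average the labelled scores seen for every (domain, word) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for domain, text, label in scored_examples:
        for word in text.lower().split():
            totals[(domain, word)] += label
            counts[(domain, word)] += 1
    return {key: totals[key] / counts[key] for key in totals}

def score_text(text, domain, model):
    """Mean learned score of the known words in `text` (0.0 if none known)."""
    known = [model[(domain, word)]
             for word in text.lower().split()
             if (domain, word) in model]
    return sum(known) / len(known) if known else 0.0

# Pre-scored training data: +1.0 is positive, -1.0 is negative (toy values).
examples = [
    ("gaming", "sick burn", 1.0),
    ("gaming", "sick combo", 1.0),
    ("gaming", "laggy match", -1.0),
    ("healthcare", "sick patient", -1.0),
    ("healthcare", "severe burn", -1.0),
    ("healthcare", "full recovery", 1.0),
]

sentiment_model = train_sentiment(examples)
gaming_score = score_text("sick burn", "gaming", sentiment_model)
healthcare_score = score_text("sick burn", "healthcare", sentiment_model)
print(gaming_score, healthcare_score)
```

Nobody wrote a rule saying “sick” is good in gaming; the difference comes entirely from the scored examples.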
Categorization and Classification
Categorization means sorting content into buckets to get a quick, high-level overview of what’s in the data. To train a text classification model, data scientists use pre-sorted content and gently shepherd their model until it’s reached the desired level of accuracy. The result is accurate, reliable categorization of text documents that takes far less time and energy than human analysis.
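As one hedged example of training a classifier on pre-sorted content, here is a small multinomial Naive Bayes categorizer in Python, using add-one smoothing. Naive Bayes is my choice for the sketch because it fits in a few lines; the categories and training texts are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log-probability for `text`."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        # Start from the prior: how common is this category overall?
        log_prob = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            count = word_counts[category][word] + 1
            log_prob += math.log(count / (total_words + len(vocabulary)))
        if log_prob > best_score:
            best_category, best_score = category, log_prob
    return best_category

# Pre-sorted training content (invented for illustration).
training_docs = [
    ("refund for a late delivery", "billing"),
    ("charged twice on my card", "billing"),
    ("app crashes on login", "technical"),
    ("cannot reset my password", "technical"),
]

category_model = train_naive_bayes(training_docs)
print(classify("charged for a refund", category_model))
print(classify("login password crashes", category_model))
```

With four training documents this is only a toy, but the workflow is the same at scale: sort some content by hand, train, check accuracy, and add or correct examples until the results are good enough.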
Unsupervised Machine Learning for Natural Language Processing and Text Analytics
Unsupervised machine learning involves training a model without pre-tagging or annotating. Some of these techniques are surprisingly easy to understand.
Clustering means grouping similar documents together into groups or sets. These clusters are then sorted based on importance and relevancy (hierarchical clustering).
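A minimal sketch of clustering without any labels, in Python: documents are merged bottom-up (single-link agglomerative clustering) based on word overlap. The Jaccard similarity measure, the 0.1 threshold, and the sample documents are assumptions chosen to keep the example short.

```python
def jaccard(a, b):
    """Share of distinct words that two documents have in common."""
    return len(a & b) / len(a | b)

def cluster(documents, threshold=0.1):
    """Single-link agglomerative clustering: repeatedly merge the two most
    similar clusters until no pair is at least `threshold` similar.
    Assumes every document contains at least one word."""
    word_sets = [set(doc.lower().split()) for doc in documents]
    clusters = [[index] for index in range(len(documents))]

    def similarity(first, second):
        return max(jaccard(word_sets[i], word_sets[j])
                   for i in first for j in second)

    while len(clusters) > 1:
        pairs = [(similarity(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(members) for members in clusters]

reviews = [
    "the battery life is great",
    "battery drains too fast",
    "shipping was slow",
    "slow shipping and late delivery",
]

groups = cluster(reviews)
print(groups)
```

The result is a list of document indices per cluster. No one told the algorithm that “battery” and “shipping” were the topics; the grouping falls out of the shared vocabulary.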
Another type of unsupervised learning is Latent Semantic Indexing (LSI). This technique identifies words and phrases that frequently occur with each other. Data scientists use LSI for faceted searches, or for returning search results that aren’t the exact search term.
For example, the terms “manifold” and “exhaust” are closely related in documents that discuss internal combustion engines. So, when you Google “manifold” you get results that also contain “exhaust”.
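The intuition behind this can be sketched in Python by counting which words share documents. To be clear, this is a simplified co-occurrence count, not full LSI (which applies a singular value decomposition to a term-document matrix); the documents and function names are invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how many documents each unordered pair of words shares."""
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    return pair_counts

def related_terms(term, pair_counts, top_n=3):
    """Words that most often appear in the same document as `term`."""
    related = Counter()
    for (first, second), count in pair_counts.items():
        if first == term:
            related[second] += count
        elif second == term:
            related[first] += count
    return [word for word, _ in related.most_common(top_n)]

# Toy document collection (an assumption for this sketch).
engine_docs = [
    "exhaust manifold gasket replacement",
    "cracked exhaust manifold repair",
    "intake manifold and exhaust manifold",
    "garden hose repair",
]

pair_counts = cooccurrence_counts(engine_docs)
print(related_terms("manifold", pair_counts, top_n=1))
```

A search system built on counts like these could expand a query for “manifold” with “exhaust” even though the user never typed it.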
Matrix Factorization is another technique for unsupervised NLP machine learning. This uses “latent factors” to break a large matrix down into the combination of two smaller matrices. Latent factors are similarities between the items.
Think about the sentence, “I threw the ball over the mountain.” The word “threw” is more likely to be associated with “ball” than with “mountain”.
In fact, humans have a natural ability to understand the factors that make something throwable. But a machine learning NLP algorithm must be taught this difference.
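Here is a small Python sketch of matrix factorization on a made-up table of how often each verb appears with each noun. It uses plain stochastic gradient descent to find two smaller matrices whose product approximates the original; the counts, the choice of two latent factors, and the learning rate are all assumptions for illustration.

```python
import random

def factorize(matrix, k=2, steps=2000, learning_rate=0.01, seed=0):
    """Approximate `matrix` (rows x cols) as P (rows x k) times Q (k x cols)
    using stochastic gradient descent on the squared reconstruction error."""
    rng = random.Random(seed)
    rows, cols = len(matrix), len(matrix[0])
    P = [[rng.random() for _ in range(k)] for _ in range(rows)]
    Q = [[rng.random() for _ in range(cols)] for _ in range(k)]
    for _ in range(steps):
        for i in range(rows):
            for j in range(cols):
                predicted = sum(P[i][f] * Q[f][j] for f in range(k))
                error = matrix[i][j] - predicted
                for f in range(k):
                    p, q = P[i][f], Q[f][j]
                    P[i][f] += learning_rate * error * q
                    Q[f][j] += learning_rate * error * p
    return P, Q

def squared_error(matrix, P, Q):
    """Sum of squared differences between `matrix` and the product P x Q."""
    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            predicted = sum(P[i][f] * Q[f][j] for f in range(len(Q)))
            total += (value - predicted) ** 2
    return total

# Hypothetical verb-by-noun co-occurrence counts.
verbs = ["threw", "kicked", "climbed", "hiked"]
nouns = ["ball", "rock", "mountain", "hill"]
counts = [
    [3, 3, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 4],
    [0, 0, 2, 2],
]

# With steps=0 we get the random starting point, for comparison.
error_before = squared_error(counts, *factorize(counts, steps=0))
P, Q = factorize(counts)
error_after = squared_error(counts, P, Q)
print(error_before, error_after)
```

Each row of `P` describes a verb in terms of two latent factors, and each column of `Q` does the same for a noun. In this toy data you would expect the factors to separate roughly into “things you throw” and “things you climb”, which is the kind of similarity the model has to learn because nobody states it explicitly.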
Unsupervised learning is tricky, but far less labor- and data-intensive than its supervised counterpart. Lexalytics uses unsupervised learning algorithms to produce some “basic understanding” of how language works. We extract certain important patterns within large sets of text documents to help our models understand the most likely interpretation.
The Lexalytics Concept Matrix™ is, in a nutshell, unsupervised learning applied to the top articles on Wikipedia™. Using unsupervised machine learning, we built a web of semantic relationships between the articles. This web allows our text analytics and NLP features to understand that “apple” is close to “fruit” and “tree”, is far away from “lion”, but is closer to “lion” than it is to “linear algebra.” Unsupervised learning, through the Concept Matrix™, forms the basis of our understanding of semantic information (discussed in the background section below).
Our Syntax Matrix™ is unsupervised matrix factorization applied to a massive corpus of content (many billions of sentences). The Syntax Matrix™ helps us understand the most likely parsing of a sentence, forming the base of our understanding of syntax (see the background section below).
Background: What is Natural Language Processing?
Natural Language Processing broadly refers to the study and development of computer systems that can interpret speech and text as humans naturally speak and type it. Human communication is frustratingly vague at times; we all use colloquialisms and abbreviations, and we don’t often bother to correct misspellings. These inconsistencies make computer analysis of natural language difficult at best. But in the last decade, both NLP techniques and machine learning algorithms have progressed immeasurably.
There are three aspects to any given chunk of text:
Semantic information is the specific meaning of an individual word. A phrase like “the bat flew through the air” can have multiple meanings depending on the definition of bat: winged mammal, wooden stick, or something else entirely? Knowing the relevant definition is vital for understanding the meaning of a sentence.
Another example: “Billy hit the ball over the house.” As the reader, you may assume that the ball in question is a baseball, but how do you know? The ball could be a volleyball, a tennis ball, or even a bocce ball. We assume baseball because baseballs are the balls most often “hit” in such a way, but without machine learning of natural language a computer wouldn’t know that.
The second key component of text is sentence or phrase structure, known as syntax information. Take the sentence, “Sarah joined the group already with some search experience.” Who exactly has the search experience here? Sarah, or the group? Depending on how you read it, the sentence has a very different meaning with respect to Sarah’s abilities.
Finally, you must understand the context that a word, phrase, or sentence appears in. What is the concept being discussed? If a person says that something is “sick”, are they talking about healthcare or video games? The implication of “sick” is often positive when mentioned in a context of gaming, but almost always negative when discussing healthcare.
ML vs NLP and Using Machine Learning on Natural Language Sentences
Let’s return to the sentence, “Billy hit the ball over the house.” Taken separately, the three types of information would return:
- Semantic information: person – act of striking an object with another object – spherical play item – place people live
- Syntax information: subject – action – direct object – prepositional phrase
- Context information: this sentence is about a child playing with a ball
These aren’t very helpful by themselves. They indicate a vague idea of what the sentence is about, but full understanding requires the successful combination of all three components.
This analysis can be accomplished in a number of ways, through machine learning models or by inputting rules for a computer to follow when analyzing text. Alone, however, these methods don’t work so well.
Machine learning models are great at recognizing entities and overall sentiment for a document, but they struggle to extract themes and topics; what’s more, they’re less-than-adept at referring sentiment back to individual entities or themes.
Alternatively, you can teach your system to identify the basic rules and patterns of language. In many languages, a proper noun followed by the word “street” probably denotes a street name. Similarly, a number followed by a proper noun followed by the word “street” is probably a street address. And people’s names usually follow generalized two- or three-word formulas of proper nouns and nouns.
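The street examples translate fairly directly into pattern rules. Here is a hedged Python sketch using regular expressions, where “a capitalized word” stands in for “proper noun” (an approximation I am making for brevity; a real system would use PoS tags instead).

```python
import re

# Rule: a number, then one or more capitalized words, then "Street".
STREET_ADDRESS = re.compile(r"\b\d+\s+(?:[A-Z][a-z]+\s+)+Street\b")

# Rule: one or more capitalized words followed by "Street".
STREET_NAME = re.compile(r"\b(?:[A-Z][a-z]+\s+)+Street\b")

def find_street_addresses(text):
    """Return substrings that look like a numbered street address."""
    return STREET_ADDRESS.findall(text)

def find_street_names(text):
    """Return substrings that look like a street name."""
    return STREET_NAME.findall(text)

sample = "Meet me at 221 Baker Street, near Oxford Street."
addresses = find_street_addresses(sample)
street_names = find_street_names(sample)
print(addresses)
print(street_names)
```

Rules like these are precise and easy to read, but every variation (“St.”, “Ave”, lowercase text on social media) needs another rule.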
Unfortunately, recording and implementing language rules takes a lot of time. What’s more, NLP rules can’t keep up with the evolution of language. The Internet has butchered traditional conventions of the English language. And no static NLP codebase can possibly encompass every inconsistency and meme-ified misspelling on social media.
Very early text mining systems were entirely based on rules and patterns. Over time, as natural language processing and machine learning techniques have evolved, an increasing number of companies offer products that rely exclusively on machine learning. But as we just explained, both approaches have major drawbacks.
That’s why at Lexalytics, we utilize a hybrid approach. We’ve trained a range of supervised and unsupervised models that work in tandem with rules and patterns that we’ve been refining for over a decade.
Hybrid Machine Learning Systems for NLP
Our text analysis functions are based on patterns and rules. Each time we add a new language, we begin by coding in the patterns and rules that the language follows. Then our supervised and unsupervised machine learning models keep those rules in mind when developing their classifiers. We apply variations on this system for low-, mid-, and high-level text functions.
Low-level text functions are the initial processes through which you run any text input. These functions are the first step in turning unstructured text into structured data; thus, they form the base layer of information on which our mid-level functions draw.
Mid-level text analytics functions involve extracting the real content of a document. This means who is speaking, what they are saying, and what they are talking about.
The high-level function of sentiment analysis is the last step, determining and applying sentiment on the entity, theme, and document levels.
- Tokenization: ML + Rules
- PoS Tagging: Machine Learning
- Chunking: Rules
- Sentence Boundaries: ML + Rules
- Syntax Analysis: ML + Rules
- Entities: ML + Rules to determine “Who, What, Where”
- Themes: Rules “What’s the buzz?”
- Topics: ML + Rules “About this?”
- Summaries: Rules “Make it short”
- Intentions: ML + Rules “What are you going to do?”
  - Intentions uses the syntax matrix to extract the intender, intendee, and intent
  - We use ML to train models for the different types of intent
  - We use rules to whitelist or blacklist certain words
  - Multilayered approach to get you the best accuracy
- Apply Sentiment: ML + Rules “How do you feel about that?”
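To show what “ML + Rules” can look like in code, here is a small hypothetical Python sketch of an intent labeller in which blacklist and whitelist rules are checked first and a statistical model handles everything else. This is not our production implementation: the function names, the labels, and the stand-in “model” are all invented for illustration.

```python
def hybrid_label(text, model, whitelist, blacklist):
    """Apply rules first, then fall back to the model.
    - Any blacklisted word suppresses the label entirely.
    - A whitelisted word forces its associated label.
    - Otherwise the statistical model decides."""
    words = set(text.lower().split())
    if words & blacklist:
        return "none"
    for word, label in whitelist.items():
        if word in words:
            return label
    return model(text)

def toy_intent_model(text):
    """Hypothetical stand-in for a trained intent classifier."""
    return "buy" if "want" in text.lower().split() else "none"

# Hypothetical rule sets.
intent_whitelist = {"cancel": "quit"}
intent_blacklist = {"joking"}

print(hybrid_label("I want a new phone", toy_intent_model,
                   intent_whitelist, intent_blacklist))
print(hybrid_label("I want to cancel my plan", toy_intent_model,
                   intent_whitelist, intent_blacklist))
print(hybrid_label("just joking I want nothing", toy_intent_model,
                   intent_whitelist, intent_blacklist))
```

The design choice here is that rules act as cheap, predictable overrides, while the model covers the long tail of phrasing that nobody has written a rule for.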
You can see how this system pans out in the chart below:
Summary & Further Reading
Language is messy and complex. Meaning varies from speaker to speaker and listener to listener. Machine learning can be a good solution for analyzing text data. In fact, it’s vital – purely rules-based text analytics is a dead-end. But it’s not enough to use a single type of machine learning model. Certain aspects of machine learning are very subjective. You need to tune or train your system to match your perspective.
The best way to do machine learning for NLP is a hybrid approach: many types of machine learning working in tandem with pure NLP code.
For best practices in machine learning for NLP, read our companion article: Machine Learning Micromodels: More Data is Not Always Better
To learn about the difference between tuning and training, and how to approach them, read our guide: Tune First, Then Train
And, to learn more about general machine learning for NLP and text analytics, read our full white paper on the subject.
Want to get started with machine learning? Try these machine learning tutorials: Over 200 of the Best Machine Learning, NLP, and Python Tutorials
Read about the difference between ML and AI: What’s The Difference Between Machine Learning And Artificial Intelligence?