(Note: Updated July 2, 2015)
What is Machine Learning?
Machine Learning (in the context of text analytics) is a set of statistical techniques for identifying some aspect of text (parts of speech, entities, sentiment, etc.). The techniques can be expressed as a model that is then applied to other text (supervised), or as a set of algorithms that work across large sets of data to extract meaning (unsupervised).
Supervised Machine Learning
Take a set of documents that have been “tagged” for some feature, like parts of speech, entities, or topics (classifiers). Use these documents as a training set to produce a statistical model, then apply this model to new text. Retrain the model on a larger or better dataset to get improved results. This is one of the types of Machine Learning that Lexalytics uses. The same “supervised” approach also covers the ongoing re-training that some models support, where a viewer gives a “star” rating and the algorithm adds that rating to its ongoing processing.
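To make the train-then-apply loop concrete, here is a minimal sketch of a supervised topic classifier. It uses a small Naive Bayes model written from scratch, which is an assumption chosen for brevity and not one of the Lexalytics models; the training sentences, labels, and function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged training set: (text, topic label).
TAGGED_DOCS = [
    ("the team won the match", "sports"),
    ("a great goal in the final game", "sports"),
    ("the election results are in", "politics"),
    ("the senate passed the bill", "politics"),
]

def train_topic_model(tagged_docs):
    """Count labels and per-label word frequencies from tagged documents."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in tagged_docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict_topic(model, text):
    """Return the label with the highest Naive Bayes log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing so unseen (label, word) pairs are not zero.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

topic_model = train_topic_model(TAGGED_DOCS)
```

Calling `predict_topic(topic_model, "who won the game")` applies the trained model to new text. Retraining is just a matter of calling `train_topic_model` again on a larger or better tagged set.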
Some model types you’ll see fairly frequently
- Support Vector Machines
- Bayesian Networks
- Maximum Entropy
- Conditional Random Fields
- Neural Networks/Deep Learning
Unsupervised Machine Learning
These are statistical techniques to tease meaning out of a collection of text without pre-training a model. Some examples are:
- Clustering: Groups “like” documents together into sets (clusters) of documents
- Latent Semantic Indexing
- Extracts important words/phrases that occur in conjunction with each other in the text
- Used for faceted search or returning search results that aren’t exactly the phrases used for searches. So if the terms “manifold” and “exhaust” are closely related in lots of documents, if you search for “manifold”, you’ll get documents back that contain the word “exhaust”
- Matrix Factorization:
- A technique that allows you to “factor” down a very large matrix into the combination of two smaller matrices, using what are called “latent factors”. Latent factors are similarities between the items. Think about it like this – if you see the word “throw” in a sentence, that’s probably going to be more likely to be associated with the word “ball” than the word “mountain” – I threw the ball over the mountain. And you know that because of your natural ability to understand the factors that make something “throwable.”
- Matrix factorization came to prominence with the Netflix® challenge – where teams competed for a million dollar prize to increase the recommendation ratings accuracy by 10%.
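The matrix factorization idea above can be sketched in a few lines. This is a toy illustration only, assuming a tiny made-up matrix of verb/noun co-occurrence counts and plain stochastic gradient descent; the names `factorize`, `reconstruct`, and `squared_error` are hypothetical, and real systems work on far larger, sparse matrices.

```python
import random

# Made-up co-occurrence counts: rows are verbs, columns are nouns.
VERBS = ["throw", "kick", "climb", "hike"]
NOUNS = ["ball", "frisbee", "mountain", "hill"]
COUNTS = [
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 4.0, 5.0],
]

def factorize(matrix, k=2, steps=2000, lr=0.01, seed=0):
    """Approximate matrix as P x Q^T, with k latent factors per row/column."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    P = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_rows)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_cols)]
    for _ in range(steps):
        for i in range(n_rows):
            for j in range(n_cols):
                pred = sum(P[i][f] * Q[j][f] for f in range(k))
                err = matrix[i][j] - pred
                for f in range(k):
                    p, q = P[i][f], Q[j][f]
                    # Gradient step on the squared error for this cell.
                    P[i][f] += lr * err * q
                    Q[j][f] += lr * err * p
    return P, Q

def reconstruct(P, Q, i, j):
    """Value of cell (i, j) implied by the two smaller matrices."""
    return sum(p * q for p, q in zip(P[i], Q[j]))

def squared_error(matrix, P, Q):
    """Total squared difference between matrix and its reconstruction."""
    return sum(
        (matrix[i][j] - reconstruct(P, Q, i, j)) ** 2
        for i in range(len(matrix))
        for j in range(len(matrix[0]))
    )
```

After `P, Q = factorize(COUNTS)`, each verb and each noun is described by just two numbers, and `reconstruct(P, Q, i, j)` estimates how strongly verb `i` goes with noun `j`. Those two numbers per item play the role of the latent factors, such as whatever makes something "throwable."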
How Lexalytics uses Machine Learning:
Lexalytics uses a combination of supervised and unsupervised machine learning.
We use supervised machine learning for a number of natural language processing tasks, including (but not limited to) the following:
- Tokenization
- English is really easy – see all those spaces? That makes it really easy to tokenize – in other words, to determine what’s a word. So we just use a simple set of rules for English tokenization.
- 中文没有空格，所以机器学习对分词很重要。 (Chinese has no whitespace, so machine learning is important for tokenization.)
- Part of Speech tagging
- We use Parts of Speech for a number of important Natural Language Processing tasks. We need to know them to recognize Named Entities, to extract themes, to process sentiment. So, Lexalytics has a highly robust model for doing PoS tagging with >90% accuracy, even for short, gnarly social media posts.
- Named Entity Recognition
- We use a machine learning model that we’ve trained on large amounts of content to recognize People, Places, Companies, and Product entities. It is important to note that the Named Entity Recognition model requires Parts of Speech as an input feature, so, this model is reliant on the Part of Speech tagging model.
- We also have another machine learning model that can be trained by our customers to recognize entity types that we don’t include in our pre-trained model. (e.g. “trees” or “types of cancer”)
- We’ve built a number of sentiment classification models for different languages
Our customers can come to us with a set of pre-tagged content (say they’ve been hand-analyzing surveys for a while, and they have a bunch of categories with associated tagged content) – and we can train up a model that exactly matches how they’ve been scoring content.
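As a rough illustration of the rule-based English tokenization mentioned in the list above, the sketch below splits text with a single regular expression. The pattern and the `tokenize` name are assumptions made for this example, not the actual Lexalytics rules.

```python
import re

# Words (optionally with internal apostrophes, as in "don't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens using simple rules."""
    return TOKEN_PATTERN.findall(text)

english_tokens = tokenize("Don't panic, it's easy!")
# The same rules see an unspaced Chinese sentence as one long "word".
chinese_tokens = tokenize("中文没有空格")
```

On the English sentence the rules separate words and punctuation cleanly. On the Chinese sentence they return a single token, because there is no whitespace to split on, which is why a trained model is needed there.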
Lexalytics uses unsupervised learning to produce some “basic understanding” of how language works. In order to interpret the meaning of a set of words, you need three things: semantics, syntax, and context. We can extract certain important patterns with large enough corpora of text in order to help us make the most likely interpretation.
- Concept Matrix™:
- Unsupervised learning done across Wikipedia™ – allows us to understand things like “apple” is close to “fruit” is close to “tree” but is far away from “lion”, but is closer to “lion” than it is to, say, “linear algebra.”
- The base of our semantic understanding
- Syntax Matrix™
- Unsupervised learning done on a massive corpus of content (many billions of sentences) using the aforementioned matrix factorization technique
- Helps us understand the most likely parsing of a sentence – the base of our understanding of syntax
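One common way to express "close to" and "far away from" numerically is cosine similarity between word vectors. The sketch below assumes that approach and uses hand-picked toy vectors; it is not a description of how the Concept Matrix is built, and the numbers are invented purely to mirror the "apple" example above.

```python
import math

# Invented 4-dimensional vectors; a real system would learn these from text.
VECTORS = {
    "apple": [0.9, 0.8, 0.1, 0.0],
    "fruit": [0.8, 0.9, 0.2, 0.0],
    "lion": [0.1, 0.2, 0.9, 0.0],
    "linear algebra": [0.0, 0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closeness(word_a, word_b):
    """Similarity between two known concepts, looked up by name."""
    return cosine(VECTORS[word_a], VECTORS[word_b])
```

With these toy vectors, `closeness("apple", "fruit")` is the largest, `closeness("apple", "lion")` is smaller, and `closeness("apple", "linear algebra")` is the smallest of the three.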
In a sentence: we develop our natural language processing capabilities by combining machine learning techniques with traditional rules and patterns across a series of low-, mid-, and high-level text functions that deduce the semantic, syntax, and context information found in any block of unstructured input text.
Read on for more detail of how we actually handle natural language processing, and how we feel that a hybrid approach between machine learning and rules is really the best way to go.
Natural Language Processing broadly refers to the study and development of computer systems that can interpret speech and text as humans naturally speak and type it (hence the “natural” part). Human communication is frustratingly vague at times; we all use colloquialisms and abbreviations, and we don’t bother to correct misspellings (especially on the internet). These inconsistencies make computer analysis of natural language difficult at best, but in the last decade NLP as a field has progressed enormously. At Lexalytics, we’re focused on the text side of things, and we’ve been developing and innovating since 2004.
There are three aspects of a given chunk of text, each of which must be understood:
Semantic information is the specific meaning of an individual word. A phrase like “the bat flew through the air” can have multiple meanings depending on the definition of bat: winged mammal, wooden stick, or something else entirely? Knowing which definition is relevant is vital for understanding the meaning of a sentence.
Another example: “Billy hit the ball over the house.” As the reader, you may assume that the ball in question is a baseball, but how do you know? The ball could be a volleyball, a tennis ball, or even a bocce ball.
The second key component of text is sentence or phrase structure, known as syntax information. Take the sentence, “Sarah joined the group already with some search experience.” Who exactly has the search experience here? Sarah, or the group? Depending on how you read it, the sentence has very different meaning with respect to Sarah’s abilities.
Finally, you must understand the context that a word, phrase, or sentence appears in. What is the concept being discussed? If a person says that something is “sick”, are they talking about healthcare or video games? The implication of “sick” is often positive when mentioned in a context of gaming, but almost always negative when discussing healthcare.
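The “sick” example can be pictured as a lookup where the context is part of the key. This is only a minimal sketch under that assumption; the table, the context labels, and the `polarity` function are hypothetical.

```python
# Polarity of a word depends on the context it appears in (+1 good, -1 bad).
CONTEXT_POLARITY = {
    ("sick", "gaming"): 1,
    ("sick", "healthcare"): -1,
}

# Fallback used when the context is unknown.
DEFAULT_POLARITY = {"sick": -1}

def polarity(word, context):
    """Return the polarity of word in context, or 0 if nothing is known."""
    key = (word.lower(), context)
    if key in CONTEXT_POLARITY:
        return CONTEXT_POLARITY[key]
    return DEFAULT_POLARITY.get(word.lower(), 0)
```

The same word yields opposite scores for `polarity("sick", "gaming")` and `polarity("sick", "healthcare")`, which is the point of tracking context at all.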
Let’s return to the sentence, “Billy hit the ball over the house.” Taken separately, the three types of information would return results that run along the lines of:
- Semantic information: person – act of striking an object with another object – spherical play item – place people live
- Syntax information: subject – action – direct object – prepositional phrase
- Context information: this sentence is about a child playing with a ball
These aren’t very helpful by themselves. They indicate a vague idea of what the sentence is about, but full understanding requires that the three components be combined.
This analysis can be accomplished in a number of ways, through machine learning models or by inputting rules for a computer to follow when analyzing text. Alone, however, these methods don’t work so well. Machine learning models are great at recognizing entities and overall sentiment for a document, but they struggle to extract themes and topics; what’s more, they’re less-than-adept at referring sentiment back to individual entities or themes.
Alternatively, you can teach your system a number of rules and patterns to identify. Much of written language follows basic rules and patterns laid down by the language the text is written in. In many languages, for example, a proper noun followed by the word “street” probably denotes the name of a street; similarly, a number followed by a proper noun followed by the word “street” is probably a street address (as opposed to, say, an email address; you can see how syntax information is important). People’s names usually follow generalized two- or three-word formulas of proper nouns and nouns.
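The two street rules can be written down almost literally as regular expressions. The sketch below approximates “proper noun” as a capitalized word, which is an assumption made to keep the example short; the pattern and function names are hypothetical.

```python
import re

# One or more capitalized words followed by "street" or "Street".
STREET_NAME = re.compile(r"\b(?:[A-Z][a-z]+ )+[Ss]treet\b")
# The same pattern preceded by a number reads as a street address.
STREET_ADDRESS = re.compile(r"\b\d+ (?:[A-Z][a-z]+ )+[Ss]treet\b")

def classify_street(text):
    """Return (kind, matched text) for the first street mention, or None."""
    match = STREET_ADDRESS.search(text)
    if match:
        return ("street address", match.group())
    match = STREET_NAME.search(text)
    if match:
        return ("street name", match.group())
    return None
```

The more specific address rule is checked first, since every address also matches the street-name rule. A capitalized word is a crude stand-in for a proper noun: a sentence-initial word directly before the street name would be swallowed into the match, which hints at how painstaking real rule sets have to be.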
But recording and implementing these rules and patterns takes an exorbitant amount of time, and you must be painstaking in your definitions. What’s more, rules and patterns cannot possibly keep up with the evolution of language: the Internet has absolutely butchered traditional conventions of the English language, and no set of rules can possibly encompass every inconsistency and new language trend that pops up in your input text.
Very early text mining systems were entirely based on rules and patterns. On the other side, as natural language processing and machine learning techniques have evolved over the last decade, an increasing number of companies have popped up offering software that relies exclusively on machine learning methods. As explained just above, these systems can only offer limited insight.
That’s why at Lexalytics, we utilize a variety of supervised and unsupervised models that work in tandem with a number of rules and patterns we’ve spent years refining. By taking this hybrid approach, our systems are highly customizable and can return the exact level of detail our customers desire.
So here’s how that hybrid system works.
Our text analysis functions are based on patterns and rules. Each time we add a new language to our capabilities, we begin by inputting the patterns and rules that language traditionally follows. Then our supervised and unsupervised machine learning models keep those rules in mind when developing their classifiers. We apply variations on this system for low-, mid-, and high-level text functions.
Low-level text functions are the initial processes any text input is run through. These functions are the first step in turning unstructured text into structured data; thus these low-level functions form the base layer of information on which our mid-level functions draw. Those mid-level text functions involve extracting the real content of a document of text, determining who is speaking, what they are saying, and what they are talking about. The high-level function of sentiment analysis is the final step, determining and applying sentiment on the entity, theme, and document levels.
- Tokenization: ML + Rules
- PoS Tagging: Machine Learning
- Chunking: Rules
- Sentence Boundaries: ML + Rules
- Syntax Analysis: ML + Rules
- Entities: ML + Rules to determine “Who, What, Where”
- Themes: Rules “What’s the buzz?”
- Topics: ML + Rules “About this?”
- Summaries: Rules “Make it short”
- Intentions: ML + Rules “What are you going to do?”
- Intentions uses the syntax matrix to extract the intender, intendee, and intent
- We use ML to train models for the different types of intent
- We use rules to whitelist or blacklist certain words
- Multilayered approach to get you the best accuracy
- Apply Sentiment: ML + Rules “How do you feel about that?”
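The “ML + Rules” combination described for Intentions above can be sketched as a model prediction wrapped in whitelist and blacklist checks. Everything here is hypothetical: `toy_intent_model` is a keyword stand-in for a trained classifier, and the word lists and threshold are invented for illustration.

```python
# Rules: blacklisted words veto any intent, whitelisted words force one.
BLACKLIST = {"never", "not"}
WHITELIST = {"preorder": "buy"}
CONFIDENCE_THRESHOLD = 0.6

def toy_intent_model(text):
    """Stand-in for a trained model: returns (intent label, confidence)."""
    words = set(text.lower().split())
    if words & {"buy", "purchase", "order"}:
        return "buy", 0.8
    if words & {"cancel", "quit", "leave"}:
        return "quit", 0.8
    return "none", 0.5

def detect_intent(text, model=toy_intent_model):
    """Combine the model's prediction with whitelist and blacklist rules."""
    words = set(text.lower().split())
    if words & BLACKLIST:
        return "none"  # a rule overrides whatever the model says
    for word, intent in WHITELIST.items():
        if word in words:
            return intent  # a rule supplies an intent the model may miss
    label, confidence = model(text)
    return label if confidence >= CONFIDENCE_THRESHOLD else "none"
```

For example, `detect_intent("I am going to buy a new phone")` follows the model, while `detect_intent("I would never buy that")` is vetoed by the blacklist. Layering the rules around the model keeps each piece simple and makes the overrides easy to audit.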
You can see how this system pans out in the flowchart below:
We’ve optimized each of these functions to return the most accurate, most reliable results. Through adoption of a hybrid approach to text analytics, utilizing machine learning models in tandem with preset rules and patterns, our text mining software is fluent in natural language of all types: even emojis, smiley faces, and acronyms. Our text analytics solutions provide more and better contextual information than any other offering on the market, and through innovative technologies like the Concept Matrix and Syntax Matrix we continue to lead the way in text analytics development.