TECH
Machine Learning
First and foremost, what is machine learning, and why is it a good thing? Machine learning is a set of statistical/mathematical tools and algorithms for training a computer to perform a specific task, for example, recognizing faces.
Two important words here are “training” and “statistical.” Training, because you are literally teaching the computer about a particular task. Statistical, because the computer is working with probabilistic math, and the chances of it getting the answer “correct” vary with the type and complexity of the question it’s being trained to answer.
There are a number of different types of machine learning algorithms, from the simple “Naïve Bayes” to “Neural Networks” to “Maximum Entropy” and “Decision Trees.” We’re more than happy to geek out with you about the advantages and disadvantages of the different types, talk about linear vs. non-linear learning and feed-forward systems, or argue about multi-layer hidden networks vs. explicitly exposing each layer.
But that’s for a different time and place; we’re going to keep things high-level here. To go one level down, we’d suggest reading both the Machine learning whitepaper and the Build vs. buy blog post.
Lexalytics is a machine learning company. We maintain dozens of supervised and unsupervised machine learning models. (Close to 40, actually.) We have invested dozens of person-years in gathering data sets, experimenting with state-of-the-art machine learning algorithms, and producing models that balance accuracy, broad applicability, and speed.
Lexalytics is not a general-purpose machine learning company. We are not providing you with generic algorithms that can be tuned for any machine-learning problem. We are entirely, completely, and totally focused on text. All of our machine learning algorithms, models, and techniques are optimized to help you understand the meaning of text content.
Text content requires special approaches from a machine learning perspective: it can have hundreds of thousands of potential dimensions (words, phrases, etc.), but it tends to be very sparse. Say there are 100,000 words in common use in the English language; any given tweet will contain only 10–12 of them. This differs from something like video content, where you also have very high dimensionality, but you have oodles and oodles of data to work with, so it’s not nearly as sparse.
Why is this an issue? Because you can’t start grouping things together and spotting trends unless you can measure the similarities between pieces of content.
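To make that concrete, here’s a toy sketch in Python – purely illustrative, nothing from the Lexalytics codebase – of how sparse those text vectors are and how similarity has to come from the handful of dimensions two tweets actually share:

```python
from collections import Counter
import math

def bow(text):
    """Toy bag-of-words vector: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (token -> count)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

tweet_a = bow("the service at this hotel was really great")
tweet_b = bow("great hotel the staff were really friendly")

# Each tweet touches fewer than a dozen of the ~100,000 words in common use:
# nearly every dimension is zero, and similarity comes from the tiny overlap.
print(len(tweet_a), len(tweet_b))          # 8 7
print(round(cosine(tweet_a, tweet_b), 2))  # roughly 0.53
```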
In order to deal with the specific complications of text, we use what’s called a “hybrid” approach. Meaning that, unlike pure-play machine learning companies, we use a combination of machine learning, lists, pattern files, dictionaries, and natural language algorithms. In other words, rather than just having a variety of hammers (different machine learning algorithms), we have a tool belt full of different sorts of tools, each one suited to the task at hand.
The term du jour seems to be “deep learning” – which is an excellent rebranding of “neural networks.” Basically, the way deep learning works is that several layers build on top of each other in order to recognize a whole. For example, when dealing with a picture, layer 1 would see a bunch of dots, layer 2 would recognize lines, layer 3 would recognize corners connecting the lines, and the top layer would recognize that the whole thing is a square.
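If you want to picture that stacking in code, here’s a toy numpy sketch – the weights are random placeholders rather than anything learned, so it’s purely an illustration of how each layer builds on the one below:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, out_dim):
    """One illustrative layer: a linear transform plus a nonlinearity.
    The weights here are random placeholders; a real network learns them."""
    w = rng.normal(size=(x.shape[-1], out_dim))
    return np.maximum(0, x @ w)  # ReLU

pixels = rng.normal(size=(1, 64))   # layer 0: raw "dots"
edges = layer(pixels, 32)           # layer 1: simple patterns (e.g. lines)
corners = layer(edges, 16)          # layer 2: combinations of those patterns
shape_score = layer(corners, 1)     # top layer: one judgement about the whole
print(shape_score.shape)            # (1, 1)
```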
That explanation is an abstraction of what happens inside deep learning for text – the internal layers are opaque math. We have taken a different approach that we believe is superior to neural networks/deep learning: explicitly layered extraction. We have a multi-layered process for preparing the text that helps reduce the sparseness and dimensionality of the content, but as opposed to the hidden layers in a deep learning model, our layers are explicit and transparent. You can get access to every one of them and understand exactly what is happening at each step.
To give you an idea of the machine learning models we maintain: just to process a document in English, we use the following models:
- Part of Speech tagging
- Chunking
- Sentence Polarity
- Concept Matrix (Semantic Model)
- Syntax Matrix (Syntax Parsing)
All of those models help us deal with the dimensionality/sparseness problem described above. But we also have to actually extract things, so we’ve got additional models for the following (there’s a rough sketch of how these stages fit together after the list):
- Named Entity Extraction
- Anaphora Resolution (Associating pronouns with the right words)
- Document Sentiment
- Intention Extraction
- Categorization
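Just to make those stages concrete, here’s a stand-in pipeline using the open-source NLTK library – not our engine and not our models – that shows what a few of them (tokenization, Part of Speech tagging, chunking, named entity extraction) produce:

```python
# A stand-in pipeline using open-source NLTK, purely to illustrate the stages.
# This is NOT the Lexalytics implementation.
import nltk
# The first run may require nltk.download() calls for the tokenizer,
# tagger, and chunker data.

text = "Lexalytics processed the hotel reviews from Boston last week."

tokens = nltk.word_tokenize(text)   # tokenization
tags = nltk.pos_tag(tokens)         # Part of Speech tagging
tree = nltk.ne_chunk(tags)          # chunking + named entity extraction

print(tags[:4])   # e.g. [('Lexalytics', 'NNP'), ('processed', 'VBD'), ...]
print(tree)       # phrase/entity structure built on top of the tags
```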
For other languages, like Mandarin Chinese, we first have to figure out where one word ends and the next begins, so we need to “tokenize” – which is yet another machine learning task.
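For a feel of why that’s a modeling problem at all, here’s a tiny illustration using the open-source jieba segmenter as a stand-in (it is not our tokenizer): whitespace splitting gets you nowhere, while a trained segmenter recovers the words.

```python
# Illustration only: jieba is an open-source Chinese segmenter, not Lexalytics' tokenizer.
import jieba

text = "我今天去了北京"   # "I went to Beijing today" -- no spaces between words

print(text.split())       # ['我今天去了北京']  whitespace splitting finds one "word"
print(jieba.lcut(text))   # e.g. ['我', '今天', '去', '了', '北京']
```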
Some of our customers, particularly in the market analytics and customer experience management spaces, have been hand-coding categories of content for years. That means they have a lot of content already bucketed into different categories – which in turn means they have a great training set for a machine-learning-based classifier. We can build that for you, too!
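As a rough sketch of what that looks like – using scikit-learn as a stand-in rather than our own tooling, and with made-up documents and categories – training a classifier from hand-coded buckets can be as simple as:

```python
# Sketch only: scikit-learn stand-in, invented example data and categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "the room was clean and the staff were friendly",
    "my flight was delayed for three hours",
    "the checkout line at the store was very long",
]
labels = ["hotels", "airlines", "retail"]   # the hand-coded categories

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)

print(clf.predict(["the staff were friendly and the room was spotless"]))
# -> ['hotels'], most likely
```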
But – and this is a really big but – it is inefficient to do every task with the same tool. That’s why we also have dictionaries, pattern files, and all sorts of other good stuff like that. To sum up why we use a hybrid approach, take the following example. Say you’ve trained a sentiment classifier on 50,000 documents, and it does a pretty good job of agreeing with a human as to whether something is positive, negative, or neutral. Awesome!
What happens when a review comes in that it scores incorrectly? There are two approaches: sometimes you have a feedback loop, and sometimes you have to collect a whole new corpus of content and retrain the model.
Even in the case of the feedback loop, the behavior of the model isn’t going to change immediately, and the change can be unpredictable – because all you’re telling it is “this document was scored incorrectly, it should be positive,” and the model is going to adjust based on every word in that document that it actually knows about, not just the ones that caused the mistake.
In other words, it’s like you’ve got a big ocean liner. You can start to turn it, but it’s going to take a while and a lot of feedback before it turns. In our approach, you simply look to see what phrases were marked positive and negative, change them as appropriate, and then you’re done. The behavior changes instantly.
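Here’s an illustrative sketch (not our actual scoring engine) of why that edit takes effect instantly – fixing a phrase is a dictionary change, not a retraining job:

```python
# Illustrative phrase-sentiment dictionary; weights and phrases are made up.
phrase_sentiment = {
    "sick": -0.6,        # originally treated as negative
    "great": 0.7,
    "terrible": -0.8,
}

def score(text):
    """Sum the sentiment of any known phrases found in the text."""
    return sum(w for phrase, w in phrase_sentiment.items() if phrase in text.lower())

review = "that concert was sick"
print(score(review))            # -0.6 -> scored incorrectly as negative

phrase_sentiment["sick"] = 0.5  # one explicit edit; no corpus, no retraining
print(score(review))            # 0.5 -> the behavior changes immediately
```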
We like to think of it as the best of both worlds, and we think you will too.