I'm going to write a few blog articles showing how machine learning and natural language processing techniques work together inside Lexalytics software.
What is Machine Learning?
Machine Learning (in the context of text analytics) is a set of statistical techniques for identifying some aspect of text (parts of speech, entities, sentiment, etc.). The techniques can take the form of a model that is trained on example text and then applied to other text (supervised), or of a set of algorithms that work across large sets of data to extract meaning without any training step (unsupervised).
Supervised Machine Learning
Take a bunch of documents that have been “tagged” for some feature, like parts of speech, entities, or topics (classifiers). Use these documents as a training set that produces a statistical model. Apply this model to new text. Retrain the model on a larger or better dataset to get improved results. This is the type of Machine Learning that Lexalytics uses. The same “supervised” approach also covers the kind of re-training that can happen with some models, where a viewer gives a “star” rating and the algorithm folds that rating into its ongoing processing.
Some model types you’ll see fairly frequently:
- Support Vector Machines
- Bayesian Networks
- Maximum Entropy
- Conditional Random Field
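Here is a rough sketch of the tag, train, apply loop described above. It is only an illustration: it uses a multinomial Naive Bayes classifier (about the simplest Bayesian model there is) rather than any of the model types Lexalytics actually ships, and the training sentences, labels, and function names are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real system would do much more."""
    return text.lower().split()


def train(tagged_docs):
    """Build a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in tagged_docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def classify(model, text):
    """Return the most probable label for text, using add-one smoothing."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)
        label_total = sum(model["words"][label].values())
        for word in tokenize(text):
            count = model["words"][label][word]
            score += math.log((count + 1) / (label_total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-tagged training set (topic classification).
training_set = [
    ("the team won the match last night", "sports"),
    ("the striker scored a late goal", "sports"),
    ("shares fell as the market closed", "finance"),
    ("the bank raised interest rates", "finance"),
]

model = train(training_set)
prediction = classify(model, "the market reacted to interest rates")
print(prediction)  # finance
```

Retraining is just a matter of calling `train` again with a larger or better tagged set and swapping in the new model.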
Unsupervised Machine Learning
These are statistical techniques to tease meaning out of a collection of text without pre-training a model. Some examples are:
- Clustering: Groups “like” documents together into sets (clusters) of documents
- Latent Semantic Indexing
  - Extracts important words/phrases that occur in conjunction with each other in the text
  - Used for faceted search, or for returning search results that don’t contain the exact phrases used in the search. So if the terms “manifold” and “exhaust” are closely related in lots of documents, then a search for “manifold” will also return documents that contain the word “exhaust”
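To make the “manifold”/“exhaust” example concrete, here is a small sketch. Fair warning: this is not Latent Semantic Indexing proper, which factorizes a term-document matrix (typically with a singular value decomposition). It just counts which terms share documents, which is enough to show the retrieval effect. The corpus, the `min_docs` threshold, and the function names are assumptions made for the illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus; each string stands in for a document.
documents = [
    "cracked exhaust manifold replaced",
    "exhaust manifold gasket leak",
    "manifold bolts near the exhaust pipe",
    "exhaust smoke on cold start",
    "brake pads worn",
]

doc_terms = [set(doc.split()) for doc in documents]

# Count how many documents each pair of terms shares.
pair_counts = Counter()
for terms in doc_terms:
    for a, b in combinations(sorted(terms), 2):
        pair_counts[(a, b)] += 1


def related_terms(term, min_docs=3):
    """Terms that co-occur with `term` in at least `min_docs` documents."""
    related = set()
    for (a, b), count in pair_counts.items():
        if count >= min_docs and term in (a, b):
            related.add(b if a == term else a)
    return related


def search(term):
    """Return indices of documents matching the term or a related term."""
    query = {term} | related_terms(term)
    return [i for i, terms in enumerate(doc_terms) if terms & query]


print(related_terms("manifold"))  # {'exhaust'}
print(search("manifold"))         # [0, 1, 2, 3]
```

Document 3 never mentions “manifold”, but it comes back anyway because “exhaust” co-occurs with “manifold” often enough across the collection. Notice that nothing was tagged or trained up front; the relationship comes entirely from the documents themselves.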
How Lexalytics uses Machine Learning:
Lexalytics uses only Supervised Machine Learning, because our system is used for reporting and trending applications as well as for information retrieval. Unsupervised Machine Learning tends not to be very useful for trending and reporting: the clusters, and any information reported from them, can change whenever the document set changes, and the Unsupervised algorithms can’t work on single documents. So, if the same document appears in set A and in set B, the clusters you retrieve for it in set B can differ from those in set A, even though the document itself is exactly the same.
There are four primary applications for which we use Supervised Machine Learning:
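The set A versus set B point is easy to see with a toy example. In the sketch below, each “document” is reduced to a single made-up number (think of it as some one-dimensional feature score), and the clustering is a tiny two-cluster k-means written just for this illustration. Nothing here reflects a real Lexalytics algorithm.

```python
def two_means(values, rounds=10):
    """Tiny 1-D k-means with k=2, seeded at the smallest and largest value."""
    centroids = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(rounds):
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


def cluster_mates(values, target):
    """Other members of the cluster that contains `target`."""
    for cluster in two_means(values):
        if target in cluster:
            return sorted(v for v in cluster if v != target)
    return []


document = 5                  # the same document in both sets
set_a = [0, 1, 5, 10, 11]     # hypothetical collection A
set_b = [5, 10, 11, 30, 31]   # hypothetical collection B

print(cluster_mates(set_a, document))  # [0, 1]
print(cluster_mates(set_b, document))  # [10, 11]
```

The document never changed, yet it lands with different neighbors depending on what else is in the collection. A trained model, by contrast, gives the same answer for the same document no matter what it is processed alongside.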
- Part of Speech tagging
  - We use Parts of Speech for a number of important Natural Language Processing tasks (more on that in the next blog post): we need them to recognize Named Entities, to extract themes, and to process sentiment. So, Lexalytics has a highly robust model for doing PoS tagging.
- Named Entity Recognition
  - We use a Maximum Entropy model that we’ve trained on large amounts of news content to recognize People, Places, Companies, and Product entities. It is important to note that the Named Entity Recognition model requires Parts of Speech as an input feature, so this model is reliant on the Part of Speech tagging model.
  - We also have a Conditional Random Field model that our customers can train to recognize entity types that we don’t include in our Maximum Entropy model (e.g. “trees” or “types of cancer”).
- Document Sentiment
  - This is not a commonly used feature, but we do have the ability to gauge document sentiment based on a statistical model.
- Sentence-based Sentiment for non-English languages
  - We use phrase-based sentiment in English, because we have access to certain lexical resources there that we don’t have in other languages. For other languages, we’ve developed a sentence-based statistical model for sentiment.
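The dependency noted above, where Named Entity Recognition takes Parts of Speech as an input feature, might look something like this in code. Treat it as a sketch under loud assumptions: the lookup-table “tagger”, the tag names, and the feature names are all hypothetical stand-ins, not Lexalytics' actual models or feature set.

```python
# Hypothetical lookup table standing in for a trained PoS model.
TOY_POS_LEXICON = {
    "the": "DET", "visited": "VERB", "in": "ADP", "yesterday": "ADV",
}


def pos_tag(tokens):
    """Toy tagger: lexicon lookup; capitalized unknown words become PROPN."""
    tags = []
    for token in tokens:
        tag = TOY_POS_LEXICON.get(token.lower())
        if tag is None:
            tag = "PROPN" if token[0].isupper() else "NOUN"
        tags.append(tag)
    return tags


def ner_features(tokens, tags, i):
    """Features for token i, as an entity model might receive them."""
    return {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "pos": tags[i],  # output of the PoS step
        "prev_pos": tags[i - 1] if i > 0 else "START",
        "next_pos": tags[i + 1] if i < len(tokens) - 1 else "END",
    }


tokens = "Alice visited Boston yesterday".split()
tags = pos_tag(tokens)  # step 1: PoS tagging
features = [ner_features(tokens, tags, i) for i in range(len(tokens))]  # step 2

print(tags)         # ['PROPN', 'VERB', 'PROPN', 'ADV']
print(features[2])  # features for "Boston"
```

Because the entity features are built from the tagger's output, any mistake in the PoS step flows straight into entity recognition, which is why a robust PoS model matters so much.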
Next up: Natural Language Processing