BERT Explained: Next-Level Natural Language Processing

Lexalytics, an InMoment company, routinely integrates the latest open source NLP advancements into our technology stack. Most recently, a new transfer learning technique called BERT (short for Bidirectional Encoder Representations from Transformers) made big waves in the NLP research space. Fundamentally, BERT excels at handling what might be described as “context-heavy” language problems.

BERT NLP In a Nutshell

Historically, Natural Language Processing (NLP) models struggled to differentiate words based on context. For example:

He wound the clock.

Her mother’s scorn left a wound that never healed.

Previously, text analytics relied on shallow embedding methods. In this case, “embedding” is the process of mapping a discrete value (like the word “wound”) onto a continuous vector. Within these traditional embedding methods, a given word could only be assigned one vector. In other words, the single vector for “wound” had to carry information about clocks as well as everything to do with injuries. BERT is different: it assigns a vector to a word only after reading the entire sentence, so the surrounding context shapes the representation.
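To make the limitation concrete, here is a toy sketch of a static embedding lookup. The vectors and vocabulary are invented purely for illustration; the point is that a word2vec/GloVe-style table can only ever return one vector per word, no matter the sentence:

```python
# Toy static embedding table (invented values, for illustration only).
# In word2vec/GloVe-style methods, each word maps to exactly one vector.
static_embeddings = {
    "wound": [0.4, -0.2, 0.7],   # one vector must cover both senses
    "clock": [0.9, 0.1, -0.3],
    "heal": [-0.5, 0.6, 0.2],
}

# "He wound the clock."  vs.  "The wound never healed."
vec_in_clock_sentence = static_embeddings["wound"]   # same lookup...
vec_in_injury_sentence = static_embeddings["wound"]  # ...same vector

# The two senses are indistinguishable to any downstream model.
print(vec_in_clock_sentence == vec_in_injury_sentence)  # prints True
```

A contextual model like BERT instead produces a different vector for each occurrence of “wound,” computed from the whole sentence.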

So How Does It Work?

BERT takes an entirely different approach to learning. Basically, BERT is given billions of sentences at training time. It’s then asked to predict a random selection of missing words from these sentences. After practicing on this corpus of text several times over, BERT develops a pretty good understanding of how a sentence fits together grammatically. It also gets better at predicting which ideas are likely to show up together. This is why it excels at dealing with homonyms, like “wound.”
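The data-preparation step of that training objective can be sketched in a few lines. This is only the masking side of masked language modeling, not BERT itself, and the function name and mask rate are illustrative (BERT’s published recipe hides roughly 15% of tokens):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Hide a random subset of tokens; the model must predict them back."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok       # remember the hidden word (the label)
            masked[i] = "[MASK]"   # replace it in the model's input
    return masked, targets

sentence = "he wound the clock before bed".split()
masked, targets = mask_tokens(sentence)
print(masked)   # some tokens replaced by [MASK]
print(targets)  # the words the model is asked to recover
```

During pretraining, BERT sees the masked version and is scored on how well it recovers the original words from the surrounding context.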

BERT Accelerates NLP Model Building

“Language modeling – although it sounds formidable – is essentially just predicting words in a blank.”

Keita Kurita, Computational Data Science Post-Graduate, Carnegie Mellon University

BERT is open source, and all of this encoded information comes along when you deploy it. This makes it a great asset for building models! Because BERT arrives pre-trained, you can achieve state-of-the-art accuracy, or match the accuracy of older algorithms, with about a tenth of the training data.

If you’d like to learn more about the deep learning behind Lexalytics and NLP in general, read our in-depth explanation of Machine Learning for Natural Language Processing.