Build or Buy for Natural Language Processing?

Let’s say you’re an engineer or data scientist tasked with building a customized natural language processing (NLP) application or adding NLP to your company’s products. Or maybe you’re a data analyst who needs to get useful insights from a bunch of unstructured text (surveys, comments, etc.). Should you build or buy?

In this article I’ll outline the dilemma and the pros and cons of both options.

TL;DR Version for Busy People

Building your own NLP system:

  • Cost: $200,000+ (hiring an engineer with NLP skills + other developers)
  • Time: months to years
  • Usefulness: very limited without major additional work
  • Headaches: massive

Working with an experienced NLP vendor:

  • Cost: $20,000 (basic text analytics and visualization) to $150,000+ (semi-custom application)
  • Time: weeks to months
  • Usefulness: customized to your specific needs
  • Headaches: none (unless you’re dehydrated)

Summary: If your goal is a customized NLP application that solves a specific challenge for your company, or if you just need to get the best possible insights from text data as quickly as possible, work with an experienced NLP vendor rather than trying to build your own system.

Now, let’s explore this in more depth.

What is a “Build or Buy” Question?

A build or buy question is a choice of whether to build your own version of something, or buy a pre-built (“off-the-shelf”) solution from another company.

You and I face a lot of build or buy questions every week. The decisions we make each time depend on the details of each circumstance. But there are some common threads we can follow.

For example, a while back, I decided to build my own computer.

I did some research, went shopping, and then did some more Googling to figure out how to put the parts together.

It was a lot of fun (partly because I’m a huge electronics geek). But am I writing this article from that computer?

No. I’m on my store-bought MacBook.

Because when I need to get work done, I can’t wait until I’ve figured out how to troubleshoot the blue screen of death or whatever other issue is currently plaguing my cobbled-together machine.

Now, if I were only interested in how computers are built, or if I wanted to become a computer repair technician, it might be worth my time to fix the computer I built myself.

But when I get to work, my goal isn’t to learn how to become a computer repairman; it’s to open a Word file and get to work writing this article. For that, I need a computer that just works.

Alternatively, think about building (or fixing up) versus buying a car.

Sure, it might be fun to build my own internal combustion engine from scratch. But I need to get to the grocery store today.

In fact, the same basic logic applies to most “build or buy” questions. Which brings us to your choice to build or buy for text analytics.

The Barebones Basics of Text Analytics

Much like a car, any text analytics system worth its salt involves a huge number of complex moving parts. When you buy an off-the-shelf solution, a lot of these are hidden (unless you go rooting around the engine bay).

But if you’re going to build a text analytics system from scratch, you have to account for all of them. Here are the 7 basic functions of text analytics:

  1. Language Identification
  2. Tokenization
  3. Sentence Breaking
  4. Part of Speech Tagging
  5. Chunking
  6. Syntax Parsing
  7. Sentence Chaining
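
To make the first two or three of these functions concrete, here’s a minimal pure-Python sketch of naive sentence breaking and tokenization. This is a toy illustration only — real systems (including the stack described below) use far more sophisticated, trained components:

```python
import re

def sentence_break(text):
    # Naive sentence breaking: split after ., ! or ? followed by whitespace.
    # (Real sentence breakers must handle abbreviations, quotes, etc.)
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    # Naive tokenization: words (with optional apostrophe contractions)
    # and punctuation marks become separate tokens.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

for s in sentence_break("The service was great. I'll come back!"):
    print(tokenize(s))
# ['The', 'service', 'was', 'great', '.']
# ["I'll", 'come', 'back', '!']
```

Even this toy version hints at the complexity: contractions, abbreviations, and punctuation all need special handling, and that’s before part-of-speech tagging or parsing even begins.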

Each of these serves a vital role in accomplishing larger natural language processing features:

  • Sentiment analysis
  • Named entity recognition
  • Categorization (topics and themes)
  • Intention extraction
  • Summarization

Here’s a simplified view of the text analytics/NLP feature stack at Lexalytics:

Fig. 1 – Lexalytics’ text analytics technology and NLP feature stack, showing the layers of processing each text document goes through to be transformed into structured data.

All this, and we haven’t even begun to discuss the role of machine learning in natural language processing.

Language identification, part-of-speech tagging, and named entity recognition all require machine learning models to achieve reasonable accuracy. Each model must be trained on a data set consisting of tens of thousands of hand-tagged documents.

Now, if you want to analyze more than one language? Every new language will require its own trained model. And you’ll have to keep updating them as people start using words in new and weird ways.
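
For a sense of why each language needs its own resources, here’s a toy language identifier that guesses by stop-word overlap. The tiny stop-word lists are illustrative assumptions — production systems use trained character n-gram or neural models per language:

```python
# Toy language identification by stop-word overlap.
# These stop-word lists are deliberately tiny and incomplete.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "es": {"el", "la", "y", "es", "de", "en"},
    "fr": {"le", "la", "et", "est", "de", "en"},
}

def detect_language(text):
    words = set(text.lower().split())
    # Pick the language whose stop-word list overlaps the text the most.
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(detect_language("the cat is in the house"))  # en
print(detect_language("el gato es de la casa"))    # es
```

Every language you add means another word list (or, in a real system, another trained model) to build, validate, and keep current.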

Point is, text analytics is a complicated beast. But let’s try to simplify.

The Cost of Building a Basic Sentiment Analysis System

If you’re satisfied with a barebones tool, it’s not that difficult to configure a basic rules-based sentiment analysis system.

(What is rules-based sentiment analysis?)

In fact, if you work from open source or buy a pre-tagged sentiment library and ignore some of the more complex text analytics functions, it’ll go pretty quickly.

But no matter what, we estimate that it’ll take you at least 12 to 18 weeks to generate a basic document sentiment score with any acceptable reliability.

To make matters worse, it turns out that purely rules-based sentiment analysis has a lot of drawbacks. For example, sentiment scores can be extremely misleading when taken out of context. Which means that document-level sentiment on its own can be dangerous.
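
As a toy illustration of both how a rules-based scorer works and why it misleads without context, consider this sketch. The lexicon and scores are hypothetical, for illustration only:

```python
# Toy rules-based sentiment scorer: sum per-word scores from a lexicon.
# This hypothetical lexicon stands in for the hand-tagged sentiment
# libraries a real rules-based system would use.
LEXICON = {"great": 1.0, "love": 1.0, "terrible": -1.0, "slow": -0.5}

def sentiment_score(text):
    words = text.lower().replace(".", "").split()
    return sum(LEXICON.get(w, 0.0) for w in words)

print(sentiment_score("The food was great"))      # 1.0
print(sentiment_score("The food was not great"))  # 1.0
```

Note the second result: the negation is invisible to a purely word-level scorer, so “not great” scores exactly as positive as “great.” Handling negation, sarcasm, and context is where the real engineering effort goes.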

In the end, to build a useful sentiment analysis system, you’ll need to use both natural language processing and machine learning. And that’ll require a heavy investment.

According to Glassdoor, the average salary for a US-based NLP engineer is more than $80,000 (not including benefits and bonuses).

And hiring a data scientist who can handle the machine learning/AI side of things will be even more costly – well into six figures.

So, to build your own semi-custom NLP application that’s actually capable of delivering what you need, you’re looking at:

  • Cost: $200,000+ (hiring an NLP engineer + other developers)
  • Time: months to years
  • Usefulness: very limited without major additional work
  • Headaches: massive

The Benefits of Working With an NLP Vendor

Why invest hundreds of thousands and months or years building an in-house system, when you can get better results today by working with an experienced NLP vendor?

Companies like Lexalytics offer cloud APIs and on-premise software libraries that are built for easy integration. DataSift, for example, integrated our on-premise Salience engine in just 4 days.

Related article: How to Choose an AI Vendor: 4 Questions to Ask

What’s more, companies like Lexalytics combine established NLP platforms with experienced professional services teams. Instead of just selling you a suit off the rack, we’ll tailor it to meet your exact requirements and goals (hence “semi-custom application”).

When you let someone else do the work of paying constant attention to all the moving parts that make up a text analytics system, you have a lot more time to build cool products and take care of customers. In short, you have a lot more time to do your job.

What’s more, most of these pre-built offerings have already been heavily optimized for scalability.

Our own on-premise Salience and cloud Semantria products are great examples. Both tools process billions of documents a day around the world, and they do it fast. We’re able to handle these loads because we’ve spent years tweaking and tuning our software to handle complex language as efficiently as possible.

Lastly, a good off-the-shelf text analytics system will offer easy customization and tuning. Salience and Semantria, for example, give you access to the underlying configuration files. This means you can shape our solutions to slot right into your platform, dashboard or application.

Buying an off-the-shelf NLP/text analytics solution:

  • Cost: $20,000 (simple, low-volume cloud processing) to $150,000+ (complex semi-custom application)
  • Time: weeks to months
  • Usefulness: customized to your specific needs
  • Headaches: none (unless you’re dehydrated)

Build or Buy For Natural Language Processing?

So, should you build or buy for text analytics? Let’s write up some quick pros and cons.

Building a natural language processing system

Pros: great opportunity to learn as much as you can about text analytics and natural language processing

Cons: time-consuming and extremely expensive, particularly in the long run; probably won’t result in a useable text analytics system

Buying an NLP solution/working with a vendor

Pros: fast results; cost-effective; frees you up to focus on solving other business problems

Cons: maybe not as fun (if you’re interested in learning how NLP works)

Summary

If you need the best possible insights from your text data as quickly and as cheaply as possible, choose an off-the-shelf text analytics/NLP solution (in the cloud or on-premise depending on your needs). This choice will free you up to worry about other business problems, such as increasing revenue, reducing churn, and building cool products.

If you’d like further discussion of whether you should build or buy for text analytics, download our white paper: Build or Buy for Text Analytics

Related article: Cloud or On-Premise for Data Analytics?
