Build or Buy for Natural Language Processing?


Let’s say you’re an engineer or data scientist tasked with adding natural language processing (NLP) to your company’s products or building a custom NLP application. Or maybe you’re a data analyst who needs to perform your own NLP analyses on a bunch of unstructured text (surveys, comments, etc.). Should you build your own NLP system or buy “off-the-rack”?

In this article I’ll outline the dilemma and the pros and cons of both options.

TL;DR Version for Busy People

Building your own NLP system from the ground up:

  • Cost: $200,000+ (hiring an engineer with NLP skills + other developers)
  • Time: months to years
  • Usefulness: very limited without major additional work
  • Headaches: massive

Working with an experienced NLP vendor:

  • Cost: $20,000 (basic text analytics and visualization) to $150,000+ (semi-custom NLP application)
  • Time: weeks to months
  • Usefulness: customized to your specific needs
  • Headaches: none (unless you’re dehydrated)

Summary: If you need to build a custom NLP application to address a specific challenge for your company, or if you need to get detailed insights from complex text data, work with an experienced NLP vendor. If you just need basic document-level sentiment analysis on a handful of documents, go with a basic off-the-shelf tool (or, if you’re feeling adventurous, try to patch together your own system from various open source components).

Now, let’s explore this in more depth.

What is a “Build or Buy” Question?

A build or buy question is a choice between building your own version of something and buying a pre-built (“off-the-shelf”) solution from another company.

You and I face a lot of build or buy questions every week. The decisions we make each time depend on the details of each circumstance. But there are some common threads we can follow.

For example, a while back, I decided to build my own computer.

I did some research, went shopping, and then did some more Googling to figure out how to put the parts together.

It was a lot of fun (partly because I’m a huge electronics geek). But am I writing this article from that computer?

No. I’m on my store-bought MacBook.

Because when I need to get work done, I can’t wait until I’ve figured out how to troubleshoot the blue screen of death or whatever other issue is currently plaguing my cobbled-together machine.

Now, if I were only interested in how computers are built, or if I wanted to become a computer repair technician, it might be worth my time to fix the computer I built myself.

But when I get to work, my goal isn’t to learn how to become a computer repairman; it’s to open a Word file and get to work writing this article. For that, I need a computer that just works.

Alternatively, think about building (or fixing up) versus buying a car.

Sure, it might be fun to build my own internal combustion engine from scratch. But I need to get to the grocery store today.

In fact, the same basic logic applies to most “build or buy” questions. Which brings us to your choice to build or buy for text analytics.

The Barebones Basics of Text Analytics

Much like a car, any text analytics system worth its salt involves a huge number of complex moving parts. When you buy an off-the-shelf solution, a lot of these are hidden (unless you go rooting around the engine bay).

But if you’re going to build a text analytics system from scratch, you have to account for all of them. Here are the 7 basic functions of text analytics:

  1. Language Identification
  2. Tokenization
  3. Sentence Breaking
  4. Part of Speech Tagging
  5. Chunking
  6. Syntax Parsing
  7. Sentence Chaining

Each of these serves a vital role in accomplishing larger natural language processing features:

  • Sentiment analysis
  • Named entity recognition
  • Categorization (topics and themes)
  • Intention extraction
  • Summarization

Here’s a simplified view of the text analytics/NLP feature stack at Lexalytics:

[Lexalytics NLP Technology Stack.png]

Fig. 1 – Lexalytics’ text analytics technology and NLP feature stack, showing the layers of processing each text document goes through to be transformed into structured data.

All this, and we haven’t even begun to discuss the role of machine learning in natural language processing.

Language identification, part of speech tagging, and named entity recognition all require machine learning models to achieve reasonable accuracy. Each model must be trained on a data set consisting of tens of thousands of hand-tagged documents.
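For a sense of what “trained on hand-tagged data” means in the simplest possible terms, here’s a toy Naive Bayes text classifier in plain Python — a deliberately tiny sketch of mine with four made-up “documents,” where a real model needs tens of thousands:

```python
import math
from collections import Counter

# Four hand-labeled training "documents." A production model would be
# trained on tens of thousands of these, tagged by human annotators.
TRAIN = [
    ("pos", "great product love it"),
    ("pos", "works great very happy"),
    ("neg", "terrible product hate it"),
    ("neg", "awful very unhappy"),
]

def train(examples):
    # Count word occurrences per label, and collect the overall vocabulary.
    counts = {"pos": Counter(), "neg": Counter()}
    for label, text in examples:
        counts[label].update(text.split())
    vocab = set(w for c in counts.values() for w in c)
    return counts, vocab

def classify(text, counts, vocab):
    # Pick the label with the highest smoothed log-probability.
    best_label, best_logp = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        logp = sum(
            math.log((c[w] + 1) / (total + len(vocab)))  # add-one smoothing
            for w in text.split()
        )
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

counts, vocab = train(TRAIN)
print(classify("love this great product", counts, vocab))  # pos
```

The algorithm is trivial; the expensive part is the labeled data — which is precisely why each new language or domain multiplies the cost.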

Now, if you want to analyze more than one language? Every new language will require its own trained model. And you’ll have to keep updating them as people start using words in new and weird ways.

Point is, text analytics is a complicated beast. But let’s try to simplify.

The Cost of Building a Basic Sentiment Analysis System

If you’re satisfied with a barebones tool, it’s not that difficult to configure a basic rules-based sentiment analysis system.

(What is rules-based sentiment analysis?)

In fact, if you work from open source or buy a pre-tagged sentiment library and ignore some of the more complex text analytics functions, it can go pretty quickly. Open source can be a solid route if your needs are few and you have the technical knowledge required.

But anything beyond a barebones document-level sentiment scoring system could easily take you 12 to 18 weeks or more to build, test, and tune.

To make matters worse, it turns out that barebones sentiment analysis has a lot of drawbacks. For example, sentiment scores can be extremely misleading when taken out of context. Which means that purely rules-based document-level sentiment can be downright dangerous.
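Here’s roughly what a rules-based document-level scorer looks like under the hood — a toy lexicon of my own invention, not any actual vendor’s rule set. Note how it cheerfully ignores the negation in the second example:

```python
# Toy rules-based sentiment: sum per-word scores from a tiny hand-built
# lexicon. Real lexicons have thousands of entries plus rules for
# negators, intensifiers, sarcasm markers, and so on.
LEXICON = {"great": 1.0, "love": 1.0, "terrible": -1.0, "awful": -1.0}

def document_sentiment(text):
    # Lowercase, strip basic punctuation, and sum word-level scores.
    words = text.lower().replace(".", "").replace(",", "").split()
    return sum(LEXICON.get(w, 0.0) for w in words)

print(document_sentiment("I love this product, it is great"))  # 2.0
print(document_sentiment("I do not love this product"))        # 1.0 -- negation ignored!
```

The second sentence scores as positive because “love” is in the lexicon and “do not” isn’t handled — exactly the kind of out-of-context error that makes barebones document-level sentiment dangerous.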

To build a more robust system with deeper NLP functions, such as category sentiment or theme analysis, you’ll need to use both NLP and machine learning. And that’ll require a heavy investment.

According to Glassdoor, the average salary for a US-based NLP engineer is more than $80,000 (not including benefits and bonuses).


And hiring a data scientist who can handle the machine learning/AI side of things will be even more costly – well into six figures, according to Glassdoor.


So, to build your own semi-custom NLP application that’s actually capable of delivering what you need, you’re looking at:

  • Cost: $200,000+ (hiring an NLP engineer + other developers)
  • Time: months to years
  • Usefulness: very limited without major additional work
  • Headaches: massive

The Benefits of Working With an NLP Vendor

Why invest hundreds of thousands and months or years building an in-house system, when you can get better results today by working with an experienced NLP vendor?

Companies like Lexalytics offer cloud APIs and on-premise software libraries that are built for easy integration. DataSift, for example, integrated our on-premise Salience engine in just 4 days.

Related article: How to Choose an AI Vendor: 4 Questions to Ask

What’s more, instead of just selling you a suit off the rack, we’ll tailor it to meet your exact requirements and goals (hence “semi-custom application”). We combine an established NLP platform with an experienced professional services team to build NLP solutions that are perfectly tailored to your team, organization or industry.

Why does this matter? Because when you let someone else do the work of paying constant attention to all the moving parts that make up a text analytics system, you have a lot more time to build cool products and take care of customers. In short, you have a lot more time to do your job.

Solutions like ours have also been heavily optimized for scalability. This means that as your operation grows, so will your NLP system, without performance disruption.

Lastly, many off-the-shelf NLP solutions are easy to customize and tune. Our platform, for example, can give you access to a graphical configuration interface or even the underlying configuration files. This means you can fit our solutions into your own dashboard or application, all on your own.

Buying an off-the-shelf NLP/text analytics solution:

  • Cost: $20,000 (simple, low-volume cloud processing) to $150,000+ (complex semi-custom application)
  • Time: weeks to months
  • Usefulness: customized to your specific needs
  • Headaches: none (unless you’re dehydrated)

Build or Buy for Natural Language Processing?

So, should you build or buy for text analytics? Let’s write up some quick pros and cons.

Building a natural language processing system

Pros: great opportunity to learn as much as you can about text analytics and natural language processing

Cons: time-consuming and extremely expensive, particularly in the long run; probably won’t result in a usable text analytics system

Buying an NLP solution/working with a vendor

Pros: fast results; cost-effective; frees you up to focus on solving other business problems

Cons: maybe not as fun (if you’re interested in learning how NLP works)

Summary

If you need the best possible insights from your text data as quickly and as cheaply as possible, choose an off-the-shelf text analytics/NLP solution (in the cloud or on-premise depending on your needs). This choice will free you up to worry about other business problems, such as increasing revenue, reducing churn, and building cool products.

If you’d like further discussion of whether you should build or buy for text analytics, download our white paper: Build or Buy for Text Analytics

Related article: Cloud or On-Premise for Data Analytics?
