Build vs Buy: Text Analytics & NLP


Let’s say you’re an engineer or data scientist tasked with adding natural language processing (NLP) to your company’s products or building a custom application that needs some sort of text analytics capability. Or maybe you’re a data analyst who needs to perform text analysis on a bunch of unstructured documents (surveys, comments, etc.). Should you build your own NLP system or buy “off-the-rack”?

In this article I’ll outline the dilemma and the pros and cons of both options.

Summary & Key Take-Aways

Building your own NLP system from the ground up:

  • System cost: $0 (open source) to $100,000+ (off-the-shelf components)
  • Expertise cost: $81,000+ (hiring someone with NLP skills + other developers)
  • Time: months
  • Capabilities: limited without major additional work

Buying and integrating an NLP solution:

  • System cost: $10,000 (basic text analytics) to $150,000+ (custom NLP application)
  • Expertise cost: None (your vendor will do most of the work)
  • Time: days or weeks
  • Capabilities: both deep and broad, custom to your needs

If you take nothing else away from this article, let it be this:

For detailed insights from text documents, such as multi-layered sentiment analysis, complex categorization or entity recognition on ambiguous words and phrases, buy and integrate a solution from an experienced NLP vendor.

To run basic document-level sentiment analysis, categorization or entity extraction on a handful of documents, go with a basic cloud NLP tool or, if you have the time and expertise, go ahead and build your own from open source.

What is a “Build or Buy” Question?

A build or buy question is a choice of whether to build your own version of something, or buy a pre-built (“off-the-shelf”) solution from another company.

People face build or buy questions every week: Cook dinner or order delivery? Manage my own investments or pay someone to do it for me? Do my own laundry or bring it to a cleaner? The choices we make each time depend on the details of each circumstance.

For example, imagine you need a new work computer. One option is to build your own. You can do some research about what parts to buy, go shopping, and then do some more Googling to figure out how to put everything together.

This can be a lot of fun, especially if you’re an electronics geek. But is building your own computer a good choice for your primary work machine?

Probably not. Because when you need to get work done, you can’t wait until you’ve figured out how to troubleshoot the blue screen of death or whatever other issue is currently plaguing your cobbled-together machine.

Now, if you’re interested in how computers are built, or if you have specific requirements that can’t be met by an off-the-shelf machine, it may be worth your time to build and fix it yourself.

But your job at the office isn’t to hone your computer-building capabilities; it’s to turn the thing on and get back to whatever it is you’d been working on the day before. For that, you need a computer that just works.

Another way to frame this is as building or fixing up a car versus buying one. Sure, it might be fun to build your own internal combustion engine from scratch. But you need to get to the grocery store today. The same basic logic applies to the choice of “build or buy” for text analytics and natural language processing.

The Basics of Text Analytics

Much like a car, any text analytics system worth its salt involves a huge number of complex moving parts. When you buy a complete off-the-shelf solution, most of these are taken care of by the vendor. But if you’re going to build a text analytics system from scratch, you have to account for all of them. Here are the 7 basic functions of text analytics:

  1. Language Identification
  2. Tokenization
  3. Sentence Breaking
  4. Part of Speech Tagging
  5. Chunking
  6. Syntax Parsing
  7. Sentence Chaining

Each of these plays a vital role in supporting larger natural language processing features.

To illustrate this, here’s a simplified view of the text analytics/NLP feature stack at Lexalytics:

[Figure: Lexalytics NLP technology stack]

This diagram shows the layers of processing each text document goes through to be transformed into structured data.
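To make the lowest layers concrete, here is a minimal sketch of just two of the seven functions listed above, sentence breaking and tokenization, written in plain Python. It is not how Lexalytics or any other vendor implements them; the regular expressions and the function names break_sentences and tokenize are assumptions chosen for illustration.

```python
import re

# Naive sentence boundary: ., ! or ? followed by whitespace and a capital letter.
# Illustrative only: this rule knows nothing about abbreviations or decimals.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Naive token: a word (optionally with an internal apostrophe, as in "wasn't")
# or a single punctuation character.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def break_sentences(text):
    """Split text into sentences using the naive boundary rule above."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


if __name__ == "__main__":
    sample = "The food was great. The service wasn't! Would I go back?"
    for sentence in break_sentences(sample):
        print(tokenize(sentence))
    # The same rule wrongly splits after an abbreviation:
    print(break_sentences("Dr. Smith arrived."))
```

Even this toy version shows why the work adds up: the boundary rule cuts "Dr. Smith arrived." into two sentences, and every fix for cases like that is another rule or model to write and maintain.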

All this, and we haven’t even begun to discuss the role of machine learning in natural language processing.

Language detection, Part of Speech tagging, and named entity recognition all require machine learning models to achieve reasonable accuracy. Each model must be trained on a data set consisting of tens of thousands of hand-tagged documents.

And if you want to analyze more than one language, each and every one will require its own models. And you’ll have to keep updating them as people start using words in new and weird ways.

The point is, NLP is more than just “great = positive, bad = negative”. But maybe you just need a basic sentiment analysis tool. What will that cost you to build?

The Cost of Building a Sentiment Analysis System

It’s not that difficult to configure a basic rules-based sentiment analysis system. If you work from open source and ignore some of the more complex text analytics functions, it can go pretty quickly. In fact, this can be a good option if you have the required technical knowledge and don’t mind devoting hundreds or thousands of person-hours to the project.

(What is rules-based sentiment analysis?)
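As a rough illustration of what “rules-based” means here, the sketch below scores a document by summing word scores from a small lexicon and flipping the sign of a word that directly follows a negator. The lexicon entries, their scores, the negator list, and the function name score_document are all hypothetical; a production system would need thousands of scored phrases and far more rules.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score.
LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}

# Hypothetical negators: flip the score of the word that immediately follows.
NEGATORS = {"not", "never", "no"}


def score_document(text):
    """Return the sum of lexicon scores, with one-word negation handling."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token in LEXICON:
            value = LEXICON[token]
            if i > 0 and tokens[i - 1] in NEGATORS:
                value = -value  # "not great" counts as negative
            total += value
    return total


if __name__ == "__main__":
    for text in [
        "The food was great",
        "The food was not great",
        "The food was not very good",  # the negation rule misses this one
    ]:
        print(text, "->", score_document(text))
```

The third example is scored as positive because “not” is two words away from “good”. Closing gaps like that, one rule at a time, is where the person-hours go.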

Similarly, most of the Big Tech companies (including Google, Amazon, and Microsoft) offer cloud NLP services. If your analytical needs are simple, you don’t need on-premise processing, and you’re not working with very many documents, these systems can be an efficient and cost-effective choice.

But the fact is that both of these approaches are inherently limited. Open source NLP is good for simple use cases, but the cost/benefit analysis is clear: creating a team and actually building out the capabilities is prohibitively expensive and time-consuming for any but the largest enterprises. Similarly, cloud providers are good at solving low-volume use cases where you only need one or two basic NLP features. But when you need more complex analytics, or if this is a core strategic capability, they simply won’t support you.

To make matters worse, it turns out that basic text analytics, such as document-level sentiment analysis, has a lot of drawbacks. For example, sentiment scores can be extremely misleading when taken out of context. That means purely rules-based document-level sentiment can be downright dangerous if you’re making core strategic decisions based on that data.
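Here is a small sketch of how a single document-level score can hide what a reviewer actually said. The lexicon, the sample review, and the function names are made up for illustration, and the scorer is a deliberately simple word-sum, not any vendor’s method.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score.
LEXICON = {"great": 1.0, "love": 1.0, "slow": -1.0, "rude": -1.0}


def score(text):
    """Sum the lexicon scores of every word in the text."""
    return sum(LEXICON.get(t, 0.0) for t in re.findall(r"[a-z']+", text.lower()))


def score_by_sentence(text):
    """Score each sentence separately, splitting after ., ! or ?."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return [(s, score(s)) for s in sentences]


REVIEW = (
    "The food was great and I love the decor. "
    "The service was slow and the staff were rude."
)

if __name__ == "__main__":
    print("document:", score(REVIEW))
    for sentence, value in score_by_sentence(REVIEW):
        print(value, sentence)
```

The document as a whole scores 0.0, which looks neutral, while the two sentences score 2.0 and -2.0. A dashboard built on the document-level number would miss both the praise for the food and the complaint about the service.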

To build a more robust system with deeper NLP functions, such as category sentiment or theme analysis, you’ll need to use both NLP and machine learning. And that’ll require a heavy investment. According to Glassdoor, the average salary for a US-based NLP engineer is more than $80,000 (not including benefits and bonuses).

[Figure: Glassdoor NLP Engineer average salary]

And hiring a Data Scientist who can handle the machine learning/AI side of things will be even more costly, well into six figures:

[Figure: Glassdoor Data Scientist average salary]

So, to build your own NLP system that’s actually capable of delivering what you need, you’re looking at a hundred thousand dollars or more just in expertise costs, in addition to months of labor and headaches.

  • System cost: $0 (open source) to $100,000+ (off-the-shelf components)
  • Expertise cost: $81,000+ (hiring someone with NLP skills + other developers)
  • Time: months
  • Capabilities: very limited without major additional work

Buying & Integrating an NLP Solution

Why invest hundreds of thousands and months or years building an in-house system, when you can get better results today by working with an experienced NLP vendor?

Companies like Lexalytics offer cloud APIs and on-premise software libraries that are built for easy integration. DataSift, for example, integrated our on-premise Salience engine in just 4 days.

Related article: How to Choose an AI Vendor: 4 Questions to Ask

What’s more, instead of just selling you a suit off the rack, we’ll tailor it to meet your exact requirements and goals (hence “semi-custom application”). We combine an established NLP platform with an experienced professional services team to build NLP solutions that are perfectly tailored to your team, organization or industry.

Why does this matter? Because when you let someone else do the work of paying constant attention to all the moving parts that make up a text analytics system, you have a lot more time to build cool products and take care of customers. In short, you have a lot more time to do your job.

What’s more, solutions like ours have already been heavily optimized for scalability. This means that as your operation grows, so will your NLP system, without performance disruption.

Lastly, many off-the-shelf NLP solutions are easy to customize and tune. Our platform, for example, gives you access to a graphical configuration interface or even the underlying configuration files. This means you can fit our solutions into your own dashboard or application, all on your own.

Buying and integrating off-the-shelf NLP/text analytics:

  • System cost: $10,000 (basic analytics) to $150,000+ (custom NLP application)
  • Expertise cost: none (your vendor will do most of the work)
  • Time: days or weeks
  • Capabilities: deep and broad, custom to your needs

Build or Buy For Natural Language Processing?

So, should you build or buy for text analytics and NLP? Let’s write up some quick pros and cons.

Building a natural language processing system

Pros: great opportunity to learn as much as you can about text analytics and natural language processing

Cons: time-consuming and extremely expensive, particularly in the long run; probably won’t result in a useable text analytics system

Buying and integrating an NLP solution from a vendor like Lexalytics

Pros: fast results; cost-effective; frees you up to focus on solving other business problems

Cons: maybe not as fun (if you’re interested in learning how NLP works)

Summary

If you need deep insights or specific analytical capabilities, such as custom categorization or on-premise processing, it’s actually more cost-effective to buy and integrate an off-the-shelf text analytics solution. Choosing to buy and integrate will free you up to focus on your real work, such as increasing revenue, reducing churn, managing regulatory compliance, and solving process automation challenges.

If you’d like further discussion of whether you should build or buy for text analytics, download our white paper: Build or Buy for Text Analytics
