Build or Buy for Text Analytics & Natural Language Processing?

Let’s say you’re a software developer or architect tasked with adding text analytics/natural language processing (NLP) functionality to your company’s product. Should you build or buy?

In this article, I’ll outline the dilemma and weigh the pros and cons of both options.

TL;DR Version for Busy People

Building your own text analytics/NLP system:

  • Cost: $200,000+ (hiring an engineer with NLP skills + other developers)
  • Time: months to years
  • Usefulness: very limited
  • Headaches: massive

Buying an off-the-shelf text analytics/NLP system:

  • Cost: under $20,000 (typical simple deployment) to $150,000 (typical complex deployment)
  • Time: days to weeks
  • Usefulness: easily customizable
  • Headaches: none (unless you’re dehydrated)

Summary: If you need to get the best possible insights from text data as quickly and as cheaply as possible, choose an off-the-shelf text analytics/NLP system (in the cloud or on-premise depending on your needs).

Related article: Cloud or On-Premise for Data Analytics?

Now, let’s explore this in more depth.

What is a “Build or Buy” Question?

A build or buy question is a choice of whether to build your own version of something, or buy a pre-built (“off-the-shelf”) solution from another company.

You and I are faced with a lot of build or buy questions every week. The decisions we make each time depend on the details of each circumstance. But there are some common threads we can follow.

For example, a while back, I decided to build my own computer.

I did some research, went shopping, and then did some more Googling to figure out how to put the parts together.

It was a lot of fun (partly because I’m a huge electronics geek). But am I writing this article from that computer?

No. I’m on my store-bought MacBook.

Because when I need to get work done, I can’t wait until I’ve figured out how to troubleshoot the blue screen of death or whatever other issue is currently plaguing my cobbled-together machine.

Now, if I were only interested in how computers are built, or if I wanted to become a computer repair technician, it might be worth my time to fix the computer I built myself.

But when I get to work, my goal isn’t to learn computer repair; it’s to open a Word file and start writing this article. For that, I need a computer that just works.

Alternatively, think about building (or fixing up) versus buying a car.

Sure, it might be fun to build my own internal combustion engine from scratch. But I need to get to the grocery store today.

In fact, the same basic logic applies to most “build or buy” questions. Which brings us to your choice to build or buy for text analytics.

The Barebones Basics of Text Analytics

Much like a car, any text analytics system worth its salt involves a huge number of complex moving parts. When you buy an off-the-shelf solution, a lot of these are hidden (unless you go rooting around the engine bay).

But if you’re going to build a text analytics system from scratch, you have to account for all of them. Here are the seven basic functions of text analytics:

  1. Language Identification
  2. Tokenization
  3. Sentence Breaking
  4. Part of Speech Tagging
  5. Chunking
  6. Syntax Parsing
  7. Sentence Chaining

Each of these plays a vital role in supporting higher-level natural language processing features:

  • Sentiment analysis
  • Named entity recognition
  • Categorization (topics and themes)
  • Intention extraction
  • Summarization
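
To make two of the basic functions listed above a little more concrete, here is a minimal Python sketch of sentence breaking and tokenization. It is a deliberately naive, regex-only illustration and not how Lexalytics (or any production engine) implements these steps; the helper names `break_sentences` and `tokenize` and both patterns are assumptions made up for this example.

```python
import re

# Naive patterns, assumed for illustration only.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")      # whitespace that follows ., ! or ?
_TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")      # words (with simple contractions) or punctuation


def break_sentences(text):
    """Split text wherever whitespace follows sentence-ending punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Return word tokens and individual punctuation marks, in order."""
    return _TOKEN.findall(sentence)


text = "The battery is great. Dr. Smith didn't like the screen!"
sentences = break_sentences(text)
for sentence in sentences:
    print(sentence, "->", tokenize(sentence))
```

Notice that this naive sentence breaker splits after the abbreviation “Dr.”, treating it as a sentence of its own. Handling edge cases like that, across every language you support, is where much of the real effort goes.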

Here’s a simplified view of the text analytics/NLP feature stack at Lexalytics (we’re actually at 25 languages now):

Fig. 1 – Lexalytics’ text analytics technology and NLP feature stack, showing the layers of processing each text document goes through to be transformed into structured data.

All this, and we haven’t even begun to discuss the role of machine learning in natural language processing.

Language identification, part-of-speech tagging, and named entity recognition all require machine learning models to achieve reasonable accuracy. Each model must be trained on a data set consisting of tens of thousands of hand-tagged documents.

Now, if you want to analyze more than one language? Every new language will require its own trained model. And you’ll have to keep updating them as people start using words in new and weird ways.

Point is, text analytics is a complicated beast. But let’s try to simplify.

The Cost of Building a Basic Sentiment Analysis System

If you’re satisfied with a barebones tool, it’s not that difficult to configure a basic rules-based sentiment analysis system.

(What is rules-based sentiment analysis?)

In fact, if you buy a pre-tagged sentiment library and ignore some of the more complex text analytics functions, the work will go relatively quickly.

Even then, it’ll take you at least 12 to 18 weeks just to generate a basic document sentiment score.

To make matters worse, it turns out that purely rules-based sentiment analysis has a lot of drawbacks.

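For one, sentiment scores can be extremely misleading when taken out of context: a review that raves about one feature and pans another can net out to a neutral score that reflects neither opinion. That makes a basic document-level sentiment score a dangerous beast.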

(And let’s be honest: your amateur rules-based system is going to break almost as soon as you hit “go”.)
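
To see how quickly a purely rules-based approach goes wrong, consider this minimal Python sketch of a lexicon-based document scorer. It is a toy under stated assumptions, not a description of any real product: the `LEXICON` words and weights and the function name `document_sentiment` are invented for illustration.

```python
import re

# Toy sentiment lexicon; the words and weights are made up for this example.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}


def document_sentiment(text):
    """Sum lexicon weights over every word; ignores negation and context."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0.0) for word in words)


print(document_sentiment("The screen is great and I love the battery."))  # 4.0
print(document_sentiment("The screen is not good."))                      # 1.0
print(document_sentiment("Great camera, terrible battery."))              # 0.0
```

The second review is negative but scores positive because the rule set knows nothing about negation, and the third contains two strong opinions that cancel out to zero. Patching each case with another rule is possible, but the list of cases never really ends.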

In the end, to build a useful sentiment analysis system, you’ll need to use both natural language processing and machine learning. And that’ll require a heavy investment. Remember, data scientists don’t come cheap!

Building your own text analytics/NLP system:

  • Cost: $200,000+ (hiring an NLP engineer + other developers)
  • Time: months to years
  • Usefulness: very limited
  • Headaches: massive

The Benefits of Buying Off-the-Shelf Text Analytics/NLP

Why spend hundreds of thousands of dollars and months or years building something fragile and barely usable? Especially when you can get better results today by buying an off-the-shelf system.

Companies like Lexalytics offer cloud APIs and on-premise software libraries that are built for easy integration. DataSift, for example, integrated our on-premise Salience engine in just 4 days.

When you let someone else do the work of paying constant attention to all the moving parts that make up a text analytics system, you have a lot more time to build cool products and take care of customers. In short, you have a lot more time to do your job.

What’s more, most of these pre-built offerings have already been heavily optimized for scalability.

Our own on-premise Salience and cloud Semantria products are great examples. Both tools process billions of documents a day around the world, and they do it fast. We’re able to handle these loads because we’ve spent years tweaking and tuning our software to handle complex language as efficiently as possible.

Lastly, a good off-the-shelf text analytics system will offer easy customizability and tuning. Salience and Semantria, for example, give you access to the underlying configuration files. This means you can shape our solutions to slot right into your platform, dashboard, or application.

Buying an off-the-shelf text analytics/NLP solution:

  • Cost: <$20,000 (simple, low-volume cloud processing) to $150,000 (complex, high-volume on-premise installation)
  • Time: days to weeks
  • Usefulness: easily customizable
  • Headaches: none (unless you’re dehydrated)

Build or Buy For Text Analytics/NLP?

So, should you build or buy for text analytics? Let’s write up some quick pros and cons.

Building a Text Analytics System

Pros: great opportunity to learn as much as you can about text analytics and natural language processing

Cons: time-consuming and extremely expensive, particularly in the long run; probably won’t result in a usable text analytics system

Buying an NLP Solution

Pros: fast results; cost-effective; frees you up to focus on solving other business problems

Cons: maybe not as fun (if you’re interested in learning how NLP works)

Summary

If you need the best possible insights from your text data as quickly and as cheaply as possible, choose an off-the-shelf text analytics/NLP solution (in the cloud or on-premise depending on your needs). This choice will free you up to worry about other business problems, such as increasing revenue, reducing churn, and building cool products.

If you’d like further discussion of whether you should build or buy for text analytics, download our white paper: Build or Buy for Text Analytics
