What is Text Analytics – Part 1

I’m sure this seems a strange title for a post, given that Lexalytics and this blog have been around for quite a while now, but I’ve been noticing a shift in the questions I get about our technology. Until recently, most of the folks we talked to had a very good idea of what a text analytics engine provided. As Text Analytics has gone more mainstream, though, I’ve seen an uptick in folks needing a briefing on the basic definitions of the various features like classification, categorization, entity recognition, themes, concepts, sentiment and relationships.

So, I’m going to use a couple of blog posts to provide a basic overview of these concepts. I’m going to start with Classification, Categorization, Themes and Concepts, and then use my next post to review Entity Extraction, Sentiment Scoring and Relationships. Fair warning… I may raise the blood pressure of a few librarians with my rather loose definitions, but I think people need general-use definitions to help them map these concepts to their day-to-day business needs.

So let me start by annoying the librarians right out of the blocks… For most of the world, Categorization and Classification are the same thing. I’m going to stick with the term classification, but it could just as easily be categorization. The simple definition is:

            • Classification is basically a group of technologies designed to fill in the missing piece of the following: “This document is about BLANK”.

Of course it’s rarely as easy as that, but the basic idea of Classification/Categorization is to put a given document into a subject-oriented bucket. These buckets may be organized as a flat set of topics, or as a more complex hierarchy of taxonomy nodes, like:

    • Sports
      • Baseball
      • Football
      • Basketball
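
To make the “This document is about BLANK” idea concrete, here is a minimal sketch in Python. It is not how Lexalytics or any particular engine works; it simply counts keyword overlaps. The `TAXONOMY` table, its keywords and the `classify` function are all hypothetical names invented for this illustration, and the table covers only two of the nodes above.

```python
import re

# Hypothetical taxonomy: each node path maps to a few illustrative keywords.
# Real classifiers use far richer models than a keyword list.
TAXONOMY = {
    ("Sports", "Baseball"): {"inning", "pitcher", "homer"},
    ("Sports", "Football"): {"touchdown", "quarterback", "kickoff"},
}

def classify(text):
    """Return the taxonomy path whose keywords overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_path, best_score = None, 0
    for path, keywords in TAXONOMY.items():
        score = len(words & keywords)  # number of keywords found in the text
        if score > best_score:
            best_path, best_score = path, score
    return best_path

doc = "The pitcher gave up a homer in the ninth inning."
print(" > ".join(classify(doc)))  # Sports > Baseball
```

Note that the answer is a path through the hierarchy rather than a single label, which is what separates a taxonomy from a flat set of topics.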

Themes or Concepts are sometimes confused with Classification/Categorization, but they are an entirely different animal. Themes are noun phrases that occur within a document. As an example, consider a document about the current state of the economy. The document might be classified as “financial results”, but the themes or concepts from the document would include things like:

    • banking crisis
    • toxic assets
    • stimulus plan
    • bad assets
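
Here is a rough Python sketch of pulling candidate themes like these out of text. Finding true noun phrases needs part-of-speech tagging; as a crude stand-in, this sketch splits the text at punctuation and stopwords and keeps the multi-word runs in between, so it will also pick up verbs and other non-noun words. The `STOPWORDS` list and the `candidate_themes` function are assumptions made for this illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are",
             "for", "on", "with", "that", "by", "from", "as"}

def candidate_themes(text, top_n=3):
    """Return the most frequent multi-word runs found between stopwords and punctuation."""
    counts = Counter()
    # Split at punctuation so a phrase never spans a clause boundary.
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        phrase = []
        # The trailing None flushes the last phrase of each fragment.
        for word in re.findall(r"[a-z]+", fragment) + [None]:
            if word is None or word in STOPWORDS:
                if len(phrase) >= 2:  # keep only multi-word candidates
                    counts[" ".join(phrase)] += 1
                phrase = []
            else:
                phrase.append(word)
    return [phrase for phrase, _ in counts.most_common(top_n)]

text = ("The banking crisis is the result of toxic assets. "
        "The stimulus plan is a response to the banking crisis.")
print(candidate_themes(text))
```

Unlike the classification example, nothing here is defined ahead of time: the themes come straight out of the document’s own wording.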

Classification/categorization and concepts/themes are two different mechanisms for helping users understand what a document is about. A great way to examine and understand these two text analytics features is to try them out on a site like Newssift, where you can navigate financial news content via both categories and concepts.