LexaBlog

Our Sentiment about Text Analytics and Social Media

Submitted by Jeff Catlin on Tue, 2012-02-14 00:40

I’ve long believed that Text Analytics was going to “pop” one day and start showing up in all sorts of applications. I talked with a company yesterday that made me believe that day may be coming soon. The company is “ScreenRetriever” (www.screenretriever.com), and they make a really cool screen-monitoring application that lets parents track their teens’ computer usage on things like Skype and Facebook. As the father of two boys in their late teens, I can say firsthand that it’s something I worry about, though thankfully I have no horror stories on that front. ScreenRetriever has done a really nice job of putting together a clean, easy-to-use application that lets you see what your kids are seeing and saying.
 
Text Analytics comes in as a potential add-on feature for services like this: the text flowing through these applications can be analyzed to spot worrisome patterns, dramatic changes in groups of friends, or new themes in their discussions. A pre-built taxonomy and a fixed look at the world probably won’t get it done in this space, because kids are constantly coming up with new acronyms. It would have to be a “data driven” application where the software discovers interesting terms and threads of discussion on its own. I can’t say if this feature will ever be added, but what piqued my interest is how different this use case is from the standard media monitoring and customer satisfaction applications where Text Analytics is typically found. The technology is getting good enough to start showing up in lots of different verticals, so I’m expecting to see some really interesting use cases in the next year or so.

Submitted by Seth Redmore on Wed, 2012-02-08 20:28

I'm going to write a few blog articles to show how machine learning and natural language processing techniques are used in partnership inside of Lexalytics software.

What is Machine Learning?

Machine Learning (in the context of text analytics) is a set of statistical techniques for identifying some aspect of text (parts of speech, entities, sentiment, etc.). The techniques can be expressed as a model that is trained and then applied to other text (supervised), or as a set of algorithms that work across large sets of data to extract meaning (unsupervised).

Supervised Machine Learning

Take a bunch of documents that have been “tagged” for some feature, like parts of speech, entities, or topics (classifiers). Use these documents as a training set to produce a statistical model, then apply this model to new text. To get improved results, retrain the model on a larger or better dataset. This is the type of Machine Learning that Lexalytics uses. The “supervised” label also covers the ongoing re-training that some models support, where a viewer gives a “star” rating and the algorithm folds that rating into its ongoing processing.
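To make the train-then-apply loop concrete, here is a minimal sketch using a tiny naive Bayes classifier. To be clear, this is not the Lexalytics implementation: naive Bayes is swapped in only because it fits in a few lines, and the function names and the four-document training set are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(tagged_docs):
    """Build a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in tagged_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the most probable label for new text under the model."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus add-one smoothed log likelihood of each word.
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy "tagged" training set (invented for this example).
training_set = [
    ("great product love it", "positive"),
    ("excellent service very happy", "positive"),
    ("terrible product hate it", "negative"),
    ("awful service very unhappy", "negative"),
]
model = train(training_set)
label = predict(model, "love the excellent service")
```

Retraining on a larger or better dataset is just another call to `train` with more tagged documents; the `predict` step doesn't change.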

Some model types you’ll see fairly frequently

  • Support Vector Machines
  • Bayesian Networks
  • Maximum Entropy
  • Conditional Random Fields

 

Unsupervised Machine Learning

These are statistical techniques to tease meaning out of a collection of text without pre-training a model.  Some examples are:

  • Clustering:  Groups “like” documents together into sets (clusters) of documents
  • Latent Semantic Indexing
    • Extracts important words/phrases that occur in conjunction with each other in the text
    • Used for faceted search, or for returning search results that don’t contain the exact phrases used in the search. For example, if the terms “manifold” and “exhaust” are closely related in lots of documents, a search for “manifold” will also return documents that contain the word “exhaust”.
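The “manifold”/“exhaust” behavior can be sketched with plain co-occurrence counting. This is a deliberately simplified stand-in: real Latent Semantic Indexing factors a term-document matrix rather than counting pairs, and the function names, the `min_docs` threshold, and the sample documents are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Count, for each pair of terms, how many documents contain both."""
    cooc = defaultdict(Counter)
    for doc in docs:
        terms = set(doc.lower().split())
        for a, b in combinations(sorted(terms), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def expand_query(term, cooc, min_docs=2):
    """Return the term plus any terms sharing at least min_docs documents with it."""
    related = {t for t, n in cooc[term].items() if n >= min_docs}
    return {term} | related


def search(term, docs, cooc):
    """Return documents containing the query term or any related term."""
    wanted = expand_query(term, cooc)
    return [d for d in docs if wanted & set(d.lower().split())]


docs = [
    "cracked manifold exhaust leak",
    "replace exhaust manifold gasket",
    "exhaust pipe rattles",
    "brake pads squeal",
]
cooc = build_cooccurrence(docs)
results = search("manifold", docs, cooc)
```

Here the third document never mentions “manifold”, but it comes back anyway because “exhaust” co-occurs with “manifold” in two other documents.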

 

How Lexalytics uses Machine Learning:

Lexalytics only uses Supervised Machine Learning, as our system is used for reporting and trending applications as well as for information retrieval. Unsupervised Machine Learning applications tend not to be very useful for trending/reporting, for two reasons: the clusters (and the information reported from them) change whenever the underlying document set changes, and the Unsupervised algorithms can’t work on single documents. So, if you have a document in set A and the same document in set B, the clusters you retrieve for it in set B can be different from those in set A, even though the document is exactly the same.
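Here is a toy illustration of that set-dependence. It is not a real clustering algorithm: it just pairs a document with its most similar neighbor by word overlap (Jaccard similarity), which is enough to show the same document landing in different company depending on what else is in the set. The function names and the sample documents are invented for the example.

```python
def jaccard(a, b):
    """Word-overlap similarity between two documents, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def closest_neighbor(doc, collection):
    """Return the other document in the collection most similar to doc."""
    others = [d for d in collection if d != doc]
    return max(others, key=lambda d: jaccard(doc, d))


doc = "apple releases new phone"
set_a = [doc, "apple pie recipe", "baking bread at home"]
set_b = [doc, "samsung releases new phone", "apple pie recipe"]

# Same document, different surrounding set, different grouping.
neighbor_in_a = closest_neighbor(doc, set_a)
neighbor_in_b = closest_neighbor(doc, set_b)
```

In set A the document gets grouped with a recipe; in set B it gets grouped with another phone story. For a trending report, that kind of shift is exactly what you don't want.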

There are 4 primary applications for which we use Supervised Machine Learning:

  • Part of Speech tagging
    • We use Parts of Speech for a number of important Natural Language Processing tasks (more on that in the next blog post): we need to know them to recognize Named Entities, to extract themes, and to process sentiment. So, Lexalytics has a highly robust model for doing PoS tagging.
  • Named Entity Recognition
    • We use a Maximum Entropy model that we’ve trained on large amounts of news content to recognize People, Places, Companies, and Product entities. It is important to note that the Named Entity Recognition model requires Parts of Speech as an input feature, so this model is reliant on the Part of Speech tagging model.
    • We also have a Conditional Random Field model that can be trained by our customers to recognize entity types that we don’t include in our Maximum Entropy model (e.g. “trees” or “types of cancer”).
  • Document Sentiment
    • This is not a commonly used feature, but we do have the ability to gauge document sentiment based on a statistical model.
  • Sentence-based Sentiment for non-English languages
    • We use phrase-based sentiment in English, because we have access to certain lexical resources that we don’t have in other languages. For other languages, we’ve developed a sentence-based statistical model for sentiment.
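To show what it means for the entity model to take Parts of Speech “as an input feature”, here is a rough sketch of a per-token feature extractor. The feature names are hypothetical, and the tags are typed in by hand (Penn Treebank style) where a real pipeline would get them from the tagger; nothing here reflects the actual feature set inside Salience.

```python
def ner_features(tokens, pos_tags, i):
    """Build the feature dict for token i, using PoS tags as input features."""
    features = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "pos": pos_tags[i],
        # Neighboring tags give the entity model context from the tagger.
        "prev_pos": pos_tags[i - 1] if i > 0 else "START",
        "next_pos": pos_tags[i + 1] if i < len(tokens) - 1 else "END",
    }
    return features


# Hypothetical tagger output, written by hand for this example.
tokens = ["Jeff", "visited", "Boston", "yesterday"]
pos_tags = ["NNP", "VBD", "NNP", "NN"]

feature_rows = [ner_features(tokens, pos_tags, i) for i in range(len(tokens))]
```

Each row would then be handed to a statistical model such as Maximum Entropy or a CRF. The dependency is visible in the function signature: without `pos_tags`, there is nothing to build the entity features from.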

 

Next up:   Natural Language Processing

Submitted by Jeff Catlin on Tue, 2012-01-24 12:47

So I watched an interesting video on some of the automated editorial work that is happening on major sites like Google and Facebook (see the video here: http://www.youtube.com/watch?v=bOE1HFEL8XA). What’s interesting about this talk isn’t the idea that these sites are making decisions about what content you will and won’t see; it’s that they aren’t even telling their users that these decisions are being made.

I found it fascinating and kind of scary that a Facebook newsfeed would automatically prune out all of this guy’s “Republican” content in favor of his “Democratic” content simply because he clicked on the Democratic links more frequently. I already worry that as a society we spend too much time seeking out opinions like our own and far too little time seeking out dissenting opinions.

Now we have the technology helping us down this road.

As a company that provides technology that is often used for this sort of automated editorial work, I feel it’s important to examine the effects of our work to ensure that we’re not doing more harm than good. Summarization and Dominant Ideas are the sort of features that are absolutely required in the world we live in. There is simply too much information flowing by for us to read all of it, so using technology to reduce the stream to a manageable volume isn’t just convenient, it’s absolutely necessary.

The trick is to make sure people understand the potential negatives so that they can make intelligent decisions about how to acquire and digest content.

Submitted by Seth Redmore on Mon, 2011-11-07 07:12

In honor of the Sentiment Analysis Symposium this week in San Francisco (you are going to be there, right?), here's a summary of best practices for tuning sentiment.   These will work for any sentiment analysis system, but you should use ours.

Because it's the best, natch.

1) 2 datasets:  Gather a set of documents and split it in half. Use the first half for tuning, and hold the second half back for testing.

2) 2 people:  Have 2 people tag each dataset for sentiment, and have both of them participate in scoring the sentiment-bearing phrases. That way you mitigate the risk of slanting the tuning too much towards one person's biases.

3) 2 tests:  After re-modelling or modifying the scores of sentiment-bearing phrases, test against the half of your dataset that you did not use for tuning, and check how well the results agree with the tags that the two of you assigned. Then, for your second test, run it against the first half to make sure that you didn't make things worse. You probably didn't, but "mistakes can be made".
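The three steps above can be sketched in a few lines. This is a minimal illustration, not a Lexalytics tool: the phrase-score dictionaries stand in for whatever sentiment system you are tuning, and the function names, the tagged examples, and the choice to keep only documents both taggers agree on are all assumptions made for the example.

```python
import random


def split_in_half(docs, seed=0):
    """Shuffle and split documents into a tuning half and a held-out half."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def agreement(labels_a, labels_b):
    """Fraction of items on which two label lists agree."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def accuracy(score_fn, docs, gold):
    """Fraction of documents where the system's label matches the human tag."""
    return agreement([score_fn(d) for d in docs], [gold[d] for d in docs])


def make_scorer(phrase_scores):
    """Return a function that labels text by summing its phrase scores."""
    def score(text):
        total = sum(phrase_scores.get(w, 0.0) for w in text.lower().split())
        return "positive" if total >= 0 else "negative"
    return score


# Step 2: two people tag the same documents (toy data, invented here).
person_1 = {"love this phone": "positive", "hate the screen": "negative",
            "sick camera": "positive", "awful battery": "negative",
            "sick of the ads": "negative"}
person_2 = dict(person_1, **{"sick of the ads": "positive"})

docs = list(person_1)
tagger_agreement = agreement([person_1[d] for d in docs],
                             [person_2[d] for d in docs])
# Keep only the documents both taggers agree on as the reference tags.
gold = {d: person_1[d] for d in docs if person_1[d] == person_2[d]}

# Step 1: split into a tuning half and a held-out half.
tuning_half, held_out_half = split_in_half(list(gold))

# The system before and after tuning the score of the phrase "sick".
before = make_scorer({"love": 1.0, "hate": -1.0, "awful": -1.0, "sick": -0.5})
after = make_scorer({"love": 1.0, "hate": -1.0, "awful": -1.0, "sick": 0.5})

# Step 3, test one: how well does the tuned system do on unseen documents?
held_out_accuracy = accuracy(after, held_out_half, gold)
# Step 3, test two: did tuning make things worse on the first half?
made_it_worse = accuracy(after, tuning_half, gold) < accuracy(before, tuning_half, gold)
```

Checking `tagger_agreement` first is worth the extra line: if your two taggers disagree a lot, no amount of tuning will make the system agree with both of them.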

Hope to see you in San Francisco in a few days...

Happy Sentimenting!

Submitted by Seth Redmore on Tue, 2011-10-25 18:18

In about 6 minutes, I'll show you how easy it is to configure a concept topic that classifies documents using two different criteria:

1) Is a country mentioned that is in the Middle East?
2) Are there weapons mentioned?  

(Watch in fullscreen unless you have bionic vision and can see what I'm typing in that tiny window below.)

Submitted by Seth Redmore on Thu, 2011-10-06 22:07

I got kinda dissed in the comments for being too vague in my last blog post, with a demand for a video/demo.

We have heard and are responding!

We're working on more, but check out this quick demo of concept topics.

Check this out full-screen so you can see what I'm typing.

More to come.  (To read more on the ins and outs of categorization, snag the categorization/classification/concepts/tags whitepaper).

Submitted by Seth Redmore on Tue, 2011-10-04 22:18

Yes, yes, we've been very quiet.  That's because we've been working on really cool stuff, and now it's released to the marketplace.

Salience Five is our most important release since our last release!  :)

Seriously, though, we've introduced some groundbreaking new features to help you understand what's being said in all that text.

First and foremost, we are the very first text analytics company to truly harness the power of Wikipedia™ in our algorithms.  We've created the world's first Concept Matrix, a distillation of 640,000 Wikipedia™ articles into 1.1 million concepts that we can understand, compare, and extract, using the 56 million links we've discovered between them in Wikipedia™.

So what?  Concept Topics.  That's what.  Say goodbye to tedious and expensive taxonomy management when all you want to do is categorize all Tweets mentioning any kind of food, or tag any article about "crime", or classify any article about "natural disasters".  Concept Topics are going to change how you categorize content. 

So what else? Document Collection Processing.  We're moving beyond things like clustering to provide meaningful analysis that leverages the semantic and conceptual similarities inherent in a collection of related documents.

Which brings us to:  Salience Facets and Concept-based Facet Rollups.  Salience Facets are new to Salience Five.  These are not "search facets", even though they could be used as search facets - they are more than clustering-based search facets.   Salience Facets represent a completely new way of extracting meaning from text.

Whew.  That's a lot of new stuff.  We'll be giving interesting examples of how this can be used over the coming weeks.

Submitted by Jeff Catlin on Wed, 2011-03-30 19:38

Exciting news in the social media monitoring world today with the acquisition of Radian6 by Salesforce.com for some nice money. The move makes all the sense in the world as it gives Salesforce.com a reach into the Marketing/PR world rather than just sales. Radian6 was already the most widely recognized name in the social media monitoring space, and as an independent operator was a logical choice for acquisition. Buzz in the industry has been that they would be acquired sometime this year, but I never expected Salesforce.com to be the purchaser. Going forward I suspect that this will change the SMM marketplace quite a bit.

There is now an 800 lb gorilla in the game, with the money and brand to really put pressure on all of the other SMM vendors. It’s going to be interesting to watch this space over the next 6 months and see if there are any other big mergers or acquisitions.

Hitting closer to home, this is important for Lexalytics because Radian6 is one of our newer customers, and we’re obviously very interested in seeing the relationship succeed and expand. It’s an exciting world out there, and this acquisition should ramp up interest in social media monitoring even more. If a company like Salesforce.com is willing to pay $318M for a social media monitoring company, then they clearly believe that global brands have figured out the need to watch social media, and that this need will translate into more business for them.

Submitted by Seth Redmore on Tue, 2011-03-08 15:43

One of our partners up in the Great White North has taken the leap and is the first to roll out our French language pack. To really sum up what this brings to their customers: Now, an English-only speaker can have access to rigorous metrics from the French language content of interest. That's pretty cool! Link to MediaVantage release

Submitted by Seth Redmore on Tue, 2011-02-22 23:37

Quick quiz - how many of the companies in this article rely on Lexalytics for sentiment analysis technology?

Computerworld: Sentiment Analysis Comes of Age

Answer: 3/5 technology providers (including ourselves, duh), and 3/6 of the social media monitoring companies. Awesome-tastic.

Oh - they're:

  • Endeca
  • Cymfony
  • evolve24/Maritz
  • Radian6
  • DNA13 (now MediaVantage)