LexaBlog

Our Sentiment about Text Analytics and Social Media

Submitted by Seth Redmore on Fri, 2012-05-11 00:36

I ran some analysis on about 330,000 tweets having to do with people going to see/having seen The Avengers.   In case you've been completely deprived of any sort of media recently, it's a superhero movie.

Some might say "The Superhero Movie", but, not having seen it myself yet, I'm not in a position to judge.

I created query topics to look for each of the main characters as well as the actor behind that character.
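(Query topics are configured inside our product; just to make the idea concrete, here's a toy version in Python.  The patterns are invented for illustration, not the actual queries I used.)

    import re

    # Toy "query topics": each topic matches either the character's
    # name or the actor behind them. Patterns are illustrative only.
    query_topics = {
        "Hulk":     r"\b(hulk|mark ruffalo)\b",
        "Iron Man": r"\b(iron ?man|robert downey)\b",
        "Thor":     r"\b(thor|chris hemsworth)\b",
    }

    def topics_for(tweet):
        """Return the topics a tweet matches."""
        text = tweet.lower()
        return [name for name, pattern in query_topics.items()
                if re.search(pattern, text)]

    print(topics_for("Mark Ruffalo was a great Hulk"))  # ['Hulk']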

The following is the table of how each of them did - as you can see, The Hulk was the most popular out of all of the superheroes.   He did, indeed, "smash".

[Image: Superheroes list]

Here are the top themes overall from the tweets:

[Image: Top themes overall]

This actually does a good job of showing why I wanted to create query topics for the superheroes.  Many of their names come out looking more like themes than like proper "names".   Many of these themes aren't particularly useful, so I excluded a bunch of them when doing other sorts of analysis.

Next, I decided to see what themes were most commonly associated with each of said superheroes.   As I said before, I pulled out things like "watching avengers" when I was doing this analysis, as it adds nothing in terms of what people were associating with this character/actor.

[Image: Superheroes vs. themes]

Starting from the top.... 

  • The "bad ass" award clearly goes to The Hulk, with a #2 place for Iron Man. 
  • Captain America clearly had an awkward moment. 
  • "Always angry" goes to the green guy. 
  • Interesting that Loki just barely edged out The Hulk for bad guy. 
  • Iron Man, hitting the pipe again.  (Well, this actually came from a tweet that was floating around asserting "Robert Downey, Jr. started smoking marijuana at the age of 6".   If I were to be really rigorous about this analysis, I would separate the characters from the actors.   But, I'm not going for statistical rigor here.)
  • Black Widow gets the best outfit award
  • And the "sexy ass" award goes to Thor

Remember, themes are exact noun phrases that are present in the content.   Were I to want to spend more time on the analysis, I'd probably set up some more classifiers and start grouping themes into buckets.  But, this quick and easy way allows me to get a feel for what people are saying.

I've also decided to start delving into hashtags.   Here are the top hashtags for the movie:

[Image: Top hashtags]

Nope, not much interesting there.   I am going to be doing more research around hashtags, as there are clearly more and less useful ways to use them.   None of the crosstab work I did with hashtags and sentiment/characters revealed anything interesting.

Speaking of sentiment - where are the sentiment reports?   Well, there wasn't a lot of interesting sentiment.  It was largely positive, with a bit of misclassified negative sentiment due to creative use of profanity.

There you have it.  The Avengers, according to Twitter.

Submitted by Seth Redmore on Tue, 2012-04-03 15:37

Facets are another unique feature of Lexalytics' Salience Engine.  They provide a quick and easy way to get to actionable information about opinions given in reviews and surveys.

Facets rely on "Subject Verb Object" (SVO) processing of content, with a dash of the Concept Matrix thrown in.  SVO processing lets us extract Facets and associated Attributes from sentences like "The service was slow."  In that case there's a Facet "service" with an Attribute of "slow".   Get enough of these in a set of reviews, and they bubble very nicely to the top.   Lexalytics' Concept Matrix helps us automatically associate like facets, even if they aren't the same word - rolling them up together.
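If you want to play with the pattern yourself, here's a minimal sketch of the SVO idea using the open-source spaCy library - an illustration of the pattern, not how Salience actually does it:

    # Minimal facet/attribute sketch via dependency parsing with spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

    def extract_facets(text):
        """Yield (facet, attribute) pairs like ('service', 'slow')."""
        for token in nlp(text):
            # Copular pattern: in "The service was slow", "slow" is an
            # adjectival complement (acomp) and "service" is the subject.
            if token.dep_ == "acomp":
                for child in token.head.children:
                    if child.dep_ == "nsubj":
                        yield (child.text.lower(), token.text.lower())

    print(list(extract_facets("The service was slow.")))
    # [('service', 'slow')]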

In the 6 minute video below, I talk about how Facets work, and give an example based on 2 sets of restaurant reviews from Yelp.   Facets present a very different view of the data than Yelp gives natively - rather than just seeing the featured reviews and a histogram of review grades, you can actually see which restaurant is hard to get into and which restaurant has a rocking gay bar.

Submitted by Seth Redmore on Mon, 2012-03-26 14:30

Hi!  Concept Topics are our revolutionary way to create classifiers for what used to be hard-to-classify buckets - things like politics, food, real estate, business.  Most of our customers need to do some sort of classification: bucketing responses on surveys, determining which area of business is being talked about in the press.

Below is a 3 minute video where I configure classifiers for Politics, Health Care, and Real Estate.  About 1.5 minutes of it is actually making the classifiers; the rest is a brief explanation of how it all works.

[Video: configuring Concept Topic classifiers]

Submitted by Seth Redmore on Thu, 2012-03-01 02:38

Datasift (www.datasift.com) just announced a fancy new service.   They have worked out a deal with Twitter where they can provide 2+ years of historical tweets.

This is exciting for a number of reasons, not least of which is that you can run very interesting trend analysis on streams that you weren't already capturing.  Step back for a moment (figuratively, not literally - I want you to still be able to read what I'm writing) and ponder this.  One of the real problems with doing any sort of analysis on rapidly emerging events is that you have to actually catch them when they first happen.

And the only way to do that used to be to consume and store all of Twitter - a tremendously difficult technical undertaking that, fortunately, you no longer have to take on.   You can ask all kinds of interesting historical questions and see cause and effect.  Perhaps even find a literal "butterfly effect" resident in the Twitter data.

You can see the very genesis of a topic and trace it back to its roots before it became a trending topic.   This is a hugely powerful idea for historians, politicians, and, yes, marketing people.

And therein lies some controversy.   Articles like these:

Daily Mail: Twitter secrets for sale

IT World: Time to clean out old Tweets so they won't define you the rest of your life 

Datasift is NOT providing content from private Twitter streams or tweets that have been deleted.   This means that the only Tweets accessible through the service are completely public ones.

"Secrets" not so much.  Is there really a person on this planet that believes that what they Tweet is anything other than completely public information?  If there are, then they are a strong argument for the need for a license to use the Internet.   Sir, please step away from the keyboard.

This is not new.  Politicians, historians, and evil marketing people have been consuming and storing Twitter feeds since Twitter became popular.   In fact, there are 2 major differences here:  

1) It's everything.

2) Datasift has actually instituted the ability to remove Tweets.   With any other system storing Tweets, they don't get deleted when you delete them.

Oh, wait, I'm sorry.   The fact that it's everything isn't even new.  The Library of Congress stores every single Tweet.  And they don't even have a "don't store deleted Tweets" policy.  So, everything you've ever said on Twitter is stored permanently in a way that can be accessed.  Granted, with "scholarly" aspirations, but it's still there.

When you Tweet, you're using a free platform.  This platform has a Terms of Service that states that Twitter is allowed to do exactly what they're doing.  Twitter is a business that exists to make money; it is not an organization that aims to provide a free messaging platform for the good of all mankind.

But what really annoys me about all this is this sudden burst of OMG THINK OF THE PRIVACY from people who should know better.   I don't know why the EFF has been quoted here, but seriously, people.  If you Tweet something, don't you think the world is watching?  

I can understand the backlash against Google and Facebook with their privacy policy changes.   Twitter has not changed their policy, and the Tweets in question were posted out to the world at large.   

We are clearly biased, as we're close business partners with Datasift, and they're using our software to analyze this heap of data.

We think it's great.   And we're also teaching our children about the value of being circumspect on the Internet.

Submitted by Jeff Catlin on Tue, 2012-02-14 00:40

I've long believed that Text Analytics was going to "pop" one day and start showing up in all sorts of applications.  I talked with a company yesterday that made me believe that day may be coming soon.  The company is ScreenRetriever (www.screenretriever.com), and they make a really cool screen monitoring application for parents to track their teens' computer usage on things like Skype and Facebook.  As the father of two late-teen boys, I can say firsthand that it's something I worry about, though thankfully I have no horror stories on that front.  ScreenRetriever has done a really nice job of putting together a clean and easy-to-use application that lets you see what your kids are seeing and saying.
 
Text Analytics comes in as a potential new add-on feature for services like this, where the text flowing through these applications can be analyzed to spot worrisome patterns, dramatic changes in groups of friends, or possibly new themes in their discussions.  A pre-built taxonomy and fixed look at the world probably isn't going to get it done in this space; the kids are constantly coming up with new acronyms, so it would have to be a "data driven" application where the software discovered interesting terms and threads of discussion.  I can't say if this feature will ever be added, but what piqued my interest is how different this use case is from the standard media monitoring and customer satisfaction applications that Text Analytics is typically found in.  The technology is getting good enough to start showing up in lots of different verticals, so I'm expecting to see some really interesting use cases in the next year or so.

Submitted by Seth Redmore on Wed, 2012-02-08 20:28

I'm going to write a few blog articles to show how machine learning and natural language processing techniques are used in partnership inside of Lexalytics software.

What is Machine Learning?

Machine Learning (in the context of text analytics) is a set of statistical techniques for identifying some aspect of text (parts of speech, entities, sentiment, etc.).   The techniques can be expressed as a model that is trained and then applied to other text (supervised), or can be a set of algorithms that work across large sets of data to extract meaning (unsupervised).

Supervised Machine Learning

Take a bunch of documents that have been "tagged" for some feature, like parts of speech, entities, or topics (classifiers).   Use these documents as a training set to produce a statistical model.   Apply this model to new text.  Retrain the model on a larger/better dataset to get improved results.  (There's a quick sketch of this loop after the list below.)  This is the type of Machine Learning that Lexalytics uses.   This sort of "supervised" approach also covers the kind of re-training that happens with some models where a viewer gives a "star" rating and the algorithm adds that rating to its ongoing processing.

Some model types you’ll see fairly frequently

  • Support Vector Machines
  • Bayesian Networks
  • Maximum Entropy
  • Conditional Random Field
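Our own models are proprietary, but the train-then-apply loop itself is easy to show.  Here's a minimal sketch using scikit-learn as a stand-in (its LogisticRegression is a Maximum Entropy classifier):

    # Supervised workflow sketch: tagged documents -> model -> new text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # "Tagged" training documents: each labeled with a topic.
    train_docs = ["the senate passed the bill",
                  "voters head to the polls tuesday",
                  "the team won the championship game",
                  "the pitcher threw a no-hitter"]
    train_labels = ["politics", "politics", "sports", "sports"]

    # Train: the tagged set produces a statistical model.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_docs, train_labels)

    # Apply the model to new, unseen text.
    print(model.predict(["the senate voted on the bill"]))  # ['politics']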


Unsupervised Machine Learning

These are statistical techniques to tease meaning out of a collection of text without pre-training a model.  Some examples are:

  • Clustering:  Groups “like” documents together into sets (clusters) of documents
  • Latent Semantic Indexing
    • Extracts important words/phrases that occur in conjunction with each other in the text
    • Used for faceted search or for returning search results that aren't exactly the phrases used in the search.  So if the terms "manifold" and "exhaust" are closely related across lots of documents, a search for "manifold" will also bring back documents that contain the word "exhaust" (sketched below)
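Here's a minimal sketch of both techniques, again with scikit-learn as a stand-in rather than anything of ours - note that no labels are supplied anywhere:

    # Unsupervised sketch: clustering plus latent semantic indexing.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    docs = ["replaced the exhaust manifold",
            "the manifold gasket was leaking exhaust",
            "great pasta and friendly service",
            "the service was slow but the food was good"]

    X = TfidfVectorizer().fit_transform(docs)

    # Clustering: group "like" documents into sets, no pre-training.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # e.g. [0 0 1 1]: car docs vs. restaurant docs

    # LSI: a low-rank projection where co-occurring terms like
    # "manifold" and "exhaust" land near each other, so a search for
    # one can surface documents containing the other.
    lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)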


How Lexalytics uses Machine Learning:

Lexalytics only uses Supervised Machine Learning, as our system is used for reporting and trending applications as well as for information retrieval.  Unsupervised Machine Learning tends not to be very useful for trending/reporting: the clusters (and thus the information reported) shift as the underlying collection changes, and the unsupervised algorithms can't work on single documents.  So, if you have a document in set A and the same document in set B, the clusters you retrieve for it can be different in set B, even though the document is exactly the same.

There are 4 primary applications for which we use Supervised Machine Learning:

  • Part of Speech tagging
    • We use parts of speech for a number of important Natural Language Processing tasks (more on that in the next blog post): we need them to recognize Named Entities, to extract themes, and to process sentiment.   So, Lexalytics has a highly robust model for doing PoS tagging.
  • Named Entity Recognition
    • We use a Maximum-Entropy model that we've trained on large amounts of news content to recognize People, Places, Companies, and Product entities.  It is important to note that the Named Entity Recognition model requires Parts of Speech as an input feature, so this model is reliant on the Part of Speech tagging model (there's a quick illustration of that dependency after this list).
    • We also have a Conditional Random Field model that can be trained by our customers to recognize entity types that we don't include in our Maximum Entropy model (e.g. "trees" or "types of cancer").
  • Document Sentiment
    • This is not a commonly used feature, but we do have the ability to gauge document sentiment based on a statistical model.
  • Sentence-based Sentiment for non-English languages
    • We use phrase-based sentiment in English, because we have access to certain lexical resources that we don't have in other languages.   For other languages, we've developed a sentence-based statistical model for sentiment.
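As it happens, NLTK's off-the-shelf pipeline illustrates that same PoS-feeds-NER dependency nicely (this is NLTK, not our models):

    # PoS tagging runs first; its output is the input to the NER model.
    import nltk
    # One-time downloads, if you don't already have them:
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    # nltk.download("maxent_ne_chunker"); nltk.download("words")

    sentence = "Robert Downey Jr. played Iron Man for Marvel Studios."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)     # part-of-speech tagging first...
    entities = nltk.ne_chunk(tagged)  # ...its output feeds the NER model
    print(entities)                   # tree with PERSON/ORGANIZATION chunks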


Next up:   Natural Language Processing

Submitted by Seth Redmore on Mon, 2011-11-07 07:12

In honor of the Sentiment Analysis Symposium this week in San Francisco (you are going to be there, right?), here's a summary of best practices for tuning sentiment.   These will work for any sentiment analysis system, but you should use ours.

Because it's the best, 'natch.

1) 2 datasets:  Gather a set of documents and split it in half. 

2) 2 people:  Have 2 people tag each dataset for sentiment, and have 2 people participate in the process of scoring sentiment-bearing phrases.  That way, you mitigate the risk of slanting the tuning too much towards one person's biases.

3) 2 tests:  After re-modelling or modifying the scores of sentiment-bearing phrases, test against the half of your dataset that you did not use for tuning.  Check to see how well the system agrees with the tags your two people assigned.  Then, for your second test, run it against the first half to make sure that you didn't make things worse.  You probably didn't, but "mistakes can be made".  (A toy sketch of this workflow follows.)
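Here's that toy sketch - the lexicon, labels, and scoring function are invented stand-ins, not a real sentiment system:

    # 2 people: compare the taggers' labels before trusting them.
    from sklearn.metrics import cohen_kappa_score

    tagger_a = ["pos", "neg", "pos", "pos", "neg", "pos"]
    tagger_b = ["pos", "neg", "pos", "neg", "neg", "pos"]
    print("agreement (kappa):", cohen_kappa_score(tagger_a, tagger_b))

    # The phrase scores you'd be tuning.
    phrase_scores = {"great": 0.8, "slow": -0.5, "awful": -0.9}

    def score(text):
        total = sum(phrase_scores.get(w, 0.0) for w in text.lower().split())
        return "pos" if total >= 0 else "neg"

    # 2 datasets / 2 tests: tune against one half, test the held-out
    # half, then re-test the first half to confirm nothing got worse.
    docs = ["great movie", "awful and slow", "great acting", "slow service"]
    gold = ["pos", "neg", "pos", "neg"]
    held_out = zip(docs[2:], gold[2:])
    print([score(d) == g for d, g in held_out])  # [True, True]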

Hope to see you in San Francisco in a few days...

Happy Sentimenting!

Submitted by Seth Redmore on Tue, 2011-10-25 18:18

In about 6 minutes, I'll show you how easy it is to configure a concept topic that classifies documents by two different classifiers:

1) Is a country mentioned that is in the Middle East?
2) Are there weapons mentioned?  

(Watch in fullscreen unless you have bionic vision and can see what I'm typing in that tiny window below.)

Submitted by Seth Redmore on Thu, 2011-10-06 22:07

I got kinda dissed in the comments for being too vague in my last blog post, with a demand for a video/demo.

We have heard and are responding!

We're working on more, but check out this quick demo of concept topics.

Check this out full-screen so you can see what I'm typing.

More to come.  (To read more on the ins and outs of categorization, snag the categorization/classification/concepts/tags whitepaper).

Submitted by Seth Redmore on Tue, 2011-10-04 22:18

Yes, yes, we've been very quiet.  That's because we've been working on really cool stuff, and now it's released to the marketplace.

Salience Five is our most important release since our last release!  :)

Seriously, though, we've introduced some groundbreaking new features to help you understand what's being said in all that text.

First and foremost, we are the very first text analytics company to truly harness the power of Wikipedia™ in our algorithms.  We've created the world's first Concept Matrix: a distillation of 640,000 Wikipedia™ articles into 1.1 million concepts that we can understand, compare, and extract, using the 56 million links we've discovered between them in Wikipedia.

So what?  Concept Topics.  That's what.  Say goodbye to tedious and expensive taxonomy management when all you want to do is categorize all Tweets mentioning any kind of food, or tag any article about "crime", or classify any article about "natural disasters".  Concept Topics are going to change how you categorize content. 
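To make the intuition concrete, here's a toy version of the underlying idea - relatedness between concepts derived from shared links.  The link sets below are invented; the real Concept Matrix distills 56 million links, not a dozen:

    # Toy concept relatedness from shared (invented) link sets.
    toy_links = {
        "pizza":  {"food", "cheese", "italy", "restaurant"},
        "sushi":  {"food", "fish", "japan", "restaurant"},
        "senate": {"politics", "government", "law", "election"},
    }

    def relatedness(a, b):
        """Jaccard overlap of link sets as a crude similarity score."""
        links_a, links_b = toy_links[a], toy_links[b]
        return len(links_a & links_b) / len(links_a | links_b)

    print(relatedness("pizza", "sushi"))   # 0.33 -- both food-related
    print(relatedness("pizza", "senate"))  # 0.0  -- unrelated

A "food" Concept Topic can then accept any concept whose relatedness to "food" clears a threshold - no hand-built taxonomy required.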

So what else? Document Collection Processing.  We're moving beyond things like clustering to provide meaningful analysis that leverages the semantic and conceptual similarities inherent in a collection of related documents.

Which brings us to:  Salience Facets and Concept-based Facet Rollups.  Salience Facets are new to Salience Five.  These are not "search facets", even though they could be used that way - they go beyond clustering-based search facets.   Salience Facets represent a completely new way of extracting meaning from text.

Whew.  That's a lot of new stuff.  We'll be giving interesting examples of how this can be used over the coming weeks.