LexaBlog

Our Sentiment about Text Analytics and Social Media

Submitted by Seth Redmore on Wed, 2013-05-15 23:54

Well, that time of the year has come and gone, and we're all better people for it.

The place:  New York City

The event:  The Lexalytics User Group, 2013

We had a nice crowd, both in terms of size and the general niceness of people :).  I had my mad MC skillz on the mic all day, introducing a number of great speakers and wrapping things up with a view into our new software release: Salience 5.1.1 (details forthcoming).

Videos will be posted as soon as I get 'em, but for now, use your imagination.  

Anna Smith from Bitly led off, with a rousing presentation on what you can do with a view of every page that people are referencing on Twitter.  Not just reliant on the very short 140 characters, they have a great view of the content behind the shortened links.   They have hundreds of terabytes of content stored, and are adding several terabytes of new content each month.   As she demonstrated, there's incredible power in having this view - you can see what people find important to share, and give advice back to content providers and enterprises as to where they're being successful.  Also, cats.

Oleg Rogynskyy from Semantria (our beloved SaaS text analysis partner) showed how they've built a very flexible, scalable system on Salience - bringing our heavy duty text analysis platform within reach of those pesky surveys and small jobs that just aren't worth building a whole system for.  Semantria is a great option for those companies with low volume projects that need access to high-capability text analysis.  Sentiment analysis, categorization, facets, themes - the whole nine yards wrapped up in a nice SaaS platform, and presented as an Excel plugin for the software that everybody uses.

Craig Golightly of Software Technology Group enraptured the group with his low-tech flipchart presentation.  I would argue that he was the most engaging speaker (other than myself, of course) of the day.  He used a fantastic analogy with a "dirty brownie" (no, really, he had brownies up there in front of the group) to show how you can carve out useful data from the messiest dataset.   Craig has lots of experience with voice-to-text, an important feeder technology for text mining.  He's a huge proponent of Voci's system, and after his presentation, we can understand why.

Elizabeth Baran of Lexalytics went over all the work that has gone into our Chinese language pack - from tokenizing words (no word boundaries in the printed text, you know) to handling Chinese idiom, she did a marvelous job of showing just why you can't rely on translation to get the most out of your text.  

Brandon Kane of Angoss showed their predictive analytics system on a #bigdata (I just thought I'd throw in a random hashtag here) set of well over a million tweets concerning various cellphone brands.  He showed interesting correlations between sentiment, Klout scores, gender, and issues like "returns".   This is information that's priceless (well, it does have a price, but work with me here) for marketing people looking to make better decisions about where to put their resources.

Russell Couterier of Cybertap had by far the most questions and really grabbed the imagination of the group.  They've built a product that's in use by a number of sooper-seekret organizations to help ferret out cybercrime, as well as being well tuned for compliance.  They can capture all the packets on very high speed links, completely reassemble the sessions (down to showing you the web pages, emails, instant messages, and the like), and then they pass it through Salience and our text mining capabilities - giving a rich view of just what is being discussed - and popping out all the anomalous and illegal behavior with a nice bright highlight.

Tim Mohler of Lexalytics displayed a few ways that our customers can get more out of the text analysis that Salience provides.  From rolling up discovered themes into coherent buckets to giving an "unlimited dive-in", he showed different ways that our base "set of Legos(tm)" can be used to provide very rich information through multiple processing and aggregation steps.   He also showed some even more experimental work (well, this was all experimental) that uses NHibernate to take a complete snapshot of Salience output and place it in a database so that our customers can root around and experiment with the output.  This is important because it can often be hard to decide what to extract and store as part of a big processing pipeline.  By capturing everything from a dataset, he showed that you can surf around in that dataset and make better decisions about what you should show to everybody.

And then there was me, talking about Salience 5.1.1 (released on the day of the LUG), but that's a topic for another blog post.

Thanks again to all the attendees and thanks very much to the speakers for making our day a success!

Submitted by Seth Redmore on Wed, 2013-04-24 20:39

On Saturday, April 27th, Viafoura is hosting a hack-a-thon to see what interesting ideas developers, product designers, and data scientists can come up with when given large data sets.

Hackers are armed with 10 years' worth of news stories from the Guardian (one of the UK’s largest news publications), computing power from Amazon ($250 credit for each person), a natural language processing tool (Lexalytics-Semantria) and space donated by Ryerson’s Digital Media Zone. Mix all this together and you have all the building blocks to come up with some amazing big data projects.

Amaze us!  Wow us!  Amuse us!  Oleg Rogynskyy, the CEO of Semantria, and I will judge and are both super excited to see what you can come up with!

Submitted by Seth Redmore on Fri, 2013-04-12 00:45

I happened upon (well, really, was fed via LinkedIn) a blog post by Tom Anderson over at OdinText.  I've seen some of his stuff before, and he seems like a reasonable gentleman.

I was kinda surprised by this post:  Is Social Media Worthy of Text Analytics, and thought it would be worth responding to.

To outline what he's saying (it's short, go check it out), it comes down to this:  

  • Coca Cola found that they can't use social media to predict short term revenue.
  • Twitter is lagging, not leading
  • Not many people tweet, so, you're getting a distorted sample
  • And those that are tweeting are trying to sell you on their expertise in managing social media, because, really - who tweets about Coke?  To wit:  "The fact that Twitter even scores as many mentions as it does for products like “Coca-Cola”, which most regular consumers would be unlikely to ever think about any given week, is that there are so many want to be social media marketing guru’s on Twitter and blogs trying to analyze others marketing campaigns – further proving what a peculiar sample blogs and twitter is."

Well, I haven't read the original source about how they were trying to predict revenue, so, I can't really comment on that first bullet.  I'm not sure that using it to "predict short term revenue" is as interesting as using twitter to "find places and events at which people are drinking coke" and market to those folks.

I disagree with the second bullet - it really depends on what you're looking at.  If you're looking for a reaction, sure, it's obviously lagging.  But what if you're looking for second order or future effects (like people talking about what they're going to do this weekend)?  Brand mentions might be mostly lagging, but I'm not even sure about that.

I totally agree with the third bullet - Twitter is a self-selecting group with a large set of biases, I'm sure.

The fourth bullet was the one that I took real exception to.   So, here's what I did - I collected 24 hours' worth of English tweets about Coke or Coca Cola, using the system over at Datasift.

26k tweets.  Ok, it was more like 23 hours, but I was impatient and kinda lazy, and just wanted to do this.

Let's look at what people are talking about... This is a really quick and dirty look at the top themes (important noun phrases), and the sentiment of the themes themselves over time.  This is completely unfiltered.  Color is sentiment, size is number of tweets, and no, you don't get a legend because you've been very bad.

 

 

[Figure: overall themes from coke]

I'm not seeing anything in there about marketing.  Even delving into the verbatims (so to speak) doesn't show much about marketing or social media monitoring (except that "crowdsourcing" and "photo booth" tie-up bits). But I do see a lot of people talking about coke.  :)

Let's do a bit more digging...  Here's the top hashtags:

 

 

[Figure: coke hashtags]

That "IAmSoMiddleClass" bit is an Indian thing, apparently.  The more you know...  We'll get back to the #addicted in a minute.

Datasift has some tech that lets me get gender for some tweets.  Cool.  Let's use that!  Not enough of the tweets are demographically classified to make a pretty picture, so, let's look at charts...

Here's the ladies:

 

[Figure: female themes coke]

(Maybe they're retail dudes, too, but it seems more likely that they're ladies.  No hating on the choice of pronouns, k?)

[Figure: coke gigi hill]

Speaking of dudes, what do they have to say?

 

[Figure: male themes coke]

Yes, more references to drugs, and, well the dudes like to talk about "sausage".  Hm.  Note that we could probably do some word sense disambiguation between the "coke" that is referred to as Bolivian Marching Powder, and the coke that my son likes, but, this is quick and dirty, like I said.

And for completeness' sake, let's do the same thing for hashtags, ladies first:

 

 

[Figure: female coke hashtags]

Note the fact that there are 2 instances of "addict" on there.  (And, yes, they're talking about Diet Coke, not nose candy.)

Let's look at the men:

 

[Figure: male coke hashtags]

Check that out, not even a word about "addiction", and the whole "IAmSoMiddleClass" thing is concentrated with men.

So, my point, if I were to have one, is that real people are really talking about real stuff on Twitter.  Even for brands like Coca Cola, someone is talking, right now, about where/how/when/why they're drinking one.   And in that information comes the possibility to learn/understand/market/connect/sell.

 

 

 

 

 

Submitted by Carl Lambrecht on Fri, 2013-03-15 16:11

Here at Lexalytics, we’re excited to be in beta with Salience Text Analysis for Chinese.

There are many features in our toolkit – sentiment, topic detection, summarization, theme extraction – but sentiment is what we’ve been best known for. With our beta release of text analytics for Chinese content, we decided to measure our document-level sentiment results against human annotations of sentiment, and to compare them with another recently released public engine that also provides automated sentiment analysis of Chinese.

What started as a basic measurement of precision and recall turned into a deeper effort to quantitatively determine how closely our sentiment analysis matches the sentiment judgment of multiple humans.

 

Setup

We gathered 109 pieces of Chinese content from Weibo and blog discussion forums, each of which we annotated as positive or negative. The content was filtered to items that were clearly positive or negative to a human. Our intent was not to measure ability to detect subtle cases of sentiment.

Even though we marked content as only positive or negative, Salience, being phrase-based, can return a neutral result if it is unable to detect any sentiment at all. We think this is a valid approach – sometimes there really isn’t any sentiment.

The other sentiment engine that was tested, Chatterbox, was not observed to return a neutral result; it appears that all content is categorized as positive or negative, with an associated strength score.  It could be argued that you could consider a polar result with a small strength score to be "neutral".

Additionally, the phrase-based approach developed for Salience to assess sentiment in Chinese was developed from longer form text, but shorter text was needed for this test to accommodate a content length constraint imposed by the Chatterbox API.

 

Precision, Recall, and F1

The table below gives the precision, recall, and their harmonic mean (F1) for positive and negative sentiment within the set. The F1 score for positive sentiment and F1 score for negative sentiment are combined to calculate an overall accuracy measure.

[Table: precision, recall, and accuracy comparison, Chatterbox vs. Salience]

These scores are quite good for both engines, which can be attributed in part to the polarity in the test content selected. Salience performs comparably to Chatterbox in terms of positive sentiment, with slightly better performance on the detection and identification of negative sentiment.
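For readers who want to run the same kind of measurement on their own annotations, here is a minimal Python sketch of per-class precision, recall, and F1. The function names and the toy labels are made up for illustration, and this is not the script used for the numbers above. The post does not say how the two F1 scores were combined into the overall figure; the plain average used here (often called macro F1) is only an assumption.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall, and F1 for one class (e.g. "pos" or "neg")."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted, labels=("pos", "neg")):
    """Assumption: combine the per-class F1 scores with a plain average."""
    scores = [precision_recall_f1(gold, predicted, label)[2] for label in labels]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data only. A "neutral" prediction counts as a miss for the gold class.
    gold = ["pos", "pos", "neg", "neg"]
    predicted = ["pos", "neg", "neg", "neutral"]
    for label in ("pos", "neg"):
        print(label, precision_recall_f1(gold, predicted, label))
    print("macro F1:", macro_f1(gold, predicted))
```

Note how an engine that returns "neutral" on clearly polar content loses recall without losing precision, which is one reason the neutral case needs special handling later on.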

As we have developed support for non-English languages, we have done so at a core level of deconstructing the language, and developing the support needed to handle the language natively. We feel this is a much better approach to NLP of non-English languages than using machine translation techniques to apply techniques developed for English to translated text. In order to test this, we ran the Chinese content through the Google Translate API, and then ran the translated text through Salience's standard distribution for English.

 

[Table: precision and recall comparison with translation]

As you can see from the table above, a machine translation approach suffers from any translation issues, particularly when working with phrase-based detection of sentiment where other linguistic modifiers such as negations and intensifiers are taken into account. A model-based sentiment approach may be less affected by these issues, but will be less flexible to use across varied content domains and require more technical tuning effort.

 

Inter-rater agreement

To me, this is the most interesting part of the experiment. Automated sentiment analysis is often compared to human sentiment analysis through precision and recall tests. But that assumes that across humans there would be 100% agreement. In reality, there are discrepancies in the sentiment annotation across multiple humans, and in many cases the same human can mark the same set of documents with slight differences from one day to the next. So we want to measure the consistency of multiple human annotations of the same content, and calculate the consistency of the two automated sentiment approaches to human judgment. After all, if you can’t agree with another human, why expect the machine to agree with you? 

The same content was also annotated by an external contractor, a native Mandarin speaker located in China. Two inter-rater agreement measures were calculated, Krippendorff’s alpha and Cohen’s kappa. These two measures were also featured in an analysis of inter-rater agreement presented by Maritz Research at the Text Analytics World seminar in the fall of 2012.

For our dataset, both Krippendorff's alpha and Cohen's kappa indicated an agreement of 94% across the two humans, showing that even for a relatively small set of very polar content there is not absolute agreement between two humans.

In one particular example, we marked an article that was very prejudicial against Japan and pro-Chinese as negative, because of the emphasis of the prejudice. The contractor based in China, however, considered this to be positive content. Judging sentiment can be tricky, even for humans.

So how did the computers do?

Calculating Cohen’s kappa requires the results to be fully annotated, so for cases in which Salience returns an inconclusive result we considered what agreement would be if that result were taken to be positive or negative.   In other words, "0=neg" means that if Salience returned a result of 0 (one that we would normally consider to be neutral), we consider this result to be "negative".

 

[Table: kappa scores]

Cohen’s kappa also only allows us to calculate agreement between two raters. The conclusion we can draw from this calculation is that humans agree 94% of the time on their ratings of the content, Salience agrees between 74% and 84% of the time, slightly better with one human than another and slightly better when inconclusive results are considered negative (or not-positive). Chatterbox fares worse with about 62 to 64% agreement with humans.
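To make the kappa calculation concrete, here is a small Python sketch. Cohen's kappa compares the agreement actually observed between two raters with the agreement you would expect by chance, given how often each rater uses each label. The function names, the `"neutral"` label, and the toy data are illustrative assumptions, not the code or data behind the table above; `force_neutral` mimics the "0=neg" / "0=pos" mapping described earlier.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items, in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items where the two labels match.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Chance agreement: probability both raters pick the same label at random,
    # based on each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def force_neutral(labels, neutral_as, neutral="neutral"):
    """Map inconclusive results onto a polar label so kappa can be computed."""
    return [neutral_as if label == neutral else label for label in labels]


if __name__ == "__main__":
    # Toy data only.
    human = ["pos", "pos", "neg", "neg", "neg"]
    engine = ["pos", "neutral", "neg", "neg", "pos"]
    print("0=neg:", cohens_kappa(human, force_neutral(engine, "neg")))
    print("0=pos:", cohens_kappa(human, force_neutral(engine, "pos")))
```

Running both mappings side by side is the same trick used above: it brackets the agreement you would get under either reading of an inconclusive result.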

The calculation of Krippendorff’s alpha is more flexible, allowing for gaps in the annotations which accommodate cases in which Salience did not detect sentiment and allowing for determining sentiment across a group of more than two raters.
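As a rough illustration of why gaps are not a problem for this measure, here is a Python sketch of Krippendorff's alpha for nominal labels, built from the standard coincidence-matrix formulation. The data layout (one list of labels per item, with `None` where a rater gave no answer) and the function name are assumptions made for this example, not the tooling used for the charts in this post.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` holds one list per item with the label from each rater;
    None marks a missing rating (e.g. no sentiment detected).
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an item rated by fewer than two raters carries no pairing
        # Each ordered pair of ratings within an item adds 1 / (m - 1).
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    totals = Counter()
    for (label, _other), weight in coincidences.items():
        totals[label] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item rated by two or more raters")

    # Observed disagreement: weight on mismatched label pairs.
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    # Expected disagreement if labels were paired at random.
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)

    if expected == 0:  # only one label was ever used
        return 1.0
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy data only: columns are human 1, human 2, engine.
    items = [
        ["pos", "pos", "pos"],
        ["neg", "neg", None],  # the engine returned no sentiment here
        ["pos", "neg", "pos"],
        ["neg", "neg", "neg"],
    ]
    print(krippendorff_alpha_nominal(items))
```

Because a missing rating is simply skipped, nothing has to be forced into "positive" or "negative" first, and adding a third or fourth rater is just another column.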

[Table: alpha inter-rater agreement]

The most interesting chart is below - so, for multiple raters (more than two), what are the best combinations?  Because we're not forcing a "0" from Salience into either positive or negative, the agreement numbers end up better.   We're at about 75% agreement across humans + Salience + Chatterbox.   (Which says good things about the state of sentiment analysis!)   We're, of course, happiest with the results between two humans and Salience, where we're pushing 90%.

[Chart: alpha scores with more than two raters]

These results show that Salience’s analysis of sentiment in Chinese content correlates well with human judgment. Perhaps not quite well enough to satisfy a Turing condition of generating results which are indistinguishable from those of a human, but certainly close enough to serve as a good starting point from which further phrase-based sentiment tuning can refine results.  

 

Conclusion

We think these results validate our approach of native, phrase-based NLP sentiment analysis for Chinese over a machine translation approach or classification model, and show that on tonal content, Salience is a compelling option. At present, our attention is focused on including named entity recognition for our general release of support for Chinese. We’re pleased with the results of this assessment of our initial document sentiment analysis, and looking forward to bringing the full product to market.

Submitted by Seth Redmore on Thu, 2013-01-03 00:39

Hello all and Happy New Year!

I snagged 412,700 tweets from Twitter from about 16 hours that went from 2300 UTC December 31 to 1500 UTC Jan 1.   This was 5% of the total tweets that went out with the phrase "new year".

Without further ado... The top 50 hashtags were as follows:

[Figure: Happy New Year hashtags]

Nothing terribly surprising, but I had to check to see what some of these were:

  • #UnionJTo775K is Union-J-World, um looks to be a band of some sort who are driving to get to 775k followers with promises of following people back.
  • #VicmasDay10 is Victoria Justice (er... some celebrity?) who is making people jump through hoops and she will follow them if they simper and beg appropriately well.
  • #ICONPOPQuiz is a trivia game
  • #GetGlue is some sort of media app
It's perhaps a tiny bit sad that #GetGlue! beat out #love.   (However, if you include all mentions of "love" in hashtags, then "love" does conquer all.  Well, at least all things commercial.)
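
If you want to try this kind of hashtag tally on your own pile of tweets, a few lines of Python will do it. This is only a sketch under some assumptions: the tweets are already loaded as plain strings, the `#(\w+)` pattern is a simplification of Twitter's real hashtag rules, and the helper names are made up here. It is not the pipeline used to make the chart above.

```python
import re
from collections import Counter

# Simplified hashtag pattern: "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts(tweets):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


def mentions_containing(counts, fragment):
    """Total count of all hashtags that contain `fragment` (e.g. "love")."""
    fragment = fragment.lower()
    return sum(n for tag, n in counts.items() if fragment in tag)


if __name__ == "__main__":
    # Toy data only.
    tweets = [
        "Happy new year! #love #GetGlue",
        "Watching the ball drop #GetGlue!",
        "happy new year everyone #lovelife #Love",
    ]
    counts = hashtag_counts(tweets)
    print(counts.most_common(50))  # the top 50, as in the chart
    print(mentions_containing(counts, "love"))
```

The second helper is the "all mentions of love" trick: it sums every hashtag that merely contains the word, which is how a tag that loses head-to-head can still win in aggregate.
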
Now for themes:

[Figure: themes from Happy New Year dataset]

  • It warms my heart to see that "happy new year babe" beat out "happy new year bitches".  Not by much, but it did.  
  • "Good things" is slightly more popular than "beautiful people".  Speaking as someone that has more stuff than looks, I approve.
  • We see east coast, but not west coast (that's probably because of the time range over which I got content.  I was in a hurry and it was New Year's Eve, what can I say?)
  • So - things like "year yaa", "year juga", and "year kak" come from people mixing other languages with English.  ("Juga" seems to be coming from folks in Indonesia.)
  • And, in terms of what group of people were being wished?  boys<friends<man<hun<girls<mate<folks<tweeps<bitches<babe<people<bro<baby<everybody<guys<everyone
Now you know.

Submitted by Rami Nuseir on Fri, 2012-11-16 19:40

Carl Lambrecht, one of our many language geniuses, wrote up a post on the Lexalytics Dev blog about how dialect differences affect text analytics, especially when taking into consideration the fact that Lexalytics only uses Native Language Packs.

He touches on a couple of great points. For instance, most dialect differences are only apparent during speech and pronunciation, not in writing, and when they do appear, Salience can often figure out what the strange new words are based on context. 

Head on over to our Dev site and check it out for the full scoop.

Submitted by Seth Redmore on Wed, 2012-11-14 20:22

Takers of the 3rd round of funding!  Or 2nd, or fifth, but I'm guessing it's third.   Whatever round it is, congrats to Datasift for building a really great service.   Their GUI for defining streams is going to be really great once it's polished (definitely a work in progress now, but I can really see how it's going to be awesome).  Add to that the ease with which they provision new content sources, and their responsiveness and transparency when we hit problems or concerns.

DataSift, if we weren't already partners, I'd ask you to marry me.

 

 

 

Submitted by Carl Lambrecht on Thu, 2012-11-01 00:00

At Lexalytics, we know it’s not only a global marketplace, but a multi-lingual global marketplace. It’s this understanding which has driven us to extend the capabilities of Salience beyond analysis of English.

Salience was originally developed by Lexalytics to analyze English text, traditionally from online news articles and blogs. Over the continued development and maturation of the product, more and more complex analysis of online content, including Twitter, has been added and improved.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y

The first non-English language we tackled was French. Our first release of support for French provided almost all of the features found in our analysis of English: part-of-speech tagging, novel named entity extraction, theme extraction, summarization, and model-based sentiment analysis. The only feature in our English language support that was not available in this initial release was phrase-based sentiment analysis. As opposed to model-based sentiment analysis, phrase-based sentiment is much more tunable and transparent: you know exactly which phrases influenced the results as well as how, and can tune the phrases to your needs accordingly. An update followed shortly which added improved handling of French social media content.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y
FR (2010)  Y         Y         Y       Model         Y             Y

The next languages we tackled were Spanish and Portuguese, again providing model-based sentiment analysis. These languages have not yet undergone an update to specifically address unique social media complications in those languages, yet they do benefit from the techniques built into Salience itself for adjusting to handle short content.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y
FR (2010)  Y         Y         Y       Model         Y             Y
ES (2011)  Y         Y         Y       Model         Y             Y
PT (2011)  Y         Y         Y       Model         Y             Y

With Salience 5.0, Lexalytics introduced the core feature of the Concept Matrix™, a model of semantic relationships derived from Wikipedia. At the same time, we were developing support for German. Additional research and development allowed us to provide phrase-based sentiment analysis for German, our first non-English language pack to include this capability.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Concepts  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y         Y
FR (2010)  Y         Y         Y       Model         Y             N         Y
ES (2011)  Y         Y         Y       Model         Y             N         Y
PT (2011)  Y         Y         Y       Model         Y             N         Y
DE (2012)  Y         Y         Y       Phrase/Model  Y             Y

Over the course of the summer of 2012, we worked to develop updates to our existing language packs that would bring all of them up to the same feature level. Updates have been released for French (June 2012), Spanish (September 2012), and Portuguese (October 2012), so all currently supported languages now offer the same level of functionality.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Concepts  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y         Y
FR (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y
ES (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y
PT (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y
DE (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y

While our current language packs will see ongoing improvements (better novel entity extraction, better quotation extraction), we have our sights firmly set outside the Romance and Germanic languages. We are hard at work applying our experience and techniques to Chinese, and we are looking to release our initial support in early 2013.

 

Submitted by Rami Nuseir on Mon, 2012-10-29 15:36

Multi-lingual support for text analytics was a hot topic two weeks ago at the Text Analytics World conference in Boston. It seems like everyone mentioned it at one point or another, and with good reason: not many companies have strong natural language processing software.

Luckily, Lexalytics does. Our Native Language Packs include French, Spanish, Portuguese, and German. We are firm advocates of exclusively using Native Language Packs, as they catch nuances that get lost in translation and offer more precise sentiment analysis: we like good results.

We also have a Chinese language pack in the works, to be released sometime in early 2013. To ensure top quality, we've hired Chinese Language Expert Elizabeth Baran to work closely with our team on this project. Feel free to say hello in the comments; she doesn't bite.

Elizabeth is fluent in English, French, and Chinese (Mandarin). In addition to having lived in China for 8 months during a Chinese language immersion program, she has published two papers in the domain of Chinese NLP, which were funded by the National Science Foundation under a Chinese-English Machine Translation grant. She has a major in Chinese Language and a double minor in Linguistics and French from Georgetown University.

Her credits include being a member of the Chinese Natural Language Processing Group at Brandeis University (2010-2012), publishing two papers on Chinese NLP, and presenting at the Machine Translation Summit XIII held in Xiamen, China, for research on automatically predicting noun number in Chinese. In addition to providing exceptional services to Lexalytics, she is also completing coursework for an MS in Computational Linguistics at Brandeis University, while writing a thesis on methods for automatically predicting the temporal location of Chinese verbs.

Our Chinese Language Pack will be in good hands.

Submitted by Carl Lambrecht on Wed, 2012-10-24 19:50

I recently attended Text Analytics World, an annual text analytics conference that took place in Boston two weeks ago.

The two main themes were “big data” and “social media”. There was some discussion around sentiment analysis of social media content, with the vendors and analysts who presented making the following point: accuracy of automated sentiment analysis, particularly in social media content, is difficult to measure and sometimes useless. In fact, we heard several times over the course of the conference that hyperfocus on accuracy leads people away from defining and solving the real business problem they need to solve.

There was a lot of discussion around taxonomies, ontologies, categorization, and in particular automated categorization. Companies that have large repositories of internal unstructured data are struggling to get a handle on and organize that content to make more efficient use of it. The work we are doing for our next release using the Concept Matrix will be a useful tool for helping companies bootstrap their taxonomies, and then adjust from that machine-generated starting point.

Multi-lingual capability was also a hot topic, which is good for us based on the language support we’ve developed over the last couple of years. Analysts cited that there is an increasing need to analyze non-English text, and machine translation doesn’t work well enough, particularly when you compound the other inherent error factors in the analysis of English content.  Meta Brown’s presentation reiterated these needs and concerns with current approaches using machine translation. And a vendor presentation on the analysis of Spanish social media content showed the insights you can extract when you analyze the language in its native form, even accounting for (and gaining even more information from) regional slang.

In terms of “semantic processing”, I think our Concept Matrix is a great strength, and we will continue capitalizing on it. One of the two sponsors was Expert System, represented by Bryan Bell. Bryan pushed the message of using semantic understanding of language to aid other areas of the text analytics stack, mainly categorization, but also entity extraction and sentiment analysis. Interestingly, Expert System claims to support several non-English languages, but does it via internal machine translation and then analysis of the resulting English, which conflicted directly with the discussion we’d heard about handling multiple languages.