LexaBlog

Our Sentiment about Text Analytics and Social Media

Submitted by Christine Sierra on Thu, 2009-03-19 04:00

I’m often asked how someone would use our text analytics and sentiment software within their organization. Most inquiries come because a person knows that analytics on unstructured text is important, but they aren’t quite sure how to work it into their business in order to get meaningful results. Here are a few scenarios to show you how our customers use our software:

Text Analytics - Enterprise Search
Lexalytics Customer Case: Endeca
Situation: You have streams of information coming into your company and it all looks the same. Sometimes you know what to search on, but other times you need to find the hidden information in all that unstructured content. How do you improve your enterprise search capabilities to get even better results?
Solution: Integrating text analytics into the enterprise search application allows you to find information you didn’t even know existed, because it can extract entities like people, places, companies, products and relationships - and that allows you to access information without necessarily knowing what question to ask.

Reputation Management - Market Intelligence
Lexalytics Customer Case: Cymfony
Situation: You know people are talking about your company, your products and your brands, yet you don’t have the time or resources to read a million blogs a day. How do you discover the sentiment contained within all the information out in the blogosphere?
Solution: Lexalytics has spent years refining and improving the sentiment analysis software used by many of the vendors offering reputation management today. By analyzing sentiment at the entity level, and not just the document level, we provide more accurate results from your data.

Classification - Taxonomy
Lexalytics Customer Case: SmartBrief
Situation: You know you have streams of content flowing into your company, but you can’t find an easy way to manage all the various buckets of information. How do you slice and dice the content without taking months to set up a new taxonomy?
Solution: The key to our classification solution is the ability to easily edit the taxonomy to fit your business needs. We don’t believe one method of classification is the be-all and end-all of a solution, so we allow users to select from a variety of methods.

As John Harvey at KMWorld recently pointed out, there may be several reasons why text analytics is important to your company, but I wanted to share the three that we see the most in our business.

Submitted by Mike Marshall on Thu, 2009-03-12 04:00

I’ve long held the belief that having the right mix of people on a development team is essential to that team being able to turn out good code. I’ve seen first-hand the problems that occur if you have a dev team that consists solely of academics - nothing ever gets done because there is always another academic issue to solve - and the converse problems if you have a team full of people who think they are superstars: normally they aren’t, and the ego clashes again prevent any real work from ever getting done :) However, I recently came across a paper (via the blog assertTrue) that tried to break developers down into 3 distinct types. The 3 classifications are, by the very fact that there are only 3 of them, simplistic, but they do cover a fair chunk of the developers I have worked with over the years. They are as follows:

    • THE SYSTEMATIC DEVELOPER: Writes code defensively. Does everything he or she can to protect code from unstable and untrustworthy processes running in parallel with their code. Develops a deep understanding of a technology before using it. Prides himself or herself on building elegant solutions.
    • THE PRAGMATIC DEVELOPER: Writes code methodically. Develops sufficient understanding of a technology to enable competent use of it. Prides himself or herself on building robust applications.
    • THE OPPORTUNISTIC DEVELOPER: Writes code in an exploratory fashion. Develops a sufficient understanding of a technology to understand how it can solve a business problem. Prides himself or herself on solving business problems.

Now, as I said above, I strongly feel that a mix of development styles makes a development team much stronger, and looking at our current team and trying to fit everyone into the three classifications shows that we have a nice mix of all three. Personally I fall strongly into the Opportunistic box, which makes for rapid development, but also leads to situations where somebody asks "What happens if I do this?" and I go, "You know what, that’s a very good question". However, some other members of the team are much more in the Pragmatic box, and are capable of taking the work that I have produced and tightening it up into a much more repeatable product. Which leads to the other thing that makes a strong development team: you have to leave your ego at the door (something I find difficult sometimes). You have to realise that if you have hired someone to be the code optimisation guy, then you have to let him do his job and actually optimise the code, not build little walls around your work going, "It’s mine, I tell you". Fortunately, being busy soon sorts out all those problems; it’s hard to keep the code you worked on under your control when you have 50 other things that you are supposed to be doing as well.

Submitted by Christine Sierra on Thu, 2009-03-12 04:00

Remember the good ole days when, if you wanted to ruin someone’s reputation, you had to go to the local watering hole, whisper an untruth in someone’s ear, and watch the fire spread? Then, when confronted about your inappropriateness, you could claim, “Oh, I was kidding. They shouldn’t have taken me seriously!”

An interesting case is being considered in a NY courtroom, where a professional model is suing for libel because she was insulted repeatedly on a blog, and the defense lawyer for the blogger stated “…that the posts aren’t libelous because they were written in the “youthful, obnoxious, bantering” style that’s typical of the Web and isn’t meant to be taken seriously.” Give me a break. The name of the blog is “Skanks in NY”. If you didn’t want your audience to think you were aiming to ruin someone’s reputation, you might have wanted to think of another name. I don’t know anyone who has ever said, “Thanks for calling me a skank. Want to go grab a drink?” The lawyer also said “…outside of court that her client’s blog garnered little traffic until news of the lawsuit broke. The blog remains available online because no one has requested that it be taken down…”

Now, I don’t know about you, but as a blogger on both my corporate site and my own personal site, my “typical” blog post is absolutely meant to be taken seriously. I may joke around, but I want people to listen, be engaged, and learn something (if possible). And if a post is not to be taken seriously, I usually state that. Like when I wrote an open letter to the President on my personal blog because I’m afraid my 4-year-old son is turning into a stalker, given his recent obsession with the President. It’s obviously not true. However, it also wasn’t insulting or derogatory to the President.

So, if reputation management is key for both the corporate and personal brand - and if the two are the same, as in the case of this model - then when is it okay to slam a “product” on a blog, and when is it outright libelous? And when did we start defending bullying behavior because the “typical Web isn’t meant to be taken seriously”? Therein lies my problem with this case, more than anything. Blogs are like one big playground, or watering hole, and I believe you really should be careful what you say before it comes back to bite you - in the court of law.

Submitted by Carl Lambrecht on Fri, 2009-02-27 05:00

I’ve seen many postings advising companies on listening to their customers, especially as new voice-of-the-customer outlets such as Twitter (can we really still call Twitter “new”?) evolve and grow in usage and maturity. We’ve discussed it on our own blog in the past, and Chris Brogan has some excellent postings about “café-shaped conversations”. The underpinning of these conversations is listening for feedback from your customers or the public at large, hearing what they are saying, and nurturing your dialogue with this audience. But the audience for these articles is generally businesses establishing themselves in the social web. I want to talk to a different audience in this article - the customers - and share some of my experiences in listening to you and hearing your feedback. The quotes are taken from actual emails; names have been withheld to protect the innocent.

“Do you ever sleep?”

Hearing this from a customer is great, because it means that we’ve been responsive to you on your timeline, regardless of what timezone you are in. The truth is, yes, we do sleep…sometimes…with a BlackBerry or laptop nearby. But we are a small company, and we don’t have 24×7 support. Realistically, there will be times when we can’t respond immediately, and there may be cases where that delay makes the issue more of a problem for you. Our commitment is to respond to feedback, comments, questions, and problems as quickly and as thoroughly as we can.

“I feel as though I’m bugging you.”

I’ve seen this comment on one occasion, and something like it on another, and both times I responded immediately in the same way: absolutely not. Not every piece of feedback from a customer (or potential customer) will turn into actionable changes in our products. But each and every piece of feedback, in the form of a comment or an issue raised or a question, gives us the opportunity to see our products through the eyes of an end user. This comment may also indicate that a customer is not getting the answers or the response they need. We pay attention to that too; it means we need to try harder to understand your situation and help you.

“Thanks.”

We love hearing this one (doesn’t everyone?). We’ve gotten this when a customer, or even someone engaged in a trial of our software, has come to us with a question or problem, and we’ve been able to show them how the software can be used to solve their business problem. And with that support from us, they are able to use our product, use our product better, understand their data, or consider a different approach to their data than they might have before. That’s progress, and that is what product support is about.

Conclusion

Why did I write this? Because at the end of the day, we can write all the cool software we want, but it’s our customers, and the business problems they are applying our software to, that make it worth doing. And having a dialogue with you, the customer, is a vital part of us understanding your needs and knowing we’ve been part of a solution. Companies are being told that listening to their customers is more important than ever; I wanted the customers to know we are hearing them.

Submitted by Jeff Catlin on Tue, 2009-02-24 05:00

In recent posts we’ve been devoting a lot of time to our sentiment scoring capabilities, and with good reason: they are our bread and butter. However, sentiment isn’t the only thing we’ve been working on over the last few months. One of the most exciting new features coming out in our 4.1 release of Salience is what I’m calling “The Entity Toolkit”. I say “calling” because we haven’t actually finished the debate on what we’re going to name this tool, so feel free to chime in with ideas.

So what is it, and why am I writing about it? Well, that’s really a two-part answer: first, entity recognition (People, Companies, Places and that sort of thing) is one of the “baseline” features of any text analytics engine, so if you’re gonna call yourself a text analytics vendor you’d better be pretty adept at it; and second, we’ve taken a much different approach with our new toolkit, one which I think is worthy of some explanation. At its base, the detection of entities in text is the process of finding proper nouns of a particular type (company, person, etc.). We’ve had entity recognition in the product for years now, and its quality (precision and recall) was mediocre at best. In the 4.0 release that came out in Q4 08, we began addressing this shortcoming with a new “Maximum Entropy” based training engine for our base entity types (People, Companies, Places, Products), and the effect was significant. Our accuracy improved a ton, especially for People and Places. The only issue was that our users were still at our mercy in terms of the types of recognizers we’d provide. I’m happy to say that this limitation will now be a thing of the past.

With the Entity Toolkit, users can build and train their own entity types (think Diseases, Legal Terms, etc.). I can hear the collective “So what? We’ve been able to build our own user-defined lists in lots of tools for years now.” The difference here is that we’re allowing users to dump in domain-specific content, define their own types of entities for this content, and train an entity recognizer to extract that type of entity based on the context of the content, not through a simple string match. Basically, we’re exposing the maximum entropy model to users through an easy-to-use toolkit: you simply mark up the entities in some sample content and then train a model based on that markup. We believe that this will take entity recognition out of the hands of the geeks and into the hands of the users, where its business value will really begin to shine.
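
To make the idea concrete, here is a minimal Python sketch of what context-trained entity recognition looks like. To be clear, this is not the actual Entity Toolkit API: the inline markup format, the feature set, and all the names are invented for illustration, and scikit-learn’s logistic regression (which is a maximum entropy classifier) stands in for our training engine.

    # Hypothetical markup: [Disease]diabetes[/Disease] inside plain text.
    import re
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    MARKUP = re.compile(r"\[(\w+)\](.+?)\[/\1\]")

    def parse_sample(text):
        """Turn marked-up sample content into (token, label) training pairs."""
        tokens, labels = [], []
        pos = 0
        for m in MARKUP.finditer(text):
            for tok in text[pos:m.start()].split():
                tokens.append(tok); labels.append("O")          # not an entity
            for tok in m.group(2).split():
                tokens.append(tok); labels.append(m.group(1))   # entity type
            pos = m.end()
        for tok in text[pos:].split():
            tokens.append(tok); labels.append("O")
        return tokens, labels

    def features(tokens, i):
        """Context features: the token itself plus its neighbours."""
        return {
            "word": tokens[i].lower(),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
            "cap": tokens[i][:1].isupper(),
        }

    samples = [
        "Treatment of [Disease]diabetes[/Disease] with insulin is routine .",
        "Early [Disease]melanoma[/Disease] responds well to excision .",
        "The clinic is in Boston and it opens at nine .",
    ]

    X, y = [], []
    for s in samples:
        toks, labs = parse_sample(s)
        X.extend(features(toks, i) for i in range(len(toks)))
        y.extend(labs)

    vec = DictVectorizer()
    model = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

    # Tag unseen text: entities are recognised from context, not string match.
    toks = "Patients with lupus should see a doctor .".split()
    print(list(zip(toks, model.predict(
        vec.transform([features(toks, i) for i in range(len(toks))])))))

The point the sketch makes is the one above: the model learns from the words around each marked-up entity, so it can tag entities that never appeared in any list.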

Submitted by Mike Marshall on Sun, 2009-02-22 05:00

In my last post on our new sentiment features, I talked about perceived accuracy for sentiment techniques and sort of fudged around the issue. This is because I’m always wary of giving accuracy numbers for content in a general sense, as accuracy depends on so many factors that are outside your control: types of data, the hand-scorers’ point of view, domains covered, etc. Experience has also shown us that human analysts tend to agree only about 80% of the time, which means that you are always going to find documents where you disagree with the machine. However, having said all that, customers still like to be told a baseline number - it’s human nature, after all, to want to know how something will perform - so I thought I would do a little test using the new model-based system on a known set of data.

As recommended on the Text Analytics mailing list, I used the Movie Review Data put together by Pang and Lee for their various sentiment papers. This data consists of 2000 documents (1000 positive, 1000 negative), and I sliced it into a training set of 1800 documents (900 positive and 900 negative) and a test set of the remaining 200. It took about 45 seconds to train the model, and then I ran the test set against it (using a quick PHP script). Now, bearing in mind this is still experimental and that we plan to make more tweaks to the model, I was pleasantly surprised (ok, I was more than pleasantly surprised) at the results. Our overall accuracy was 81.5%, with 81 of the positive documents correctly identified and 82 of the negative ones. This is right in the magic space for human agreement.

For fun, I then ran the same 200 test documents against our phrase-based sentiment system, expecting a far lower score, but again we performed better than I expected, scoring 70.5% accuracy. With a domain-specific dictionary I’m sure that score could be pushed up towards 80% as well.

So what does all that tell us? Well, it tells us that for specific domain sets you can get very high accuracy levels, though if you ran, say, financial content against the movie-trained database, the results would be far different. It also tells us that the phrase-based sentiment technique produces good results even in its base state against a wide range of content sources (we normally process news data, after all). So, the next stage is to come up with some sort of hybrid, I guess, to give us the best of both worlds. Where did I put that compiler again?
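
If you want to reproduce the shape of this experiment yourself, here is a rough Python sketch. To be clear, this is not our model or our code: it assumes the Pang and Lee polarity dataset has been unpacked locally into its standard txt_sentoken directories, and it uses scikit-learn’s logistic regression (a maximum entropy classifier) as a stand-in.

    # Sketch of the 1800/200 split described above; not Lexalytics code.
    import glob
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def load(pattern, label):
        return [(open(path, encoding="latin-1").read(), label)
                for path in sorted(glob.glob(pattern))]

    pos = load("txt_sentoken/pos/*.txt", 1)   # 1000 positive reviews
    neg = load("txt_sentoken/neg/*.txt", 0)   # 1000 negative reviews

    # 900 + 900 for training; the remaining 100 + 100 are held out for testing.
    train, test = pos[:900] + neg[:900], pos[900:] + neg[900:]

    vec = CountVectorizer(binary=True)
    X_train = vec.fit_transform(doc for doc, _ in train)
    X_test = vec.transform(doc for doc, _ in test)

    model = LogisticRegression(max_iter=1000).fit(X_train, [y for _, y in train])
    print("accuracy:", accuracy_score([y for _, y in test], model.predict(X_test)))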

Submitted by Jeff Catlin on Fri, 2009-02-20 05:00

Sentiment is one of the cornerstones of the Lexalytics business; it’s what brings us to the dance, if you will. To this end, I’m always interested in ways that we can improve the perceived accuracy and ease of use of the system, and in the forthcoming 4.1 release of Salience we are introducing a couple of new things that help with this.

Since the very early days we’ve given customers the ability to add their own sentiment phrases so that they can customise the results to match what they would expect. But as customers have become increasingly au fait with the idea of machine-generated sentiment, they have asked for the ability to capture more complex concepts than a simple phrase can provide. To this end, 4.1 supports the idea of Sentiment Concepts, which are basically a way of defining a concept via a search and applying a sentiment score to the document if that search is satisfied. This enables you to do two things:

    • Bias a document’s score based on the type of document it is. For example, reduce the scores of press releases, since even bad news is described in glowingly positive terms there.
    • Capture a concept such as the phone never being answered by using a simple query such as (phone NEAR "not answered") OR (phone NEAR "rang off").

This fits in with our existing model very well, but it requires you to have a pretty intimate knowledge of the data and the concepts contained in it.
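
To illustrate the mechanics, here is a toy Python sketch of a proximity-based sentiment concept. This is not the Salience query syntax or API: the near() helper, the score weight, and the example text are all invented for illustration.

    # Toy sentiment concept: a proximity test that biases a document score.
    def near(tokens, a, b, window=5):
        """True if phrase a starts within `window` tokens of phrase b."""
        a, b = a.lower().split(), b.lower().split()
        toks = [t.lower().strip('.,!?"') for t in tokens]
        pos_a = [i for i in range(len(toks)) if toks[i:i + len(a)] == a]
        pos_b = [i for i in range(len(toks)) if toks[i:i + len(b)] == b]
        return any(abs(i - j) <= window for i in pos_a for j in pos_b)

    def concept_bias(text):
        tokens = text.split()
        score = 0.0
        # (phone NEAR "not answered") OR (phone NEAR "rang off") -> negative
        if near(tokens, "phone", "not answered") or near(tokens, "phone", "rang off"):
            score -= 0.5
        return score

    print(concept_bias("I called but the phone was not answered for an hour."))
    # -> -0.5: the concept fires even though "phone not answered" never
    #    appears verbatim; a fixed phrase list would have missed it.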

The second new feature we are introducing is the idea of a model-based sentiment system. This basically means you can train up a system on documents that you have hand-classified, and then use it to generate machine-based sentiment going forward.

This is, of course, a pretty common sentiment technique, and it is not without its drawbacks (it requires you to hand-score documents), but if you already have those documents then it does enable you to get up and running pretty quickly with sentiment that is focused on your specific domain and the types of documents within it.

For the 4.1 release we are going to mark this as Experimental, as I’m still playing around with the underlying model (which is Max Ent based) to determine the best feature sets and so on, but it is something that we are going to carry on with going forward. I also intend to extend the technique to allow model-based entity sentiment as well, but that’s further off.

Well, that’s probably enough for this post. If you’ve got any observations on the directions we are taking, or requests for new functionality, feel free to leave a comment.

Submitted by Mike Marshall on Wed, 2009-02-18 05:00

For a service (I hesitate to call it a business - most businesses at least have a vague idea of how to make money, though more on that later) that I don’t really like, I seem to spend a lot of time writing about Twitter. But with the news that they have managed to convince yet more VCs to invest (another $35 million, apparently), I am drawn back like a moth to a flame. Of course, most of the discussion about this latest funding round has revolved around how the investors are ever going to see a return on their investment, given the semi-mythical nature of a business model. According to Biz Stone on the official Twitter blog:

We are now positioned extremely well to support the accelerating growth of our service, further enable the robust ecosystem sprouting up around Twitter, and yes, to begin building revenue-generating products. Throughout this year and beyond, our small team will grow much bigger to meet the challenges and opportunities ahead.

and to this end they recently hired Kevin Thau to head up the business side of the operation. His plan: use the flow of information as the data feed for an analytics platform. Now, from my point of view that sounds like an excellent plan (we’ve had customers ask about mining Twitter recently), and it’s certainly a better one than relying on shaky web advertising models. But will the users of Twitter let the data they are creating be used that way? You just need to look at the explosion of complaints this week when Facebook had the audacity to change their TOS to allow them to keep your content if you leave (they already own all the content while you are on the site, a fact most commentators seem to have missed) to see how the users of Social Media think you owe them for creating something that they use. How do you think Twitter users are going to react if the same sort of thing happens to them? It could be very messy; definitely something to watch with interest.

Submitted by Mike Marshall on Tue, 2009-02-17 05:00

With customers wanting to process more and more content (and more and more content being available), and wanting to do it in as near to real time as possible, the throughput speed of our various components becomes something that we are very interested in. Of course, customers always want more features added as well (Pronominal Co-reference, Sentiment Concepts, etc.), which in turn require more complex engineering solutions (grammatical engines), which in turn require more processing time. Trading off this increased complexity against the need to keep document throughput up can be a challenging task. My internal metric has always been the ability to process 2 normal-sized (1000-1500 word) documents a second in a single thread. If we can keep to that, then scaling becomes a hardware problem: add more cores and you get more throughput. But with all the changes we’ve made over the last couple of releases, it was time to see if we were still keeping up with that. To this end, last week I had one of our engineers profile the various bits and pieces that make up Salience internally, and we came up with some interesting conclusions:

    • Machines based on the Core 2 architecture completely trounce the Pentium 4 architecture with Salience. A 3.4GHz Pentium D was only 20% faster than a 1.66GHz Core 2 machine, and was slower than a 2.5GHz Xeon-based machine.
    • Perhaps not surprisingly, the 64-bit versions of the software were substantially faster than the 32-bit versions, with a document taking 338ms on 64-bit compared with 458ms on a 32-bit platform.
    • Linux and Windows performance was pretty much the same, so nothing to add to the platform debates there :)

The testing also identified (as we hoped it would) a couple of areas where we can make further speed improvements in the upcoming release, so we should be able to get the numbers down even more by the time 4.1 goes out the door. And as for the internal metric? Well, as you can see from the numbers above, even before those performance enhancements we are still at the 2-docs-a-second mark, and on the 64-bit platform we are nearly at 3, which all in all makes me happy. And a happy Mike is a good thing.
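
For the curious, here is a minimal Python sketch of the sort of throughput harness this kind of testing implies. The process_document() function is an invented stand-in for the real engine call; only the measurement pattern is the point.

    # Toy throughput harness: docs/sec for 1 worker vs several; not real code.
    import time
    from concurrent.futures import ProcessPoolExecutor

    def process_document(text):
        # Placeholder for the real work (entities, sentiment, and so on).
        return sum(len(word) for word in text.split())

    def docs_per_second(docs, workers=1):
        start = time.perf_counter()
        if workers == 1:
            for doc in docs:          # single-threaded baseline
                process_document(doc)
        else:
            # Processes rather than threads, so CPU-bound work parallelises.
            with ProcessPoolExecutor(max_workers=workers) as pool:
                list(pool.map(process_document, docs, chunksize=10))
        return len(docs) / (time.perf_counter() - start)

    if __name__ == "__main__":
        docs = ["lorem ipsum dolor " * 400] * 200   # ~1200 words, "normal" size
        print(f"1 worker : {docs_per_second(docs):.1f} docs/sec")
        print(f"4 workers: {docs_per_second(docs, workers=4):.1f} docs/sec")

If the per-document cost holds as you add workers, then scaling really is just a hardware problem.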

Submitted by Mike Marshall on Fri, 2009-02-13 05:00

If, like me, you don’t take the whole Social Media thing too seriously, then Being Five is the cartoon series for you. From the talented guy who does Prune Juice, you can now get a 5-year-old’s take on the whole phenomenon. Check out the site and make sure you grab the feed. A couple of my personal favourites are below - click for the full-size versions.