LexaBlog

Our Sentiment about Text Analytics and Social Media

Submitted by Jeff Catlin on Mon, 2008-07-28 04:00

We spoke with Nathan Gilliatt at the Net-Savvy Executive last week who suggested we check out the post mortem for Monitor110. It was a great recommendation for us for a couple of reasons. First, we are a company working in a similar space and you always wonder why some of your competitors seem to implode. Secondly, it was incredibly honest and that is something refreshing to find these days.

Submitted by Jeff Catlin on Mon, 2008-07-28 04:00

Infonic plc, the UK AIM-listed provider of innovative information management software, announces it has signed an agreement to combine its text analytics division with Lexalytics, Inc., a U.S. company based in Amherst, Massachusetts. The text analytics divisions of both companies will be combined in a UK company which will be called Lexalytics Limited.

Infonic and Lexalytics are providers of proprietary text analytics technologies; both companies’ solutions automatically analyze and summarize the underlying sentiment of large amounts of unstructured data. Their technologies have a variety of powerful applications, such as in the financial services sector, analyzing at high speed the sentiment behind multiple news stories on company stocks. Infonic’s software is being used by Thomson Reuters, Factiva and others to analyze the sentiment behind the content of data feeds.

Lexalytics’ software is used to provide text mining and text analytics functionality to clients in the public relations, marketing, eDiscovery, business intelligence and financial services sectors. Its software is used with the solutions of a number of other providers, such as ScoutLabs, Cisco, Smartbrief and Burrelles Information Services LLC. Lexalytics, Inc. has 14 employees based in Massachusetts.

“We are very pleased to be announcing this merger today. The parties have been working towards this for almost six months and feel that our new company will comprise the most competent team in the text analytics field,” said Mark Thompson, Chief Executive of Infonic. “In Lexalytics we have found a group of people who share our passion for the power of text analytics, particularly sentiment analysis, where we feel the new entity will be the undoubted leader.”

The rationale behind combining the businesses is to pool the expertise and complementary products of the parties in this specialist area and to drive joint growth in sales, utilizing Infonic’s global sales capabilities.

“This is a great deal for both parties. We have experienced increasing demand for our innovative sentiment solutions and combining our corporate and PR product focus with Infonic’s formidable financial sentiment engine allows us to cover a much broader customer base,” commented Jeff Catlin, CEO of Lexalytics, Inc. and the new Managing Director of Lexalytics Limited. “We had looked closely at several ways of growing our business which included conducting detailed negotiations with venture capital firms, however, we feel this gives us greater scale and market-dominance in a much shorter timeframe. We are genuinely excited about what we can achieve together.”

The deal values the merged entity at $40M. Lexalytics Limited will issue a further announcement on completion of the transaction.

Submitted by Jeff Catlin on Mon, 2008-06-23 04:00

The Text Analytics show in Boston last week was a sharp contrast to the one a year ago. The energy at the conference was immeasurably better, and the attendees presented most of the talks, which means we all got to see real-world solutions of Text Analytics in action. I was particularly happy with the technical competence of the attendees, who all seemed to have real problems that they were trying to solve. Whether this is a general trend or an attribute of who the show marketed to is difficult to say, but it was an active and engaged audience.

Perhaps the most dramatic change in this show from the last one was the total domination of Sentiment as the “must have” feature. Last year, almost nobody cared, and this year it was the topic of at least 50% of the talks. Sentiment has definitely moved into the mainstream as a feature that needs to exist in Text Analytics solutions, but the user community is still trying to understand what it’s capable of and what it isn’t. The two things that don’t seem to be common knowledge yet are:

    • Automated Sentiment’s business value is on the fringes (really good or really bad news)
    • Document Level Sentiment isn’t nearly as valuable as Entity Level Sentiment

As these new users continue to dig into sentiment they’ll figure this out, and it will help separate the vendors that really understand sentiment from those that are smashing the feature into their offering because customers are asking about it.
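The second point in the list above is easy to see in a toy example. The sketch below is purely illustrative: the lexicon, the product names ("Acme", "Bolt") and the rule of crediting a sentence's score to every entity it mentions are all made-up assumptions, not a description of how any vendor's engine works.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"great": 1, "love": 1, "reliable": 1,
           "terrible": -1, "broken": -1, "slow": -1}


def score(text):
    """Sum the lexicon values of every word in the text."""
    return sum(LEXICON.get(word, 0) for word in re.findall(r"[a-z]+", text.lower()))


def document_sentiment(text):
    """One score for the whole document."""
    return score(text)


def entity_sentiment(text, entities):
    """Credit each sentence's score to the entities named in that sentence."""
    totals = {entity: 0 for entity in entities}
    for sentence in re.split(r"[.!?]+", text):
        sentence_score = score(sentence)
        lowered = sentence.lower()
        for entity in entities:
            if entity.lower() in lowered:
                totals[entity] += sentence_score
    return totals


review = ("The Acme phone is great and I love it. "
          "The Bolt charger is terrible and arrived broken.")

print(document_sentiment(review))                  # 0
print(entity_sentiment(review, ["Acme", "Bolt"]))  # {'Acme': 2, 'Bolt': -2}
```

At the document level the review looks neutral, because the praise and the complaint cancel out. At the entity level it is clearly good news for one product and bad news for the other, which is the part a brand owner actually cares about.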

All in all a really interesting show, which I’ll continue to write about the rest of this week, with posts about the changing vendor landscape and a more in-depth look at Sentiment and where we think it’s going.

Submitted by Jeff Catlin on Wed, 2008-06-18 04:00

There’s a lot of buzz about measuring consumer generated media. A lot of it is fear-based, predicting that those unaware of the buzz around their product will be sunk by bloggers. Some of it is the inevitable hype around any new thing and will just as inevitably die down as CGM becomes a normal part of the world. What most of it seems to lack is any kind of theoretical basis for doing the actual measuring.

A key statistic for most of the measurement plans is how much of the “blogosphere” is covered by the particular solution under review. No company wishes to be outdone in this arena, and as new forms of content crop up, there is a chase to cover not only more content but a wider variety of it. Three years ago “covering” blogs would have been enough to boast about, but now any serious solution provider should have a plan to cover YouTube, Facebook (despite the terms of usage making it illegal), discussion forums, consumer review sites, Twitter, and probably some I haven’t yet thought of.

The problem with aggregating all of this data is that fundamentally, most of it is of poor quality. The quality problem starts from the bottom - with content scraped from websites, consumed by ad-filled RSS feeds and other sources. Next up the stack is the proliferation of spam blogs, link farms, and other forms of deliberate gaming of the system. Then comes the problem of whether anyone cares about a particular piece of content, even if it is clean and valid content. Finally, how do we measure the content?

There are certain time-tested ways of measuring things where the individual data points are too many or too suspect to assess all of them. I believe that these measures can be applied to CGM, but it requires throwing away the assumptions that all CGM is equal and that anyone should pay attention to the entirety of it. It also means we have to lose the fear that somewhere, someone is saying something nasty or negative. On the internet, it is not a fear that someone is flaming you, it is a certainty - but it doesn’t necessarily mean anything.

The first step is to get back to the notion of sampling the data as opposed to aggregating all of it. A sample has various characteristics that separate it from just a big glob of stuff - it must be representative, it must be clean, it must remain constant over a period of time, and it must have a metric of quantity, such as how strongly someone holds an opinion or how much disposable income a household has. Market researchers, political pollsters, and traditional PR agencies have known and practiced this for decades. Why should it be abandoned in the face of CGM?

The second step is to return to an old traditional media concept - that of trusted sources of information, or influence. Certain sources of information carry more weight with their readers than others and these should be listened to. Those without trust or influence should be excluded from the sample. This is not a radical claim, but because of the lack of any reliable models for influence, the CGM world has been flattened. Ignoring this ignores one of the basic tenets of CGM itself, ironically, as the most successful CGM sites have ways to rate content and its providers. Amazon has “helpful” reviews along with “Top Reviewers” and Digg is a simple popularity poll on stories. Indeed, the problem with CGM metrics is not that there aren’t any, but that there are too many, and none are reliable. As unreliable as circulation figures are for print media, those numbers are considerably more grounded in reality than anything that exists for CGM today.

But what does all of this actually mean in practice? I believe it means that people need to get their hands dirty for each and every situation. Different clients have different needs from CGM, and the solution provider needs to be able to understand the space the client is in, help the client get a valid sample of content sources together, and construct a list of trusted information sources. No automated system crawling the blogosphere will accomplish this; such a system produces no more useful information than the client would get by using Google blogsearch and reading a few hundred random articles.

Submitted by Jeff Catlin on Wed, 2008-06-18 04:00

I just got back from the Text Analytics conference in Boston last night. I was struck by how top of mind sentiment scoring is. Two years ago, sentiment was mostly non-existent and now most vendors are advertising their capabilities. Even better, several end user presentations highlighted the value of understanding the sentiment in text.

Lexalytics has been offering sentiment for almost five years now, and we’ve made most of our sales at least in part due to our sentiment capabilities. I’m happy to see this become a more accepted part of the space and think we’ll see these capabilities pushing in new directions.

Overall, I was excited to see so many more end-users and just more people in general than we have in years past. E-discovery made a real presence and I saw several more traditional data warehousing types in the mix as well. Here’s to a rising tide.

Submitted by Mike Marshall on Thu, 2008-05-29 04:00

The news around town is the recent decision by Dunkin Donuts to pull an ad featuring Rachel Ray and her offensive scarf. If you haven’t heard or read about the story yet, you can catch it here: Dunkin Donuts pulls Rachel Ray ad after complaints. While I strongly support the rights of consumers to blog about what they like and dislike, I find it absolutely absurd that Dunkin Donuts felt pressure to pull an ad for iced coffee because of a wardrobe malfunction - if you can even call it that. If we take our blogging and social media influence to this extreme, I’m afraid we’re misusing its capabilities and missing out on its true purpose and potential.

Submitted by Jeff Catlin on Thu, 2008-05-22 04:00

I’ve just finished reading an interesting take over on TechCrunch on the frequent Twitter outages (yes, this is another post about Twitter, but this time it’s not really knocking them so much as talking about the engineering challenges that they are having), and it occurred to me that the problems they are having are actually similar to problems that we have seen some of our customers struggle with in the text analytics space.

Now, I don’t pretend to know the first thing about Twitter’s architecture, but given it’s built on Ruby on Rails I would guess that every time a tweet is sent in, an SQL query is executed to get the distribution list and then the tweet is sent out. Actually, I hope it’s not that, because that would be a horribly inefficient model, but let’s work on the assumption that’s how they are doing it for now - they must at least be caching the distribution lists for the frequent tweeters. This approach is of course the norm for most Web-based shops, as the LAMP (LAMR?) stack is the dictating factor in many designs, and it can pretty much be seen in a 2007 slide from former chief Twitter architect Blaine Cook. But if you can think around that, then the problem becomes, IMO, a very solvable one, at a very low cost in both engineering time and hardware.

Twitter is effectively a big alerting service: when a piece of text is submitted, all the people subscribed to that feed (for lack of a better word) need to be told that a new message is available. This is no different from, say, creating a Google News alert about your favourite subject and having it send you frequent email updates as new text that matches your query flows in. (Actually, that’s a much harder problem, as in the Twitter case the set of people the alert needs to be sent to is fixed rather than varying by query.)

Recently I built an alerting service POC (I had a few hours to kill!) that takes a piece of text and executes queries against it, telling you which queries hit. That sounds perfectly reasonable, and like something that would be easy to do. However, the telling bit is that I’m able to execute approximately half a million queries against a piece of text (1,000-odd words) in around 40ms, all on a bog-standard desktop PC running on a single core. And yes, increasing cores increases performance linearly, but increasing the number of queries doesn’t increase the run time linearly: going to a million queries only slows it down to approximately 50ms.

The trick of course is not to write it all using SQL and a scripting language but to craft it in good old-fashioned C / C++ - something that seems to be less and less fashionable these days, but still can’t be beaten for high-performance plumbing problems. In the prototype I used good old SQLite as my persistent data store, but in a production system I would use a custom storage format to increase the speed even more. The bottleneck in my alerting system is ensuring that the query you think hits actually does; this of course wouldn’t be a problem in a Twitter-type solution, as it’s a binary query.

This sort of approach is the perfect answer to Twitter’s constant downtime. Breaking away from the LAMR stack and writing some custom code that is designed from the ground up to solve the problem at hand should solve all their dataflow issues using a relatively small number of machines and in a relatively short space of time - certainly less than trying to make a scripting-language-based solution scale. Of course, for those of us who just don’t get / understand Twitter, all that downtime could be seen as a bonus!
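For readers who want to see the shape of the idea, here is a minimal sketch of one common way to run many stored queries against a single piece of text: index the queries by term, then walk the text once. It is an assumption about how such a service could work, not the POC described above. The class name `AlertIndex`, the all-terms-must-appear matching rule and the sample queries are all hypothetical, and it is written in Python purely for readability, so it says nothing about the C / C++ timings quoted in the post.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class AlertIndex:
    """Inverted index over stored queries (hypothetical illustration).

    A query hits when every one of its terms appears in the text.
    """

    def __init__(self):
        self.required = {}                 # query id -> number of distinct terms
        self.postings = defaultdict(list)  # term -> ids of queries using it

    def add_query(self, query_id, query):
        terms = tokenize(query)
        if not terms:
            raise ValueError("query has no terms")
        self.required[query_id] = len(terms)
        for term in terms:
            self.postings[term].append(query_id)

    def match(self, text):
        """Return the ids of all stored queries that hit this text."""
        hits = defaultdict(int)
        for term in tokenize(text):
            # Only queries that share a term with the text are ever touched.
            for query_id in self.postings.get(term, ()):
                hits[query_id] += 1
        return sorted(q for q, n in hits.items() if n == self.required[q])


index = AlertIndex()
index.add_query("q1", "twitter outage")
index.add_query("q2", "sentiment analysis")
index.add_query("q3", "ruby rails")

print(index.match("Another Twitter outage was blamed on Ruby on Rails"))
# ['q1', 'q3']
```

With this layout the work per text depends on the words in the text and on the queries that share a word with it, not on the total number of stored queries. That is one plausible reason why doubling the query count need not double the run time.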

Submitted by Christine Sierra on Thu, 2008-05-15 04:00

With the arrival of spring comes the arrival of the tradeshow and conference season. It appears our plate is full over the next month or so as we travel near and far to meet with colleagues, vendors and prospects. Next week, May 19 - 21, we will be attending the Discover ‘08 show hosted by Endeca in Orlando, FL. We hope you will stop by if you are at the conference. We are proud to announce we are a finalist for the MITX Technology Award in the Analytics and Business Intelligence category. We’ll be attending their awards ceremony on June 3rd in Cambridge, MA. Finally, we’ll be attending the Text Analytics Summit ‘08 in Boston on June 16 - 17 and are looking forward to the wide variety of speakers and roundtables they have planned there. Enjoy the season.

Submitted by Mike Marshall on Tue, 2008-04-15 04:00

The other day we were sitting around the office talking about various blog-related issues (as we do quite often) and I asked some of the younger members of the team which blogs they read. To my surprise they all said that they don’t read blogs. This shocked me even more as this was the engineering team that I was talking to, a bright bunch of 20-somethings that I would have thought would be at the cutting edge of blog consumption. The same day Tim (our VP of Professional Services) sent me a link to this article, which talks about a new survey of people’s blog reading habits with regard to politics (of course we have our own politics tracking site over at http://www.politicaltrends.info) and includes this interesting statistic.

ComScore looked at its online traffic stats and found that 40 percent of all U.S. Internet users visited a blog at least once during February.

Why is this interesting? Well, the flip side of those figures is that 60% of US Internet users never read a single blog! Surely this can’t be right, I thought to myself, so I asked friends and family about it and they all pretty much agreed: they just didn’t read them. However, what they did read and use comprehensively were review sites like Epinions and Yelp. As Alice once said, ‘Curiouser and curiouser!’

Submitted by Christine Sierra on Fri, 2008-04-04 04:00

At Lexalytics, we host a demo site called PoliticalTrends.info that tracks and analyzes content from over 300 political blogs. We use it mainly as a forum to showcase our software and to have a “live” version of our capabilities up and running at all times. As we always do, we track hot themes in the last 3 days on the home page and I noticed that one of the themes over the last 3 days was “yoo memo”. This theme referred to an 81-page memo authored by John C. Yoo that was released to the public and discussed, in part, authorizing torture of government detainees. I certainly don’t want to give the impression that we are in the political analysis business by posting an opinion about politics; however, I did find it interesting that there was a post about the lack of coverage of the memo from traditional media outlets. The fact that it was one of our hot themes in the political blogs, and hardly mentioned on the big news networks, says a lot about how lines are being drawn defining “newsworthy” information. Apply that to business and imagine how difficult it may be for companies or businesses to know what is being discussed in the blogosphere if they only focus on analyzing the traditional outlets. The hottest theme in the blogosphere may not be what is coming in on your news feeds - and while this was about a nationally released government memo - imagine what you could be missing out on regarding your company’s brands or products? Disclosure of Torture Memo Fails to Grab Traditional Media’s Attention