Our Sentiment about Text Analytics and Social Media

Submitted by Christine Sierra on Fri, 2009-04-17 04:00

I’ve been thinking lately about how tweeting has changed my work day routine. By some accounts, I’m probably a late starter to the world of Twitter. To others, I must have a lot of time on my hands if I can spend the day on that social network. And yet many still think tweeting is only done by birds. It’s amazing how much has changed in the last six months. When I joined Lexalytics two years ago, there wasn’t a marketing resource in the company. The team had been working feverishly to enhance the technology. I was employee number 7 or 8, and it was a challenge to identify the core market for our software. Where were we going to fit and how was I going to present our software to the world? About the same time, the social media sites and social networks were finally emerging as reliable sources of information and opinions. What used to be a scratch-and-peck search process to uncover valuable market news and research was now more readily available through alerts and RSS feeds. And then there was talk of Twitter. What the *tweet* was it all about? It wasn’t until the Boston New Marketing Summit in ’08 - now known as the Inbound Marketing Summit - when fascinating communicators like Laura Fitton (@pistachio) and Chris Brogan (@chrisbrogan) simplified the tweeting process, that I started to get more serious about it. Mix in a little push from my PR firm’s social media champion Joel Richman (@xylem) and suddenly I was a twitter-er. I had actually been dabbling in Twitter with a network of TwitterMoms, so I had some idea of the social aspect, but not the power of the business aspect. Now that I’ve been tweeting a bit longer - and do still wonder “Who really cares what I tweet?” - I thought I would at least outline how it has changed my everyday work routine and the value I see in joining the world of Twitter. Believe it or not, I do see an efficiency factor to it all. Sort of like the introduction of call waiting to the phone. Remember?
Before it existed we’d get busy signals and have to try calling back, over and over again. How did we ever get anything done at work? I also discovered the following along the way:

    1. Community Mind Share: I’m still a firm believer that face-to-face interactions are irreplaceable, but it’s hard to ignore the power of a platform where you can jump into a community and engage with people through a variety of opinions, reactions, questions, answers and statements. I don’t know if I believe having a million followers makes you a better member of Twitter, but I do know that interacting with your community, however big or small, opens up an incredible mind share. 140 characters can lead you to a world of knowledge - whether in text, video or pictures. Join in! All it took was following two or three people on Twitter; I would scan their following lists and grow from there. Pretty simple.
    2. Networking Opportunities: Where are you going to be tonight? Unless you were on the VIP email distribution list, it was often hard to know where folks were gathering, and what they may be discussing. Thanks to TweetUps and other networking announcements, it’s easier and more efficient to plan your calendar and join in that face-to-face mind share that still exists. In our area, @BostonTweetUp has been a fantastic source. But sometimes it’s just one person, like @BobbieC tweeting about her Mass Innovation Nights (@MassInno) that can get you hooked.
    3. Faster Research and News: Breaking News! While some items must still be taken with a grain of salt, most of the published blog posts and news stories linked through the 140 character tweets offer insightful information and resources. And, they are arriving at my desktop faster than ever before. Want to get quick tips to Tweeting? You got it. How about new product releases or beta sites in your industry? Done. Want to interact with journalists or breaking news agencies? You can do that, too. Traditional news sources such as CNN (in case you hadn’t heard) are on Twitter @cnnbrk and more innovative sources like @mashable are there, too. Take your pick.
    4. Lead Generation: Seriously. You can find leads on Twitter. I swear. And not by being a spam-tweeter shouting about your super-biotic health pills, but by actually engaging in conversations. About anything. Sometimes that random common thread turns into a potential lead. ROI has been a big topic, but I think if companies would sit back and consider how long it used to take for the “perfect marketing collateral” to hit a prospect’s mailbox, snail or electronic, get read, get a reply (maybe) and lead to engagement - well, you get the point. What do you lose by talking - about work, weather, traffic or food (Jeff Cutler (@jeffcutler) can attest to the interest in a picture of a breakfast plate)? You just never know who’s listening (or watching)!
    5. Honest Inspiration: The world is a pretty gloomy place these days. Sorry to be the one to tell you. Bad news swirls through the local and network news every night. I’ve found that on Twitter I’m inspired by stories of hope, charity, laughter and encouragement. I have never logged off thinking “Wow. Twitter was a bummer today.” And I believe having a positive outlook, all around, can only help motivate you in the workplace.

I gain knowledge, insight and inspiration from many different tweeters every day, all of whom I’d love to send your way. But instead, just check out your favorite person’s following list and discover some on your own. And while I try to focus on learning more about work-related information during the 9 to 5, there’s always a little time to play. I mean, it is only 140 characters - and I do work from home - so it’s my virtual water cooler to talk about American Idol or the funny at-home mishap. Now, time to get back to tweeting! And if you’re curious, you can find me at @christinelexa or our company at @lexalytics.

Submitted by Jeff Catlin on Mon, 2009-04-13 04:00

I’m sure this seems a strange title for a post, given that Lexalytics and this blog have been around for quite a while now, but I’ve been noticing a shift in the questions I get about our technology. Until recently most of the folks that we talked to had a very good idea of what a text analytics engine provided, but as Text Analytics has gone more mainstream I’ve seen an uptick in folks needing a briefing on the basic definitions of the various features like classification, categorization, entity recognition, themes, concepts, sentiment and relationships. So, I’m going to use a couple of blog posts to provide a basic overview of these concepts. I’m going to start with Classification, Categorization, Themes and Concepts, and then use my next post to review Entity Extraction, Sentiment Scoring and Relationships. Fair warning… I may raise the blood pressure of a few librarians with my rather loose definitions, but I think people need general use definitions to help them map these concepts to their day to day business needs. So let me start by annoying the librarians right out of the blocks… For most of the world Categorization and Classification are the same thing. I’m going to stick with the term classification, but it could just as easily be categorization. The simple definition is:

            • Classification is basically a group of technologies designed to fill in the missing piece of the following: “This document is about BLANK”.

Of course it’s rarely as easy as that, but the basic idea of Classification/Categorization is to put a given document into a subject-oriented bucket. These buckets may be organized as a flat set of topics, or as a more complex hierarchy of taxonomy nodes, like:

    • Sports
      • Baseball
      • Football
      • Basketball

Themes or Concepts are sometimes confused with Classification/Categorization but are an entirely different animal. Themes are noun phrases that occur within a document. As an example consider a document about the current state of the economy. The document might be classified as “financial results”, but the themes or concepts from the document would include things like:

    • banking crisis
    • toxic assets
    • stimulus plan
    • bad assets

Classification/categorization and concepts/themes are two different mechanisms to help users understand what a document is about. A great way to examine and understand these two text analytics features is to try them out on a site like Newssift, where you can navigate financial news content via both categories and concepts.
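The distinction is easy to see in code. Below is a deliberately naive Python sketch (a toy, not how any production engine, Lexalytics’ included, actually works): classification maps a document into one of a few predefined buckets via keyword lists, while theme extraction pulls candidate noun-phrase-like word pairs straight out of the text itself.

```python
import re

# Hypothetical two-bucket taxonomy for illustration only.
TAXONOMY = {
    "financial results": {"economy", "assets", "stimulus", "banking"},
    "sports": {"baseball", "football", "basketball"},
}

def classify(text):
    """Put the document into the first bucket whose keywords it mentions."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    for bucket, keywords in TAXONOMY.items():
        if words & keywords:
            return bucket
    return "uncategorized"

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "as", "on"}

def themes(text):
    """Very crude theme extraction: adjacent non-stopword pairs.
    A real engine would use a part-of-speech tagger to find noun phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    return [
        f"{w1} {w2}"
        for w1, w2 in zip(words, words[1:])
        if all(w not in STOPWORDS and len(w) > 3 for w in (w1, w2))
    ]
```

Run on the economy example above, the classifier lands the document in the “financial results” bucket, while the theme extractor surfaces phrases like “banking crisis” and “toxic assets” - the same text, viewed through two different lenses.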

Submitted by Christine Sierra on Fri, 2009-04-10 04:00

Text Analytics Leader Lexalytics Releases Salience 4.1, Offering Entity Management Toolkit for Better Entity Detection

Company’s release provides support for user-generated entity extraction models that will save businesses money and offer more control over presentation of information.

Amherst, MA, April 6, 2009 — Lexalytics, Inc. (www.lexalytics.com), a software and services company specializing in text and sentiment analysis, announced today it has released Salience 4.1. This latest version of the popular Salience Engine includes a powerful Entity Management Toolkit used to build and train entity recognizers by understanding the user’s data. This feature will reduce the amount of money spent preparing content for applications, and will give users more control over how their applications are presented. Currently, Salience technology is widely used by professionals in the marketing, public relations, enterprise search and financial services industries to manage brand reputation and improve the business value of information. Salience 4.1 will solve entity extraction problems for companies that process specialized content. The Entity Management Toolkit allows users to easily mark up documents, specifying entities that are important to them, and the system will then make suggestions on other entities. Once enough documents have been marked up to build a model, the system is ready to process the entire corpus of data. The user doesn’t need to write a single line of code for this process to work. “With the release of Salience 4.1 and the Entity Management Toolkit, we have improved the way companies create their own intellectual property,” said Jeff Catlin, CEO, Lexalytics. “Until now, users have had to rely on the vendors to provide recognizers, but the vendors are not the content and domain experts.
The key to this release is in empowering content owners to expose the value of their own content, particularly in market and industry segments.” Salience 4.1 also includes separation of user customizations from default data files for easier management, and performance improvements to document processing time. A free trial version of the Lexalytics text analytics software is available at www.lexalytics.com.
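The press release doesn’t describe the toolkit’s internals, but the mark-up-and-suggest workflow it outlines can be illustrated with a deliberately naive sketch (all function names here are hypothetical, not the Salience API): learn the context words that surround user-marked entities, then suggest other capitalized tokens that appear in similar contexts.

```python
from collections import Counter

def context_words(doc, entity, window=2):
    """Collect words appearing within `window` tokens of each mention of `entity`."""
    tokens = doc.split()
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok.strip(".,") == entity:
            lo, hi = max(0, i - window), i + window + 1
            ctx.update(
                t.strip(".,").lower()
                for j, t in enumerate(tokens[lo:hi], lo)
                if j != i
            )
    return ctx

def suggest_entities(docs, marked, window=2):
    """Given user-marked entities, suggest other capitalized tokens that occur
    in similar contexts - a crude stand-in for a trained recognition model."""
    learned = Counter()
    for doc in docs:
        for ent in marked:
            learned.update(context_words(doc, ent, window))
    suggestions = set()
    for doc in docs:
        for tok in doc.split():
            word = tok.strip(".,")
            if word[:1].isupper() and word not in marked:
                # Counter & Counter keeps words common to both contexts.
                if sum((learned & context_words(doc, word, window)).values()) > 0:
                    suggestions.add(word)
    return suggestions
```

For example, marking “Pfizer” in “Pfizer reported strong earnings today.” teaches the sketch the context “reported … earnings”, so it will suggest “Merck” from “Merck reported weak earnings today.” - the same annotate-then-suggest loop the toolkit describes, minus the real machine learning.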

Submitted by Christine Sierra on Thu, 2009-04-09 04:00

It has been a busy couple of weeks as I have been emerging from hibernation and finding my way back out into the wild. There seem to have been a variety of events happening in and around Boston, signaling the renewed spirit that comes with spring. It has been refreshing. We exhibited at the kick-off event for Mass Innovation Nights last night in Waltham. It was a pleasant surprise to see the number of people milling around the Charles River Museum of Industry. It was especially fun for me to return to my hometown and see what has changed and what is the same. Walking down Moody Street was pretty surreal. All this activity also got me thinking about my own social habits and what makes me most comfortable when I network. I attended the MarketingProfs Digital Marketing World last week, which was literally a tradeshow online, complete with exhibit booths and background chatter. And while I enjoyed the sessions and thought they did an excellent job pulling it together, I found it incredibly rude that attendees in the virtual auditoriums used their “semi-anonymity” to make fun of and mock the presenters. I highly doubt they would be doing the same thing if they were sitting in a room listening to that person, so why did they feel they could do it virtually? Has the social network aspect of our society made it acceptable to publish unflattering or disrespectful remarks because we can stay behind a wall where no one really gets to know us? Last night was the complete opposite. This event was marketed solely through social media. We blogged, tweeted and retweeted to get supporters. I know our company was excited to see the outcome. What I found even more exciting was that I was going to have the chance to shake the hands of people who, up to this point, had been a 1x1 photo on Twitter.
I don’t live close to Boston proper, so I have missed many of the @bostontweetup events in our area, but this night certainly proved that you can mix social media outreach with real-life interactions and find success. I’ll probably continue to attend the virtual events, as I think they have some benefits - particularly the clear cost benefit of not having to travel across the country to attend - but those who attend should keep in mind that if you wouldn’t say it to someone in person, you probably shouldn’t type it about them online. Very distasteful. Check out other reactions to Mass Innovation Nights from TweetWorks, The Lost Jacket or search #MIN on Twitter.

Submitted by Christine Sierra on Fri, 2009-04-03 04:00

On Wednesday, April 8, Lexalytics is pleased to be taking part in the launch of Mass Innovation Nights. We will be showcasing our latest release, Salience 4.1, and answering any questions you may have about how text or sentiment analysis can help your organization or business. Mass Innovation Nights are a series of events designed to harness the power of social media to showcase Massachusetts-based innovation, with a monthly event at the Charles River Museum of Industry and Innovation, 154 Moody Street, Waltham, MA (www.crmi.org). We will be joined by these other Charter Members:

    • IBM/Lotus
    • Invention Machine
    • LuckyCal
    • Mass High Tech
    • Noteflight
    • Practical Solar
    • Tweetworks
    • WiBlocks
    • Xelago

We hope to meet you there!

Submitted by Mike Marshall on Thu, 2009-04-02 04:00

As part of our ongoing plan to increase our coverage for support requests, I’ve recently had to take the plunge into Linux. I’ve always been a Windows guy and that’s where I’m most comfortable (Kevin, on the other hand, is pretty much the reverse, living in his SSH sessions), but needs must, so Linux it is. Not wanting to go the whole hog at this point, I decided to just create a virtual machine using Virtual PC, with Ubuntu as my distribution, and (foolishly) thought it would be an afternoon’s worth of work. Silly me! Virtual PC, whilst fantastic with Windows installs, doesn’t seem to play as well with Linux, and to start with I couldn’t even get the installer from the live ISO to run. However, Google as always came to the rescue and I found a fantastic article on TechRepublic that showed me what command line params I needed to add to get the installer up and running, and an hour or so later I had a fully working Ubuntu VM, albeit at some poor graphics resolution. Of course that was only the first step; I then needed to get SVN onto the machine (managed that first time), download all of our source and then build it. This is where I hit my next major wall. I grabbed the Boost package we use (1.34.1), went to compile it and hit error after error. Thanks have to go to Kevin for sitting on the end of Skype helping me diagnose them, but they included:

    • I had gcc installed by default, but needed to install g++ as well
    • Missing development libraries for python, bzip and zlib
    • Problems with Boost 1.34.1 and gcc 4.3; this required applying a patch file we found on the Boost bug db (we had to install patch as well :-)) and then going into various regex headers and adding #includes

However, after several hours’ slog I got Boost to build, created the necessary symbolic links and was ready to build Salience. “Sure, it will be easy,” says Kevin, mumbling that he hasn’t actually built it with gcc 4.3 before. Nope, not easy. From where I’ve got to at the moment, it looks like headers that were included by default in earlier versions of gcc are no longer included, so it’s going to be an edit job to get it up and compiling. Don’t know about running yet, though my task tomorrow will be to download our new tarball and see if I can get that to work just following the instructions in the wiki. One last thing: after doing my install, Ubuntu wanted to install 273 patches. I found that very amusing for some reason. More later.

Submitted by Mike Marshall on Thu, 2009-04-02 04:00

Just a very quick note to point you in the direction of a fantastic resource that Paul (one of the other members of the engineering team) pointed me at the other day. It’s part of the ICWSM 2009 Data Challenge and is a set of 44 million blog posts generated between August and October 2008. It’s in XML and is only 142 gigs after you uncompress it. Could form the basis of a great test set a la TREC, maybe?

Submitted by Tim Mohler on Wed, 2009-03-25 04:00

At a customer request, I spent some time looking at Newsgator’s API. The customer wanted to see if they could get blog and social media content about a specific company from a single place. Currently, they are searching Google Blogs, Twitter, etc. The model they had in mind was Moreover’s company feeds, for example this one on Dupont. Unfortunately, I couldn’t get clean content. There was a lot of foreign-language content and I could not specify the language I desired. There was obvious spam and so forth - all classic problems of information retrieval on the web. Moreover’s free feeds are a lot cleaner due in no small part to content selection by humans, but to really get something you want, you need to pay and customize the feed. Most customers don’t want to see every single article about a company - they have a target area. You might say this is an opportunity for a Reputation Management or Media Intelligence platform. There are many, at various price points. However, for a lot of people, that’s a much bigger pipe of information than they really want and comes with a lot of overhead of its own. Could it be possible/worthwhile for a content aggregator to set up an infrastructure that allows more customization, takes care of enough of the cleanup, categorization, etc., and is cheap enough to be ad-supported, like Moreover’s free feeds? Something that could provide a “pulse” of information on a target area you are interested in without involving things like historical reporting, influence graphs, campaign management, and other unwanted content? Has Google, with all its problems in searching social media, done well enough to make this a losing proposition for anyone else? I’d be interested in hearing what other sources people go to for relevant, aggregated information without investing in a full-blown vendor solution. Or maybe I missed some key features of Newsgator and would welcome learning how to get cleaner results.

Submitted by Mike Marshall on Wed, 2009-03-25 04:00

Over the past couple of days several CMS companies have been coming clean about their products, due in no small part to this blog post, which ties in nicely with Jeff’s post a while back about transparency amongst the Text Analytics vendors. To this end I thought it would be fun to get the ball rolling on our side of the fence and see what sort of response we could get.

The Rules

So slightly modified from the original rules:

  1. A Text Analytics vendor is challenged to honestly answer all items on the “Reality checklist for vendors” below.
  2. If possible, the vendor has to supply screenshots, links or other means to make it easy to verify the answers.
  3. The answers also need to be supplied in a short form of one to three stars (denoting “no”, “sort-of”, “yes”).
  4. Answering all questions on their blog allows the vendor to tag some other Text Analytics vendors.
  5. A tagged vendor should provide a link back to the blog that tagged them.

Questions and Answers

Right, so question and answer time…

  1. Platforms: We support various flavors of Windows (XP, Server 2003 and Vista) together with Linux. We support both 32- and 64-bit machines natively.
  2. Ease of Install/Licensing: Ever since V1.0 we have tried to make the installation process as simple as possible, bundling everything into a standard installer package on Windows (built with the ever-lovely NSIS). On Linux everything is packaged up as a standard tarball: just unpack it, set an environment variable and you are ready to go. It really is a 30-second job. Licensing is only slightly more involved, but at the end of the day involves getting a license file from us and saving it to some place on your machine. A recent quote from a client: “I think it took me longer to download and untar it than it did to get a quick sample running.”
  3. Ability to start processing customer data and give meaningful results straight from the box: First impressions are important, and while the best results from text analytics are going to require at least some degree of tuning to your particular content of interest, that tuning shouldn’t be necessary for a first look at results. Our Salience install provides a sample application that can be used right out of the box on your own content to judge exactly how the default model will operate. A command line utility is also provided that can be scripted to automate proof-of-concept testing and results analysis.
  4. Flexibility of entity extraction: There are many exciting applications for text analytics, and just about every customer we’ve come across has a different focus when it comes to their content. Many text analytics vendors offer standard entity recognizers for people, companies, places, etc. But how good are those recognizers when it comes to a new company that comes onto the scene? Can you adjust and tweak the provided entity recognition? You can with Salience: user customizations are a matter of text file editing, from simple authority lists to complex patterns for advanced entity recognition. With Salience 4.1 we’re taking it one step further with the introduction of the Entity Management Toolkit. This new tool allows you to train an entity recognition model for the types of entities that are important to you. Jeff’s favorite example is a disease recognizer: annotate six documents containing mentions of various diseases, and use the resulting model in your Salience environment to recognize disease mentions in other documents, without authority lists.
  5. Document-level and entity-level sentiment: Many people consider entity extraction to be enough in the world of text analytics. We disagree; we consider sentiment analysis a must-have feature for text analytics vendors. Salience provides both document-level and entity-level sentiment, and uses a grammatical parse to determine which bits of the document are talking about which entity. We allow customers to customize the sentiment by applying their domain-specific knowledge to the data they are working with, and going forward we will be extending our Sentiment Toolkit application to generate new sentiment models that exist alongside our traditional methodology.
  6. Ease of integration: There are a number of technology solutions, not just in the text analytics space, which are provided via web services. The introduction of web services was a boon in that it allowed access to functionality without needing to embed a vendor’s library into your product. The downside? Network latency in performance-intensive applications. Need a web service for a distributed application? One can be written around the Salience C API with the development technology of your choosing. To support this we ship a fully functional C API, which is then wrapped by native code for the other main development languages. So we have wrappers for Python, PHP, Ruby, Java and .NET that support all the same functionality the C API does. We ship the source code for those wrappers as well, so if you want to add your own functions you can.
  7. Documentation and SDK samples: All of our software and its associated APIs are documented on a public wiki available at http://dev.lexalytics.com/wiki. Being a wiki enables us to make very quick edits and to give customers access if required, but it also puts the onus on us to make sure the wiki is always up to date and structured in such a way that customers can easily find the piece of information they are looking for. As stated above, we also ship the source code for all of our wrappers, together with source code for both of the sample applications that we install in the package.
  8. Reporting problems: We maintain a public bug database at http://dev.lexalytics.com/bugs/, which enables customers to see if an issue they are having is happening to other people as well. All of our internal services staff use this bug reporting mechanism too; transparency is something that is very important to us. The dev team can also be contacted directly via a diverse set of technologies including email, Twitter and Skype - oh, and this thing called a telephone as well :-)
  9. Foreign language support: Currently, Salience is English-only, but it is on our internal product roadmap to expand this during the year. French and Spanish will be the thrust of our initial efforts, with more to be added as and when there is demand for them. It’s our intention to support our full range of capabilities, including entity-level sentiment, in these languages.
  10. Openness and transparency: One of the things Lexalytics has always stayed away from is being a “black box” type of solution. The solutions that customers are trying to build are complex beasts, and giving them a one-size-fits-all black box that solves the problem they have today but needs a completely new black box for the problem that occurs tomorrow doesn’t, IMO, fly in the current marketplace. To this end we provide almost endless customization options, from changing fundamentals like tokenization to adding an extra entity or a new sentiment-bearing phrase. On the technical side, we are more than happy to tell you how we do something, from where we use lexical chain techniques to what our grammatical parsers are actually doing to your document. We even supply a freely available trial application (you can download it from our website) so you can see that we can actually do what we say we can do.
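To make the document-level vs. entity-level distinction from item 5 above concrete, here is a deliberately crude Python sketch (a bag-of-words toy, not the grammatical parse Salience actually uses): document-level sentiment averages every sentiment word in the text, while entity-level sentiment only credits words in the same sentence as the entity.

```python
# Tiny hypothetical sentiment lexicon for illustration only.
SENTIMENT = {"great": 1, "love": 1, "terrible": -1, "awful": -1}

def doc_sentiment(text):
    """Document-level score: average of all sentiment words in the text."""
    words = text.lower().replace(".", " ").split()
    scores = [SENTIMENT[w] for w in words if w in SENTIMENT]
    return sum(scores) / len(scores) if scores else 0.0

def entity_sentiment(text, entity):
    """Entity-level score: only sentiment words in sentences that
    mention the entity - a crude stand-in for grammatical assignment."""
    score = 0
    for sentence in text.split("."):
        if entity.lower() in sentence.lower():
            score += sum(SENTIMENT.get(w, 0) for w in sentence.lower().split())
    return score
```

On a text like “The Acme phone is great. The Zenith phone is awful.”, the document-level score washes out to neutral, while the entity-level scores correctly separate the positive Acme mention from the negative Zenith one - which is exactly why entity-level sentiment matters.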

Conclusion

So how did we do? Well, we came up with the questions (though I do feel they are a good representative set), so naturally we feel we’re pretty solid in these areas. There is always room for improvement, however, and on the short-term development plan we have items such as more accurate entity extraction out of the box, better sentiment analysis of short text, more text analytics features and faster processing. Finally, we need to tag some other vendors for this, so let’s go with, say, Leximancer, Crimson Hexagon, Clarabridge, Attensity and Jodange.

Submitted by Christine Sierra on Thu, 2009-03-19 04:00

I’m often asked how someone would use our text analytics and sentiment software within their organization. Most inquiries come because a person knows that analytics on unstructured text is important, but they aren’t quite sure how to work it into their business in order to get meaningful results. Here are a few scenarios to show you how our software is used by our customers:

Text Analytics - Enterprise Search
Lexalytics Customer Case: Endeca
Situation: You have streams of information coming into your company and it all looks the same. Sometimes you know what to search on, but other times you need to find the hidden information in all that unstructured content. How do you improve your enterprise search capabilities to get even better results?
Solution: Integrating text analytics into the enterprise search application will allow you to find information you didn’t even know existed, because it can extract entities like people, places, companies, products and relationships - and that allows you to access information without necessarily knowing what question to ask.

Reputation Management - Market Intelligence
Lexalytics Customer Case: Cymfony
Situation: You know people are talking about your company, your products and brands, yet you don’t have the time or resources to read a million blogs a day. How do you discover the sentiment contained within all the information out in the blogosphere?
Solution: Lexalytics has spent years refining and improving the sentiment analysis software used by many of the vendors offering reputation management today. By analyzing sentiment at the entity level, and not just the document level, we provide more accurate results from your data.

Classification - Taxonomy
Lexalytics Customer Case: SmartBrief
Situation: You know you have streams of content flowing into your company, but you can’t find an easy way to manage all the various buckets of information. How do you slice and dice the content without taking months to set up a new taxonomy?
Solution: The key to our classification solution is the ability to easily edit the taxonomy to your business needs. We don’t believe one method of classification is the end-all, be-all of a solution, so we allow users to select from a variety of methods.

As John Harvey at KMWorld recently pointed out, there may be several reasons why text analytics is important to your company, but I wanted to share the three that we see the most in our business.