LexaBlog

Our Sentiment about Text Analytics and Social Media

Submitted by Christine Sierra on Mon, 2009-05-11 04:00

There is a great blog post on The four stages of the average Twitter user by Jason Hiner. What I like about it is the idea that new Twitter users inevitably hate Twitter, until they love it. Isn’t that somewhat true of all the vehicles used to create content? I hate to date myself, but I remember tapping away at my electric typewriter trying to finish a 12-page report for school, thinking how lucky I was that I didn’t have to handwrite it. But before I took typing lessons in high school, handwriting with pen on paper was the only way I could imagine to share my thoughts. Being introduced to my first data processing software nearly sent me into a rage, yet after some time and patience I realized that little flashing pipe on the screen was my friend, not my enemy. And then being able to attach and send that masterpiece through email was a godsend.

Imagine if all the data that was created was just ignored simply because we didn’t immediately buy into the method by which it was communicated. When I read that people believe Twitter offers no useful content, I tend to disagree, because while the tweets themselves are short and sometimes uninteresting, the information that is attached via a link or through a series of tweets back and forth can be quite valuable. Millions of people are taking the time to think, create and share information and ideas. What makes that any less valuable than a handwritten letter to a business as a way to condemn or to congratulate them on a product?

I remember sitting in a conference room in recent years (not at Lexalytics, mind you) listening to the argument that blogs were just the crazy opinions of the people on the fringe. No one read them. No one cared. I also remember shaking my head in disagreement because I have always thought of the “what,” not the “how,” as being important. But regardless of whether you use or like the way in which the information is disseminated, it would be wise to collect, analyze and pay attention to all the innovative content being created and shared…whether it’s scanned on paper and stored electronically, or passed through a tweet on Twitter. Value can be found in lots of places, across lots of channels; you just need to find a way to separate the noise from the nuggets that matter to you or your business.

Submitted by Jeff Catlin on Mon, 2009-04-27 04:00

This will be the final piece on the basics of Text Analytics. I’ve covered the basics of categorization/classification and sentiment analysis, and finally I’ll spend some time on entity extraction.

As I posted in Part 2, entity extraction is simply the process of extracting well understood types of proper nouns (People, Companies, Places, for example) from a block of text and labeling them with their appropriate type (John Smith as a person, for example). But what makes entity extraction useful is not necessarily the “what” that is extracted, but the way you then take that information and work to create new libraries of information, or how you append the information to create a better search solution or application. So, think of it this way. Within a document is a bunch of text that can be parsed based on grammar, so you have nouns, verbs, adjectives, pronouns, etc.

Without going too deep into the process of parsing out this information, text analytics is able to identify pieces of the text that you may not have known existed within the documents (or blogs or whatever your source may be). By recognizing people, companies, places and even themes, you are able to find the value within the information without having to know what you were looking for in the first place.
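
To make the "extract and label" idea concrete, here is a minimal sketch in Python. It is only an illustration: it uses a tiny hand-made lookup table (the `GAZETTEER` dictionary and the sample sentence are made up for this example), whereas a real engine relies on grammatical parsing and statistical models rather than a fixed list.

```python
import re

# Hypothetical gazetteer: known names mapped to entity types.
# This is an assumption for illustration, not how a production engine works.
GAZETTEER = {
    "John Smith": "Person",
    "Lexalytics": "Company",
    "Boston": "Place",
}

def extract_entities(text):
    """Return (mention, type, offset) tuples in order of appearance."""
    found = []
    for name, entity_type in GAZETTEER.items():
        # Whole-word match so "Boston" does not match inside a longer token.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.group(0), entity_type, match.start()))
    return sorted(found, key=lambda item: item[2])

sample = "John Smith joined Lexalytics after moving to Boston."
entities = extract_entities(sample)
for mention, entity_type, offset in entities:
    print(f"{mention} -> {entity_type} (offset {offset})")
```

The labeled tuples are the raw material; the value comes from what you build on top of them, such as search facets or suggestions.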

We are working with The Financial Times Group on their Newssift site, which is the PERFECT example of how entity extraction can complement a service or application. We provide them with thematic extraction based on the corpus of data they have flowing into their system. So, in their case, when you start with a simple search based on keywords, you get a certain number of results for those keywords. What you also get are suggestions of ways to dig deeper into the content based on themes and other extracted information.

So the idea that text analytics can pull out that information is nice, but what you do with that information is what makes it really valuable. In Newssift’s case, they make a news site even more useful by offering up suggestions beyond the original search criteria. There is the power. As entity extraction techniques mature and improve, we should expect to see more creative and unique ways to analyze and process the data. Micro-blogging and messaging systems are changing the way we think about text and that will prove to be an influential factor in the text analytics space.

Submitted by Christine Sierra on Tue, 2009-04-21 04:00

Jeff recently did a post about the release of Salience 4.1 and the Entity Management Toolkit. At that point it was still in its early stages and hadn’t been released. Now that we have the feature available, I thought I’d do a quick outline of the benefits.

What do you do if you need to search information from within your corpus of data, but your data is unique and not driven by the generic information used for every domain? If you are integrating enterprise search into your organization, then here is some information about how to enhance the benefits of your search engine.

Enterprise search has become one of the most critical applications found within larger companies over recent years. This trend will continue to accelerate, but we’ve discovered that applications have to be able to provide corporate users with results that are tailored to their specific needs, not just based on the general recognizers available out of the box. Utilizing entity recognizers that “understand” the items that matter to a given business will allow search and discovery applications to expose valuable content to their users, and to do so using the correct vocabulary.

Beyond the obvious technical benefits of customer-driven recognizers, the financial benefits to organizations are compelling as well. Data preparation and content mapping typically represent the largest part of an enterprise search implementation, and the money for all this work typically goes to the search vendor. Allowing users to build their own entity recognizers will reduce the amount of money spent preparing content for applications, and will give the users more control over how their applications are presented to their users or customers.

The best way to understand user-defined entities is to walk through the build-out of a user-defined recognizer. Let’s start by looking at the entity processing of a document with standard recognizers in place. For this example, we’ll work with the premise of medical research documents, and build out a “Disease” recognizer. If we were to look at a medical document with only the standard recognizers, only people, places and similar items would be discovered. We wouldn’t find the references to the possible diseases in the document such as:

    • Lung Cancer (Adenocarcinoma)
    • Leukemia
    • Lymphoma

If you were in the pharmaceutical or medical industry, having to create and train all the possible references to diseases would be the heaviest lift in implementing an enterprise search solution. New technology in Lexalytics’ Entity Management Toolkit can help users build out a simple “Disease” recognizer from a fairly small set of medical research documents (typically 100+).

A human would begin the process by identifying instances of “breast cancer” and possibly generic “cancer” terms within a document. Once the user had marked up most of the instances of “breast cancer” and “cancer” as diseases, they would then process the document through the Entity Management Toolkit. The system would then highlight instances it found to indicate possible disease references that the user has not yet accepted. In some cases, the user may decline the machine’s suggestion, for example “Breast Cancer Awareness Tea Party”, because this is an event, not a disease. Noting that this reference is not a disease is important because it’s an additional piece of evidence that the Maximum Entropy model will use when it builds the disease recognizer.

Once a document is marked up by the user, its state is changed to “Ready”, and then additional documents are marked up until enough have been marked to build a model. The user would then apply the Disease Recognizer model and begin to process the rest of the documents for entity recognition beyond people, places or companies. One important point about the marked-up documents is that none of them mentioned lung cancer by name; a user would rely on the model to discern that lung cancer is a disease because of the way it’s described within the document.

As stated earlier, users have had to rely on the vendors to provide domain-specific recognizers up until now, but the vendors are not the content and domain experts. The value of this tool is in empowering content owners to expose the value found within their actual content, particularly in market and industry segments. We believe that publishers will be one of the first groups to see the value of this tool in differentiating their content from freely available content found on the web.
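
For readers who want a feel for how accepted and declined mark-up becomes evidence for a model, here is a rough sketch in Python. It is not the Entity Management Toolkit's implementation. It assumes a yes/no decision (disease or not), where a maximum entropy model is equivalent to logistic regression, and it uses made-up context features and made-up training snippets (`MARKUP`).

```python
import math

def features(left_context, phrase, right_context):
    """Simple context features around a candidate phrase (an assumed feature set)."""
    feats = {"bias": 1.0, "head=" + phrase.lower().split()[-1]: 1.0}
    for word in left_context.lower().split()[-2:]:
        feats["left=" + word] = 1.0
    for word in right_context.lower().split()[:2]:
        feats["right=" + word] = 1.0
    return feats

def predict(weights, feats):
    """Probability that the candidate is a disease under the current weights."""
    score = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, epochs=500, learning_rate=0.2):
    """Batch gradient ascent on the log-likelihood of the mark-up decisions."""
    weights = {}
    for _ in range(epochs):
        gradient = {}
        for feats, label in examples:
            error = label - predict(weights, feats)
            for name, value in feats.items():
                gradient[name] = gradient.get(name, 0.0) + error * value
        for name, delta in gradient.items():
            weights[name] = weights.get(name, 0.0) + learning_rate * delta
    return weights

# Hypothetical mark-up: 1 = user accepted as a disease, 0 = user declined.
MARKUP = [
    (("diagnosed with", "breast cancer", "in 2008"), 1),
    (("treatment for", "cancer", "patients"), 1),
    (("suffering from", "breast cancer", "and"), 1),
    (("attended the", "Breast Cancer", "Awareness Tea Party"), 0),
    (("the", "Cancer", "Awareness walk"), 0),
]
examples = [(features(*context), label) for context, label in MARKUP]
weights = train(examples)

# "lung cancer" never appears in the mark-up; only its context is familiar.
unseen = features("diagnosed with", "lung cancer", "last year")
event = features("the", "Lung Cancer", "Awareness Tea Party")
print("lung cancer as disease:", round(predict(weights, unseen), 2))
print("awareness event as disease:", round(predict(weights, event), 2))
```

The declined "Tea Party" example matters here for the same reason it matters in the toolkit: without negative evidence, every phrase ending in "cancer" would look like a disease.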

Submitted by Jeff Catlin on Mon, 2009-04-20 04:00

In my last post about Text Analytics, I described the classification- and concept-oriented pieces of Text Analytics. In this post, I’m going to outline the pieces that most people think of when they think about Text Analytics: Entity Extraction and Sentiment Analysis.

Let’s start with the piece that Lexalytics is best known for - Sentiment. The measurement of sentiment in content seems to be all the rage these days, but in spite of this, very few of our prospects really understand what the technology will and won’t do for them. So, I’ll start at the beginning.

What does it mean to measure sentiment?
That depends entirely on the intentions of the user and the content being measured. If you’re looking at review data (let’s say hotel reviews in this case), then you’re probably thinking about the overall sentiment of each review for the hotel being discussed, which would be an example of document sentiment. If, however, you’re reading a publication like Consumer Reports, then you’re probably thinking more about how the different hotels stack up against one another. In this case, the overall document sentiment is pretty much useless. The document will have some good and some bad content. In fact, what the reader cares about in this kind of content is the tone for each specific hotel that’s being described in a sub-section of the document. This is known as entity-level sentiment. Lexalytics’ sentiment analysis can provide both document and entity-level sentiment, so you’re covered in either scenario. 
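
The difference between the two levels is easy to see in a small Python sketch. This is a toy under stated assumptions: the word list (`LEXICON`), the hotel names and the review text are all made up, and scoring a hotel by the sentences that mention it is a crude stand-in for what a real engine does.

```python
import re

# Hypothetical tone lexicon, far smaller than anything a real engine would use.
LEXICON = {"friendly": 1, "spotless": 1, "great": 1,
           "noisy": -1, "rude": -1, "dirty": -1}

def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_score(sentence):
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def document_sentiment(text):
    """One score for the whole document."""
    return sum(sentence_score(s) for s in split_sentences(text))

def entity_sentiment(text, entities):
    """One score per entity, using only the sentences that mention it."""
    scores = {entity: 0 for entity in entities}
    for sentence in split_sentences(text):
        for entity in entities:
            if entity.lower() in sentence.lower():
                scores[entity] += sentence_score(sentence)
    return scores

review = ("The Harbor Inn had friendly staff and spotless rooms. "
          "The Park Lodge was noisy, and the front desk was rude.")

print("document:", document_sentiment(review))
print("entities:", entity_sentiment(review, ["Harbor Inn", "Park Lodge"]))
```

In this example the good and bad remarks cancel out at the document level, while the per-hotel scores still separate the two properties, which is exactly the comparison-article scenario described above.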

What really matters in sentiment analysis?
The overall accuracy within the application is important. An example where a technology-based solution really shines is financial services, where the trends across a collection of stories are what users are most interested in. Financial Services is definitely one of the up-and-coming industrial uses of sentiment because the technology tends to perform better than humans in processing collections of content. Reputation Management is another industry where automated sentiment analysis shines. It could be said that automated sentiment analysis was born in this space, invented because of the amount of time people spent hand-measuring the tone around products and brands. While Reputation Management is currently the biggest market for the technology, it’s probably not the best example of accuracy. It’s hard enough to get humans to agree with humans on the tone for a specific story, but to get people to agree with a computer is even harder. I bring up these two contrasting uses because it’s important for people to think about their specific needs and requirements before they jump into using any vendor’s solution. Make sure the solution you’re looking at is well-suited for the problem you’re trying to solve.

What is entity extraction?
While sentiment scoring is the “hot” topic in our space these days, entity extraction is sort of the meat-and-potatoes feature that every text analytics vendor needs to provide. Entity extraction is simply the process of extracting well understood types of proper nouns (People, Companies, Places, for example) from a block of text and labeling them with their appropriate type (John Smith as a person, for example). What makes this topic more interesting these days is that a number of vendors, Lexalytics included, have significantly improved their entity recognition technology in recent months to utilize techniques like “grammatical parsing” and “Max Ent” models to do a better job of extracting entities. I did a complete post a little over a month ago about our new Entity Management Toolkit, which explains how users can now build their own entity recognizers. We aren’t the only ones pushing hard on entity extraction; other companies are working on this as well, especially on grammatical parsing using anaphora resolution, where “John Smith” and “He” are recognized as the same entity.

I hope this quick overview provides you with a bit of a background on the basic technology and uses for Text Analytics. I will, from time to time, write a new post on some of the up-and-coming additions to the space, like relationship extraction, fact extraction and short document (think Twitter) processing.

Submitted by Christine Sierra on Fri, 2009-04-17 04:00

I’ve been thinking lately about how tweeting has changed my work day routine. By some accounts, I’m probably a late starter to the world of Twitter. To others, I must have a lot of time on my hands if I can spend the day on that social network. And yet many still think tweeting is only done by birds. It’s amazing how much has changed in the last 6 months. When I joined Lexalytics two years ago, there wasn’t a marketing resource in the company. They had been working feverishly to enhance the technology. I was employee number 7 or 8 and it was a challenge to identify the core market for our software. Where were we going to fit and how was I going to present our software to the world? About the same time, the social media sites and social networks were finally emerging as reliable sources of information and opinions. What used to be a scratch-and-peck search process to uncover valuable market news and research was now more readily available through alerts and RSS feeds. And then there was talk of Twitter. What the *tweet* was it all about? It wasn’t until the Boston New Marketing Summit in 08 - now known as the Inbound Marketing Summit - when fascinating communicators like Laura Fitton (@pistachio) and Chris Brogan (@chrisbrogan) simplified the tweeting process, that I started to get more serious about it. Mix in a little push from my PR firm’s social media champion Joel Richman (@xylem) and suddenly I was a twitter-er. I had actually been dabbling in Twitter with a network of TwitterMoms so I had some idea on the social aspect, but not the power of the business aspect. Now that I’ve been tweeting a bit longer, and do still wonder “Who really cares what I tweet?”, I thought I would at least outline how it has changed my everyday work day and the value I see out of joining the world of Twitter. Believe it or not, I do see an efficiency factor to it all. Sort of like the introduction of call waiting to the phone. Remember? 
Before it existed we’d get busy signals and have to try calling back, over and over again. How did we ever get anything done at work? I also discovered the following along the way:

    1. Community Mind Share: I’m still a firm believer that face-to-face interactions are irreplaceable, but it’s hard to ignore the power of a platform where you can jump into a community and engage with people through a variety of opinions, reactions, questions, answers and statements. I don’t know if I believe having a million followers makes you a better member on Twitter, but I do know that interacting with your community, however big or small, opens up an incredible mind share. 140 characters can lead you to a world of knowledge - either in text, video or pictures. Join in!! All it took was following two or three people on Twitter, and I would scan their following lists and grow from there. Pretty simple.
    2. Networking Opportunities: Where are you going to be tonight? Unless you were on the VIP email distribution list, it was often hard to know where folks were gathering, and what they may be discussing. Thanks to TweetUps and other networking announcements, it’s easier and more efficient to plan your calendar and join in that face-to-face mind share that still exists. In our area, @BostonTweetUp has been a fantastic source. But sometimes it’s just one person, like @BobbieC tweeting about her Mass Innovation Nights (@MassInno) that can get you hooked.
    3. Faster Research and News: Breaking News! While some items must still be taken with a grain of salt, most of the published blog posts and news stories linked through the 140 character tweets offer insightful information and resources. And, they are arriving at my desktop faster than ever before. Want to get quick tips to Tweeting? You got it. How about new product releases or beta sites in your industry? Done. Want to interact with journalists or breaking news agencies? You can do that, too. Traditional news sources such as CNN (in case you hadn’t heard) are on Twitter @cnnbrk and more innovative sources like @mashable are there, too. Take your pick.
    4. Lead Generation: Seriously. You can find leads on Twitter. I swear. And not by being a spam-tweeter shouting about your super-biotic health pills, but by actually engaging in conversations. About anything. Sometimes that random common thread turns into a potential lead. ROI has been a big topic, but I think if companies would sit back and think about how long it used to take for the “perfect marketing collateral” to hit a prospect’s mailbox, snail or electronic, and then be read, replied to (maybe) and acted on - well, you get the point. What do you lose by talking - about work, weather, traffic or food (Jeff Cutler (@jeffcutler) can attest to the interest in a picture of a breakfast plate)? You just never know who’s listening (or watching)!
    5. Honest Inspiration: The world is a pretty gloomy place these days. Sorry to be the one to tell you. Bad news swirls the local and network news every night. I’ve found on Twitter I’m inspired by stories of hope, charity, laughter and encouragement. I have never logged off thinking “Wow. Twitter was a bummer today.” And I believe having a positive outlook, all around, can only help motivate you in the workplace.

I gain knowledge, insight and inspiration from many different tweeters every day, all of whom I’d love to send your way. But instead, just check out your favorite person’s following list and discover some on your own. And while I try to focus on learning more about work-related information during the 9 to 5, there’s always a little time to play. I mean it is only 140 characters - and I do work from home - so it’s my virtual water cooler to talk about American Idol or the funny at-home-mishap. Now, time to get back to tweeting! And if you’re curious, you can find me at @christinelexa or our company at @lexalytics

Submitted by Jeff Catlin on Mon, 2009-04-13 04:00

I’m sure this seems a strange title for a post, given that Lexalytics and this blog have been around for quite a while now, but I’ve been noticing a shift in the questions I get about our technology. Until recently most of the folks that we talked to had a very good idea of what a text analytics engine provided, but as Text Analytics has gone more mainstream I’ve seen an uptick in folks needing a briefing on the basic definitions of the various features like classification, categorization, entity recognition, themes, concepts, sentiment and relationships. So, I’m going to use a couple of blog posts to provide a basic overview of these concepts. I’m going to start with Classification, Categorization, Themes and Concepts, and then use my next post to review Entity Extraction, Sentiment Scoring and Relationships. Fair warning… I may raise the blood pressure of a few librarians with my rather loose definitions, but I think people need general use definitions to help them map these concepts to their day to day business needs. So let me start by annoying the librarians right out of the blocks… For most of the world Categorization and Classification are the same thing. I’m going to stick with the term classification, but it could just as easily be categorization. The simple definition is:

            • Classification is basically a group of technologies designed to fill in the missing piece of the following: “This document is about BLANK”.

Of course it’s rarely as easy as that, but the basic idea of Classification/Categorization is to put a given document into a subject-oriented bucket. These buckets may be organized as a flat set of topics, or as a more complex hierarchy of taxonomy nodes, like:

    • Sports
      • Baseball
      • Football
      • Basketball
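
As a rough illustration of filling in "This document is about BLANK", here is a small Python sketch that drops a document into one node of a Sports hierarchy like the one above. The keyword sets and the `Sports/...` path notation are assumptions made for this example; real classifiers are typically trained on labeled documents rather than driven by hand-picked keywords.

```python
import re

# Hypothetical taxonomy: each node is a path, mapped to keywords that suggest it.
TAXONOMY = {
    "Sports/Baseball": {"inning", "pitcher", "homer"},
    "Sports/Football": {"touchdown", "quarterback", "punt"},
    "Sports/Basketball": {"rebound", "dunk", "layup"},
}

def classify(text):
    """Return the taxonomy node with the most keyword hits, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {node: len(words & keywords) for node, keywords in TAXONOMY.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

doc = "The quarterback threw a late touchdown to win the game."
print("This document is about", classify(doc))
```

A flat set of topics is the same idea with single-level names; the hierarchy only changes how the buckets are organized and browsed.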

Themes or Concepts are sometimes confused with Classification/Categorization but are an entirely different animal. Themes are noun phrases that occur within a document. As an example consider a document about the current state of the economy. The document might be classified as “financial results”, but the themes or concepts from the document would include things like:

    • banking crisis
    • toxic assets
    • stimulus plan
    • bad assets

Classification/categorization and concepts/themes are two different mechanisms to help users understand what a document is about. A great way to examine and understand these two text analytics features is to try them out on a site like Newssift, where you can navigate financial news content via both categories and concepts.

Submitted by Christine Sierra on Fri, 2009-04-10 04:00

Text Analytics Leader Lexalytics Releases Salience 4.1 Offering Entity Management Toolkit for Better Entity Detection

Company’s release provides support for user-generated entity extraction models that will save businesses money and offer more control over presentation of information.

Amherst, MA April 6, 2009 — Lexalytics, Inc. (www.lexalytics.com), a software and services company specializing in text and sentiment analysis, announced today it has released Salience 4.1. This latest version of the popular Salience Engine includes a powerful Entity Management Toolkit used to build and train entity recognizers by understanding the user’s data. This feature will reduce the amount of money spent preparing content for applications, and will give the users more control over how their applications are presented. Currently, Salience technology is widely used by professionals in the marketing, public relations, enterprise search and financial services industries to manage brand reputation and improve the business value of information.

Salience 4.1 will solve entity extraction problems for companies that process specialized content. The Entity Management Toolkit easily allows users to mark up documents specifying entities that are important to them, and the system will then make suggestions on other entities. Once enough documents have been marked up to build a model, the system is ready to process the entire corpus of data. The user doesn’t need to write a single line of code for this process to work.

“With the release of Salience 4.1 and the Entity Management Toolkit, we have improved the way companies create their own intellectual property,” said Jeff Catlin, CEO, Lexalytics. “Until now, users have had to rely on the vendors to provide recognizers, but the vendors are not the content and domain experts. The key to this release is in empowering content owners to expose the value of their own content, particularly in market and industry segments.”

Salience 4.1 also includes separation of user customizations from default data files for easier management, and performance improvements to document processing time. A free trial version of the Lexalytics text analytics software is available at www.lexalytics.com.

Submitted by Christine Sierra on Thu, 2009-04-09 04:00

It has been a busy couple of weeks as I have been emerging from hibernation and finding my way back out into the wild. There seem to have been a variety of events happening in and around Boston signaling the renewed spirit that comes with spring. It has been refreshing. We exhibited at the kickoff event for Mass Innovation Nights last night in Waltham. It was a pleasant surprise to see the number of people milling around the Charles River Museum of Industry. It was especially fun for me to return to my hometown and see what has changed and what is the same. Walking down Moody Street was pretty surreal. All this activity also got me thinking about my own social habits and what makes me most comfortable when I network. I attended the MarketingProfs Digital Marketing World last week, which was literally a tradeshow online, complete with exhibit booths and background chatter. And while I enjoyed the sessions and thought they did an excellent job pulling it together, I found it incredibly rude of the attendees in the virtual auditoriums to use their “semi-anonymity” to make fun of and mock the presenters. I highly doubt they would be doing the same thing if they were sitting in a room listening to that person, so why did they feel they could do it virtually? Has the social network aspect of our society made it acceptable to publish unflattering or disrespectful remarks because we can stay behind a wall where no one really gets to know us? Last night was the complete opposite. This event was marketed solely through social media. We blogged, tweeted and retweeted to get supporters. I know our company was excited to see the outcome. What I found even more exciting was that I was going to have the chance to shake the hands of people who, up to this point, had been a 1x1 photo on Twitter. 
I don’t live close to Boston proper so I have missed many of the @bostontweetup events in our area, but this night certainly proved that you can mix social media outreach with real-life interactions and find success. I’ll probably continue to attend the virtual events as I think they have some benefits, particularly a clear cost benefit of not having to travel the country to attend, but I think those that attend should keep in mind that if you wouldn’t say it to someone in person, you probably shouldn’t type it about them online. Very distasteful. Check out other reactions to Mass Innovation Nights from TweetWorks, The Lost Jacket or search #MIN on Twitter.

Submitted by Christine Sierra on Fri, 2009-04-03 04:00

On Wednesday, April 8, Lexalytics is pleased to be taking part in the launch of Mass Innovation Nights. We will be showcasing our latest release, Salience 4.1, and answering any questions you may have about how text or sentiment analysis can help your organization or business. Mass Innovation Nights is a series of events designed to harness the power of social media to showcase Massachusetts-based innovation, with a monthly event at the Charles River Museum of Industry and Innovation, 154 Moody Street in Waltham, MA (www.crmi.org). We will be joined by these other Charter Members:

    • IBM/Lotus
    • Invention Machine
    • LuckyCal
    • Mass High Tech
    • Noteflight
    • Practical Solar
    • Tweetworks
    • WiBlocks
    • Xelago

We hope to meet you there!

Submitted by Mike Marshall on Thu, 2009-04-02 04:00

As part of our ongoing plan to increase our coverage for support requests, I’ve recently had to take the plunge into Linux. I’ve always been a Windows guy and that’s where I’m most comfortable (Kevin, on the other hand, is pretty much the reverse, living in his SSH sessions), but needs must, so Linux it is. Not wanting to go the whole hog at this point, I decided to just create a virtual machine using Virtual PC, chose Ubuntu as my distribution and (foolishly) thought it would be an afternoon’s worth of work. Silly me! Virtual PC, whilst fantastic with Windows installs, doesn’t seem to play as well with Linux, and to start with I couldn’t even get the installer from the live ISO to run. However, Google as always came to the rescue and I found a fantastic article on TechRepublic that showed me what command line params I needed to add to get the installer up and running, and an hour or so later I had a fully working Ubuntu VM, albeit at some poor graphics resolution. Of course that was only the first step; I then needed to get SVN onto the machine (managed that first time), download all of our source and then build it. This is where I hit my next major wall. I grabbed the Boost package we use (1.34.1), went to compile it and hit error after error. Thanks have to go to Kevin for sitting on the end of Skype helping me diagnose them. They included:

    • I had gcc installed by default, but needed to install g++ as well
    • Missing development libraries for python, bzip and zlib
    • Problems with Boost 1.34.1 and gcc 4.3; this required applying a patch file we found on the Boost bug db (had to install patch as well) and then going into various regex headers and adding #includes

However, after several hours’ slog I got Boost to build, created the necessary symbolic links and was ready to build Salience. “Sure, it will be easy,” says Kevin, mumbling that he hasn’t actually built it with gcc 4.3 before. Nope, not easy. From where I have got to at the moment, it looks like headers which are included by default in earlier versions of gcc are no longer included, so it’s going to be an edit job to get it up and compiling. Don’t know about running yet, though my task tomorrow will be to download our new tarball and see if I can get that to work just following the instructions in the wiki. One last thing: after doing my install, Ubuntu wanted to install 273 patches, which I found very amusing for some reason. More later.