Jeff Catlin provided ZDNet's Jennifer Leggio a guest post on why companies need to be thinking about Twitter speak and how it can be analyzed. Check it out here.
In a former life, I worked for Thomson Financial in Boston - First Call, to be exact. Over the years there, I learned to enjoy the ebb and flow of earnings season - watching the markets move based on quantitative analysis. There was little analysis of research and text back then. Boy, things have changed.

Today, Lexalytics is working with the grown-up version of Thomson Financial, now known as Thomson Reuters. Our software is used in their algorithmic trading platform and incorporates sentiment into the trading process. The sentiment derived from the aggregated content flowing through a trading system has proven extremely useful in that environment. We also recently announced our relationship with First Coverage, which has a web service called "The Community". It is essentially a collaboration platform for the buy-side and sell-side that helps money managers filter out the noise and focus on the data that matter most to their holdings. It's very cool, especially if you're a quant type, but more importantly it reinforces the shift from strictly numbers-driven analysis and trading to a healthy mix of numbers and text analysis.

We've known for a while that the information stored in unstructured data can be helpful in monitoring corporate brands and reputation. In the PR/marketing world there is pressure to be "exact" about sentiment for every document, but as Jeff noted last week in a Q&A session with PRWeek: "My belief is that over time the PR industry will begin to look at the aggregate statistics and when they do so, automated sentiment will be a perfect fit because it's consistent and accurate across large blocks of content." In the world of market trading and research, text and sentiment analysis technology provides the insight and measurement capabilities financial services organizations need to discover, react, and respond to market opinions. This is just one of the expanding markets we've seen embrace text analytics in the past year.
Recently I had one of those unfortunate circumstances where a customer of ours was unhappy with the sentiment scores produced by Salience and contacted us to help tune the engine. The document set was a couple hundred articles mentioning a particular company, and about 25% of the mentions scored by Salience as negative were rated by humans as mostly neutral in tone.
As I dug into the articles, it became clear that most of the places where Salience and the humans disagreed were articles in which the company in question was co-mentioned with lots of disappointing economic news - "credit crisis", "in a poor economic climate" and the like were frequent phrases. In addition, most of those articles had only one or two mentions of the company. Although there was often little grammatical connection between the negative phrases and the company, with no positive elements to offset them, the company score wound up weakly negative.
It seemed clear that this company was the victim of guilt by association. That was the "correct" result from an engine perspective, but not from a human one. We decided to introduce a scoring step that took into account the number of mentions and the number of scoring phrases - an output from Salience called "evidence." Essentially, the client chose to score as neutral any article that contained only a single mention of the company and only a single scoring phrase, regardless of what Salience returned as the score.
The idea is that, at bottom, machine scoring of tone is a statistical guess, and when there is little to base the guess on, you're likely to be wrong. Humans react poorly to something scored as negative in these cases, but are more accepting of a neutral score for these "passing mentions", even if they themselves might score it as positive or negative. Another way of looking at it is that humans accept that sentiment is a continuum, and what they really don't like is for the machine to land at the polar opposite.
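To make the rule concrete, here is a minimal sketch of that post-processing step in Python. The input names (mentions, evidence, score) are hypothetical stand-ins for illustration, not the actual Salience API:

```python
# A minimal sketch of the neutral-override rule described above; the
# inputs are hypothetical stand-ins, not the actual Salience API.

def adjust_score(mentions: int, evidence: int, score: float) -> float:
    """Report neutral when there is too little to base the guess on.

    An article with only a single mention of the company and a single
    scoring phrase is a "passing mention": better to call it neutral
    than to risk a polar-opposite call.
    """
    if mentions <= 1 and evidence <= 1:
        return 0.0  # neutral
    return score

# One passing mention amid gloomy economic news: override to neutral.
print(adjust_score(mentions=1, evidence=1, score=-0.4))  # -> 0.0
```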
After we did that (and a few other steps similar in nature) we managed to get the document set into quite close agreement with the humans. What about other documents, though? Had we succumbed to the dreaded "overfitting" problem? To find out, we assembled another group of several hundred documents picked at random from the rest of the set, had the same group of humans score them, and ran the new algorithm against them. Agreement with the humans went up by more than 10% compared to the earlier scoring algorithm. Just as important was the near elimination of "polar opposite" scores.
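For what it's worth, the agreement check itself is easy to sketch. Here is an illustrative (not our actual) way to compute both the agreement rate and the polar-opposite rate over a set of paired labels:

```python
# Illustrative sketch of measuring human/machine agreement and the
# rate of "polar opposite" calls; not the actual evaluation code.

def agreement_stats(machine: list[str], human: list[str]) -> tuple[float, float]:
    """Return (agreement rate, polar-opposite rate) over paired labels."""
    pairs = list(zip(machine, human))
    agree = sum(m == h for m, h in pairs)
    opposite = sum({m, h} == {"positive", "negative"} for m, h in pairs)
    return agree / len(pairs), opposite / len(pairs)

machine = ["negative", "neutral", "positive", "negative"]
human   = ["neutral",  "neutral", "positive", "positive"]
print(agreement_stats(machine, human))  # (0.5, 0.25)
```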
But what about Twitter, you might ask, or other kinds of short documents where you will probably only ever get a single mention? Should we score all of those as neutral regardless? We weren't scoring Twitter and the like in this instance, but it's a valid criticism. My instinct is that with Twitter-type content the original problem would not occur often. In other words, with only 140 characters, you won't have a lot of extraneous sentiment from poor economic news cluttering up the message.
As with any software, as soon as we released Acquisition Engine 6.4 and Salience 4.1 we started looking at all the features we'd put in, and the features that were left on the editing room floor, so to speak. We've been working since the April release of AE6.4 and SE4.1 to bring those features into the product as well, and are now putting Service Pack 1 through its QA paces. Look for the Service Pack to be released on June 30.
At one of the sessions at the Text Analytics Summit 09, moderated by Katey Wood of the 451 Group, several panelists engaged in a lively discussion about accuracy in text and sentiment analysis software. The most colorful comment came from Chris Bowman, former Superintendent of the Lafourche Parish School Board, who noted, "If someone gave you an 85% chance you'd hit the lottery tonight, you'd take it." His point was well taken. There are some industries where 85% may not be good enough, but there are many more where that accuracy could significantly reduce the time and resources it takes to extract, apply sentiment to and analyze text-based content. Quantitative data is all about numbers and absolutes; qualitative data is all about words, tone, grammar and language, and we are pretty confident in our ability to extract valuable information from text. And while entity extraction and applying sentiment to those entities helps the process along, it certainly doesn't end there. Further analysis through predictive modeling systems or reporting software helps to paint the full picture. Ask us for a proof of concept - it can't hurt. And we know all data is different. If you are even *thinking* of using text analytics, you should know that there are pretty good odds you'll like the results.
As the Text Analytics Summit draws to a close, I am watching many of the familiar faces that approached our exhibit table to learn more about Lexalytics grab their last cup of coffee and snack before heading out. We were happy to host a workshop this year, introducing the beta of our LexaScope product, due out in early July. This hosted web service will serve small to mid-sized businesses looking to extract themes, people, companies and sentiment from smaller amounts of data, beginning with a plug-in we developed to process data in Excel. It was helpful to solicit feedback from the people we anticipate will try the service when it is ready for prime time, and we will take their comments and suggestions back to the development team. One-on-one feedback is still priceless. More to come when we solidify a launch date. Overall, the sessions were packed, and many of them probably could have gone longer given the number of questions from the audience. Sentiment analysis was a hot topic and there were many good case studies presented. We enjoy attending this event to see and learn from our colleagues and partners, and to hear what users believe is most important in their text analytics needs. SaaS is definitely the implementation of choice among direct end users like JetBlue, AOL and Monster.com. To see more snippets from the conference, search Twitter for #textsummit. I was able to tweet some of the sessions from the event.
I attended the Enterprise Search Summit in New York last week, and there is no debating the fact that the economy is affecting attendance at conference events. I've been to ESS a couple of other times, and the decline in attendance this year was noticeable. Still, there were some interesting talks and some noticeable trends. The push on "question answering systems" was the most striking thing I saw at the conference, including in our lunch discussions and between-session chats. I see their value in particular applications, but I don't see them as the "panacea" everyone was describing them to be. If you have content or applications that are question-based, then question answering systems are great, but they won't help much with the age-old "tell me what I need to know" kind of request. Of course I sat in on Seth Grimes's talk on sentiment analysis, and I have to give him kudos on two fronts. First, he did a nice job of explaining sentiment analysis in a clean and easy-to-understand way, and second, he got through 25+ slides in about 25 minutes. I would have bet my house that he wasn't gonna pull that one off.
There is a great blog post, "The four stages of the average Twitter user," by Jason Hiner. What I like about it is the idea that new Twitter users inevitably hate Twitter, until they love it. Isn't that somewhat true of all the vehicles we use to create content?

I hate to date myself, but I remember tapping away at my electric typewriter trying to finish a 12-page report for school, thinking how lucky I was that I didn't have to write it by hand. But before I took typing lessons in high school, handwriting with pen on paper was the only way I could imagine sharing my thoughts. Being introduced to my first word processing software nearly sent me into a rage, yet after some time and patience I realized that little flashing pipe on the screen was my friend, not my enemy. And then being able to attach and send that masterpiece through email was like a godsend.

Imagine if all the data being created were simply ignored because we didn't immediately buy into the method by which it was communicated. When I read that people believe Twitter offers no useful content, I tend to disagree. While the tweets themselves are short and concise, and sometimes uninteresting, the information attached via a link or carried through a series of tweets back and forth can be quite valuable. Millions of people are taking the time to think, create and share information and ideas. What makes that any less valuable than a handwritten letter to a business condemning or congratulating them on a product?

I remember sitting in a conference room in recent years (not at Lexalytics, mind you) listening to the argument that blogs were just the crazy opinions of people on the fringe. No one read them. No one cared. I also remember shaking my head in disagreement, because I never thought of the "how" but the "what" as being important. Regardless of whether you use or like the way the information is disseminated, it would be wise to collect, analyze and pay attention to all the innovative content being created and shared - whether it's scanned from paper and stored electronically, or passed through a tweet on Twitter. Value can be found in lots of places, across lots of channels; you just need to find a way to separate the noise from the nuggets that matter to you or your business.
This will be the final piece on the basics of Text Analytics. I've covered the basics of categorization/classification and sentiment analysis, and finally I'll spend some time on entity extraction.
As I posted in Part 2, entity extraction is simply the process of extracting well-understood types of proper nouns (people, companies, places, for example) from a block of text and labeling them with their appropriate type ("John Smith" as a person, for example). But what makes entity extraction useful is not necessarily the "what" that is extracted, but the way you then take that information and build new libraries of information, or append it to create a better search solution or application. So, think of it this way: within a document is a bunch of text that can be parsed based on grammar, so you have nouns, verbs, adjectives, pronouns, and so on.
Without going too deep into the process of parsing out this information, text analytics is able to identify pieces of the text that you may not have known existed within the documents (or blogs or whatever your source may be). By recognizing people, companies, places and even themes, you are able to find the value within the information without having to know what you were looking for in the first place.
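For readers who want to see the mechanics, here is a minimal sketch using the open-source spaCy library (not the engine we use at Lexalytics) to pull typed entities out of a sentence:

```python
# A minimal sketch of entity extraction using the open-source spaCy
# library; Salience's own API differs, this just illustrates the idea.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Smith, CEO of Acme Corp, spoke in Boston on Tuesday.")

# Each entity comes back as a span of text plus a type label, e.g.
# "John Smith -> PERSON", "Acme Corp -> ORG", "Boston -> GPE".
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```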
We are working with The Financial Times Group on their Newssift site, which is the PERFECT example of how entity extraction can complement a service or application. We provide thematic extraction across the corpus of data flowing into their system. In their case, when you start with a simple keyword search, you get a certain number of results for those keywords. What you also get are suggestions for digging deeper into the content based on themes and other extracted information.
So the idea that text analytics can pull out that information is nice, but what you do with that information is what makes it really valuable. In Newssift's case, they make a news site even more useful by offering up suggestions beyond the original search criteria, as the sketch below illustrates. Therein lies the power. As entity extraction techniques mature and improve, we should expect to see more creative and unique ways to analyze and process the data. Micro-blogging and messaging systems are changing the way we think about text, and that will prove to be an influential factor in the text analytics space.
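As a rough illustration of that pattern, the sketch below aggregates extracted entities and themes across a result set and surfaces the most frequent ones as drill-down suggestions. The extract() function here is a hypothetical stand-in for any extraction engine, not Newssift's actual implementation:

```python
# Hedged sketch of the Newssift-style pattern: run extraction over a
# keyword search's results, then suggest the most common entities and
# themes as refinements. extract() is a hypothetical stand-in.
from collections import Counter

def suggest_refinements(results: list[str], extract, top_n: int = 5) -> list[str]:
    """Count entities/themes across result documents and return the
    most common ones as drill-down suggestions."""
    counts = Counter()
    for text in results:
        counts.update(extract(text))  # extract() yields entity/theme strings
    return [item for item, _ in counts.most_common(top_n)]
```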
(Jeff recently did a post about the release of Salience 4.1 and the entity management toolkit, which at the time was still in its early stages and hadn't been released. Now that the feature is available, I thought I'd do a quick outline of the benefits.)

What do you do if you need to search information within your corpus of data, but your data is unique and not well served by the generic recognizers used for every domain? If you are integrating enterprise search into your organization, here is some information about how to enhance the benefits of your search engine.

Enterprise search has become one of the most critical applications within larger companies in recent years. That trend will continue to accelerate, but we've discovered that applications have to provide corporate users with results tailored to their specific needs, not just what the general out-of-the-box recognizers can find. Entity recognizers that "understand" the items that matter to a given business allow search and discovery applications to expose valuable content to their users - and to do so using the correct vocabulary.

Beyond the obvious technical benefits of customer-driven recognizers, the financial benefits to the organization are compelling as well. Data preparation and content mapping typically represent the *largest part of an enterprise search implementation*, and the money for all that work typically goes to the search vendor. Allowing users to build their own entity recognizers reduces the amount of money spent preparing content for applications, and gives them more control over how their applications are presented to their users or customers.

The best way to understand user-defined entities is to walk through the build-out of a user-defined recognizer. Let's start by looking at the entity processing of a document with standard recognizers in place. For this example, we'll work with a premise of medical research documents and build out a "Disease" recognizer. If we were to look at a medical document with only the standard recognizers, only people, places and similar items would be discovered. We wouldn't find references to the diseases in the document, such as "breast cancer".
If you were in the pharmaceutical or medical industry, creating and training all the possible references to diseases would be the heaviest lift in implementing an enterprise search solution. New technology in Lexalytics' Entity Management Toolkit helps users build out a simple "Disease" recognizer from a fairly small set of medical research documents (typically 100+).

A human begins the process by identifying instances of "breast cancer" and possibly generic "cancer" terms within a document. Once the user has marked up most of the instances of "breast cancer" and "cancer" as diseases, they process the document through the Entity Management Toolkit. The system then highlights instances it found, indicating possible disease references the user has not yet accepted. In some cases, the user may decline the machine's suggestion - "Breast Cancer Awareness Tea Party", for example, because that is an event, not a disease. Noting that this reference is not a disease is important, because it's an additional piece of evidence the Maximum Entropy model will use when it builds the disease recognizer.

Once a document is marked up by the user, its state is changed to "Ready", and additional documents are marked up until enough have been marked to build a model. The user then applies the Disease recognizer model and begins to process the rest of the documents for entity recognition beyond people, places or companies. One important point about the marked-up documents: none of them mentioned lung cancer by name; the user relies on the model to discern that lung cancer is a disease from the way it's described in the documents.

As stated earlier, users have until now had to rely on vendors to provide domain-specific recognizers, but the vendors are not the content and domain experts. The value of this tool is in empowering content owners to expose the value found within their actual content, particularly in specific market and industry segments. We believe publishers will be among the first to see its value in differentiating their content from the freely available content found on the web.
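To give a feel for what a Maximum Entropy model is doing under the hood, here is a highly simplified sketch using scikit-learn's logistic regression (the modern name for a MaxEnt classifier). The features and training triples are toy examples I've made up for illustration, not the Entity Management Toolkit's actual internals:

```python
# Toy sketch of a MaxEnt-style token classifier for a "Disease"
# recognizer; not the Entity Management Toolkit's actual internals.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(tokens, i):
    """Simple context features for the token at position i."""
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

# User markup as (tokens, index, is_disease) triples. Note the explicit
# negative: "breast cancer awareness tea party" is an event, not a disease.
train = [
    ("diagnosed with breast cancer in".split(), 2, True),
    ("diagnosed with breast cancer in".split(), 3, True),
    ("the breast cancer awareness tea party".split(), 2, False),
]

vec = DictVectorizer()
X = vec.fit_transform([features(t, i) for t, i, _ in train])
y = [label for _, _, label in train]
model = LogisticRegression().fit(X, y)

# Apply to unseen text. Context features, not a fixed word list, drive
# the decision - which is how, with enough markup, a mention like
# "lung cancer" can be recognized even if it never appeared verbatim.
tokens = "the patient developed lung cancer".split()
Xt = vec.transform([features(tokens, i) for i in range(len(tokens))])
print(model.predict(Xt))
```

A real recognizer uses far richer features and a sequence model over whole documents, but the core idea is the same: every accepted and declined suggestion becomes a training example, and the model generalizes from context rather than from a static dictionary.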