LexaBlog

Our Sentiment about Text Analytics and Social Media

Submitted by Seth Redmore on Wed, 2010-08-11 16:22

In an earlier post, Jeff Catlin described analysis that we did around Bally's vs. Bellagio using publicly available customer reviews. We did this analysis using something called "categories", which is basically a fancy name for search strings. The important aspect of this analysis was in finding the sentiment associated with different important aspects of the hotel experience - using a known set of categories. In other words, the actual terms for each of these categories were defined, and then we determined the sentiment of each category.
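To make the "categories are search strings" idea concrete, here is a minimal sketch in Python. It is not how Salience does it: the category terms, the toy word lexicon, and the function names are all made up for illustration. It only shows the shape of the technique: define the terms up front, find the sentences that mention them, and score those sentences.

```python
import re

# Hypothetical category definitions: each category is just a list of search strings.
CATEGORIES = {
    "Rooms": ["room", "bed", "suite"],
    "Service": ["staff", "service", "front desk"],
    "Location": ["location", "walking distance"],
}

# Toy sentiment lexicon, purely illustrative (a real engine does far more than this).
LEXICON = {"great": 1.0, "friendly": 1.0, "clean": 0.5,
           "dirty": -1.0, "rude": -1.0, "noisy": -0.5}


def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_score(sentence):
    """Sum the lexicon values of the words in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(LEXICON.get(word, 0.0) for word in words)


def match_categories(reviews):
    """Return {category: [(sentence, score), ...]} for sentences that hit a search string."""
    hits = {name: [] for name in CATEGORIES}
    for review in reviews:
        for sentence in split_sentences(review):
            lowered = sentence.lower()
            for name, terms in CATEGORIES.items():
                if any(term in lowered for term in terms):
                    hits[name].append((sentence, sentence_score(sentence)))
    return hits
```

Calling `match_categories(["The room was dirty. The staff were friendly and great."])` files the first sentence under "Rooms" with a score of -1.0 and the second under "Service" with a score of 2.0. "Location" stays empty, which is exactly the limitation discussed next: you only ever hear about the categories you thought to define.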

This technique is highly useful for ongoing monitoring of known service areas, and provides highly reliable recall of those areas. Where it is less useful is for discovery applications - where you aren't comparing on known areas, but instead trying to find out what's outside of your analysis area. Discovery is inherently a different process from monitoring. Discovery means that you don't know what might or might not be out there. In monitoring applications, it's generally important to have higher accuracy and recall, so that you can correctly monitor trends. In discovery applications, accuracy and recall are paradoxically less important - largely because they're not really relevant concepts. You can't define accuracy or recall for things that you don't know about. Once you define them, then you can examine the statistical rigor of your system. But, I digress.

We took sample data from a publicly available review site and extracted the relevant themes. The following table shows the name of the institution, a sentiment score (for the theme), and the theme itself. It is important to differentiate between the sentiment score for the theme and for the entities resident in the text - consider the following sentence: "President Barack Obama is doing a great job with this awful oil spill." Now, whether you agree with that statement or not, you can see how the entity "President Barack Obama" would get a positive score, while "oil spill" would get a negative score.

Hotel BLOOM! Positive trendy hotel
Hotel BLOOM! Positive been decorated
Hotel BLOOM! Positive not mean
Hotel BLOOM! Positive main shopping
Hotel BLOOM! Positive botanical gardens 
Hotel BLOOM! Positive centrally located
Hotel BLOOM! Negative long flight
Hotel BLOOM! Negative next day
Hotel BLOOM! Negative not sure
Hotel BLOOM! Negative ordinary coffee
Hotel BLOOM! Negative stale smoke
Hotel BLOOM! Negative not require
Hotel BLOOM! Negative been given
Hotel BLOOM! Negative never stay

Some of these themes are clearly more useful than others. It is also important to note that the sentiment of a theme is not necessarily contained in the theme itself - it is inferred from the language around the theme. If I were coming into this cold, and didn't have anything defined, I would be sure to track themes around "decor" and "location" on the positive side, and "smoking" and "coffee" on the negative side. The other themes would potentially mean more when referenced to the text itself, but this is just to give some quick examples.

Here's another example:

W Seattle Positive well decorated
W Seattle Positive romantic getaway
W Seattle Positive tastefully decorated
W Seattle Positive centrally located
W Seattle Positive even walked
W Seattle Positive top quality
W Seattle Negative wireless connection
W Seattle Negative outrageous prices
W Seattle Negative local taxes
W Seattle Negative even get
W Seattle Negative poorly lit
W Seattle Negative never stay

 

Again, the location seems to be nice and the decor is also well taken care of. On the negative side, I would watch for complaints around communication issues and pricing. However, this is the W, so what they need to watch for (as we demonstrated in the Bally's vs. Bellagio post) is whether they're providing value for money. If you're not providing a proportionately better experience relative to the cost difference, guests are going to end up being dissatisfied. There's more detail to how themes are determined to be relevant, but this is enough information to give you a taste of how to use themes to discover what you should be looking for (or looking out for!).

Submitted by Seth Redmore on Tue, 2010-08-10 13:49

For the past year, we've experimented with using a web service to provide low-cost text analytics services, and we've learned a lot doing this. Our core strength as a company is providing fully customizable text analytics that are applicable to any industry, for a wide range of text analytics solutions. Lexascope was designed to fit a single use-case, that of longer form media content - for low cost entry into the media analysis market.

What we've learned is that working to a single use-case is not what fits our customers the best. Our customers desire the fully custom nature of our Salience engine. We've also seen an increasing desire to analyze short-form content, an application for which Lexascope was not designed.

As such, we've decided to focus our engineering resources on a lighter-weight installed version of Salience: one that has all the power of our current engine, but allows more pricing flexibility for users who don't have the content volumes to justify an enterprise-class license.

We've stopped accepting new registrations for Lexascope, and will work with current Lexascope customers to come up with a solution to their needs. We think that this new solution fits much better with our core strengths as a company, and allows us to focus on our core business of providing the best possible text analytics functions, while simultaneously enabling a larger set of businesses to take advantage of the strengths of our super-flexible, very powerful core software engine.

Submitted by Jeff Catlin on Thu, 2010-07-22 14:21

Hotel reviews represent one of my favorite uses of text analytics. About five years ago we built a site with FAST that measured hotel reviews to build a “consensus opinion” of hotels in a narrow geographic area. The idea was to give users of the site (shown below) an idea of what people thought of various hotels in a given area (Manhattan, for example). It’s a nice application because it plays to the strengths of sentiment scoring, where a group of reviews is rolled together to form a consensus opinion. Automated engines are very accurate in such a use case (possibly more accurate than people), and they can handle a large volume of content.

Recently we revisited the scoring of hotel reviews, and dove a bit deeper this time. Rather than simply generating a score for each property, we scored the reviews for various features of the hotel, like location, staff, and dining. For this test we used reviews for a couple of hotels in Las Vegas, the Bellagio and Bally’s, and we measured the following features for each:

- Rooms
- Price
- Facilities
- Location
- Cleanliness
- Service
- Overall

An important aspect of this analysis is that the hotels are basically in the same location (right across the street from each other). When you examine the results (below), you’ll see that the hotels scored nearly the same on location. This is a good test that the results are indicative of reality.
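For the curious, rolling individual review scores up into a per-feature consensus can be sketched in a few lines of Python. This is a simplified illustration, not our production scoring: the function name and the (hotel, feature, score) input shape are assumptions, and a plain average stands in for whatever weighting a real system would use.

```python
from collections import defaultdict
from statistics import mean


def consensus(scored_reviews):
    """Average review-level scores into one score per (hotel, feature) pair.

    scored_reviews is an iterable of (hotel, feature, score) tuples; this
    input shape is an assumption made for the sketch.
    """
    buckets = defaultdict(list)
    for hotel, feature, score in scored_reviews:
        buckets[(hotel, feature)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

With placeholder input such as `[("Hotel A", "Location", 0.5), ("Hotel A", "Location", 1.0), ("Hotel B", "Location", 0.75)]`, both hotels come out at 0.75 for Location. Two properties across the street from each other landing on the same location score is the kind of sanity check described above.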

 

Digging deeper into the results, I was surprised to see that Bally’s had higher scores than Bellagio, because Bellagio is one of the 5-star properties in Vegas, so we dug a bit deeper to make sure we weren’t scoring the reviews wrong. We focused in on the most positive and most negative reviews and tried to figure out why Bellagio wasn’t scoring higher. The chart below shows that the “happy campers” were equally happy with Bellagio and Bally’s, but the unhappy visitors were really unhappy with the Bellagio.

When we dug into the reviews we discovered that people expected more for their money than they were getting at Bellagio. Through the simple application of sentiment analysis on publicly available information, we show that companies can make these comparisons with much higher reliability, at minimal incremental cost, and with an unprecedented ability to adjust categories on-the-fly, either based on these results, or to test out new hypotheses. In fact, using this technique, we can move beyond the limitations of traditional approaches by running additional analysis to discover new, previously unmeasured categories based on recurring themes within the data. What this means for brands is that those who are able to leverage sentiment analysis will remain at a significant advantage over their competitors, able to anticipate and proactively respond to how customers perceive their brand much faster, more comprehensively, and at significantly lower cost than with existing methods.

Submitted by Seth Redmore on Mon, 2010-06-28 17:57

One of the two major new features in Salience 4.3 (releasing around June 30th) is "opinion mining". Opinion mining expands our core technology to handle indirect quotes. We've been able to extract quote-mark delimited quotes for a while now, and you could perform further analysis on those quotes (which were attached to the speaker).

Opinion mining means that Salience 4.3 can now handle sentences like:
1) Seth then asserted that this was a truly awesome feature.
2) Tim agreed that Bill was unduly angry.
3) Paul explained that the code was broken.

In each case there is a speaker, a topic, and sentiment expressed. The "speaker" is always an entity - and it could be a place, person, or company. The topic can be either a theme or an entity. Sentiment is assigned to the topic.

Thus, in sentence 1:
Speaker: Seth
Topic: awesome feature
Sentiment: positive

Sentence 2:
Speaker: Tim
Topic: Bill
Sentiment: negative

Sentence 3:
Speaker: Paul
Topic: code
Sentiment: negative

How does this work? I'm glad you asked. We have a data directory full of patterns for opinions. These basically come down to the following 3 classes:
1) "attributed" opinions (e.g. Paul said "This is great")
2) "cross sentence" opinions ("This is Great." Said Paul)
3) "unattributed" opinions (the examples above)

Unattributed uses a list of verbs that are expected to express an opinion, and looks for certain patterns using those. There are roughly 200 verbs that clearly express opinion (acknowledge, accuse, add, admit, advise, affirm, allege, answer...) and roughly 200 that have additional requirements because they indicate opinion only in certain contexts: (accept, account for, address, agree, allow, analyze...). To give an example of how this works, consider the following:

"Paul charged at George." vs. "Paul charged that George was incompetent." The "that" in the second sentence changes "charged" from being an action to indicating an opinion.
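Here is a rough Python sketch of that verb-plus-"that" check. To be clear, this is not the .ptn pattern format and not the Salience implementation; the regular expression, the function name, and the tiny verb sets are assumptions for illustration (in particular, which list "asserted" and "explained" belong to is my guess). It only pulls out the speaker and the statement, and leaves topic and sentiment assignment alone.

```python
import re

# Tiny illustrative subsets; the real lists run to roughly 200 verbs each.
# Assumption: "asserted" and "explained" are treated as always-opinion verbs here.
ALWAYS_OPINION = {"asserted", "explained", "alleged", "admitted", "affirmed"}
# Verbs that only signal an opinion in certain contexts, such as before "that".
OPINION_WITH_THAT = {"charged", "agreed", "accepted", "allowed"}

# Speaker (one capitalized word), optional "then", a verb, then the rest of the sentence.
PATTERN = re.compile(
    r"^(?P<speaker>[A-Z]\w*)\s+(?:then\s+)?(?P<verb>\w+)\s+(?P<rest>.+?)[.!?]?$"
)


def find_opinion(sentence):
    """Return {'speaker', 'verb', 'statement'} if the sentence reads as an opinion, else None."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    verb = match.group("verb").lower()
    rest = match.group("rest")
    has_that = rest.lower().startswith("that ")
    if verb in ALWAYS_OPINION or (verb in OPINION_WITH_THAT and has_that):
        statement = rest[len("that "):] if has_that else rest
        return {"speaker": match.group("speaker"), "verb": verb, "statement": statement}
    return None
```

`find_opinion("Paul charged at George.")` returns `None`, while `find_opinion("Paul charged that George was incompetent.")` returns Paul as the speaker and "George was incompetent" as the statement. Keeping the two verb sets separate mirrors the split described above between verbs that always express opinion and verbs that need extra context.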

For those who update to 4.3, check the data directory: /data/opinions/*.ptn for all the patterns.

Try it out... we think that your opinion on this will be positive.

Submitted by Christine Sierra on Wed, 2010-06-23 20:15

Entity extraction in text analytics is the basis of the entire process - identify people, places, companies and themes and use them to better understand the content.

There are two types of entity recognizers that we have used in Salience 4 with much success:

Model based entities: this process looks at language and determines factors like parts of speech and extracts the relevant entity.

Customer driven lists: this process looks at a list provided by a customer to match and extract the relevant entity.

Our soon-to-be-released Salience 4.3 introduces query-based entities. This process takes into account the combination of words to make a match and extracts the entity based on that query.

For example, there is more than one Senator Udall in the United States. Mark Udall is a Senator from Colorado. Tom Udall is a Senator from New Mexico. If you had a query-based entity recognizer for "Tom Udall" you would create a query that includes the terms "Senator Udall" and "New Mexico" to determine that it must be Senator Tom Udall and not Senator Mark Udall.

While this is often compared with confidence-based entities, this isn't based on a confidence about the language, but an absolute.
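As a rough illustration of the Udall example, the sketch below treats a query as a set of terms that must all appear in the text. This is not the Salience query syntax (which isn't shown here); the `EntityQuery` class, the `match_entities` function, and the Colorado query for Mark Udall are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityQuery:
    """Hypothetical query: the entity matches only if every required term is present."""
    entity: str
    required_terms: tuple


QUERIES = [
    EntityQuery("Tom Udall", ("Senator Udall", "New Mexico")),
    EntityQuery("Mark Udall", ("Senator Udall", "Colorado")),
]


def match_entities(text, queries=QUERIES):
    """Return the entities whose required terms all occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [
        query.entity
        for query in queries
        if all(term.lower() in lowered for term in query.required_terms)
    ]
```

`match_entities("Senator Udall spoke about New Mexico water rights.")` returns `["Tom Udall"]`, and a text that mentions only "Senator Udall" returns nothing at all. There is no score to threshold: the query either matches or it doesn't, which is the "absolute" behavior described above.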

Submitted by Christine Sierra on Tue, 2010-06-22 18:06

Sentiment analysis solutions are popping up everywhere these days - or so it seems. Every day there is a blog post, or Twitter post (or 100), asking how it works or arguing a point about sentiment and what exactly it means. There has been an increase in articles covering everything from automated solutions vs. human analysis, to accuracy, to processing online content along with traditional content, to analyzing customer conversations. So, as a text analytics provider that has been offering sentiment analysis for years now, we thought it was about time we introduced a guide that organizations could use when they're trying to decide what they need for an analysis solution. Lexalytics is pleased to share the Lexalytics Sentiment Spectrum. It is a view of all the factors that may come into play when deciding which route is best for your company.

 

Our hope is that by looking at the various factors that go into sentiment analysis, along with the different methods by which it can be implemented, it will become a little clearer which process may be best for your organization. The key questions we believe you need to ask include:

    • Is my data public or private? What type of security do I need on it?
    • Will I require any customization of dictionaries or integration to an existing application?
    • What does it cost?
    • What are my accuracy expectations?
    • Are we processing 100 documents a day or 100 documents a minute?
    • How many sources do I have flowing into my system?
    • Do I want to process online content? In house content? Both?
    • Does the solution give me sentiment of a whole document, or all the things contained within the document?

As we always do, we suggest you talk to a variety of vendors to review these key points and to ask for a possible proof of concept. Sentiment is inherently different for each company depending on what it is they need to analyze and accomplish - and how much human interaction is going to be involved. Some industries can use automated sentiment with little interaction at all, and others need additional validation or customization to get the perfect results.

Submitted by Christine Sierra on Thu, 2010-06-17 18:35

I came across this great article by Frank Brown, Ph.D., from Accelrys (a company, I'll be honest, I had never heard of until today) in Information Management. He was describing the world of the scientific enterprise and how smart information management can help to strike a balance between content and context in R&D.

In part, he wrote, "The term “business intelligence” has risen in the realm of information management for a reason. A collection of letters, numbers, figures or images are meaningless until processed in a way that makes the information understandable and usable. That’s what distinguishes raw data from true intelligence." I think regardless of industry, people are wondering how to get at silos of information and make them more useful.

He continued with, "But when the available knowledge base includes an enormous breadth of sources, data formats and locations, relying on human processing alone is simply not feasible. This is where emerging technologies such as advanced semantic search and text analytics come in. These types of artificially intelligent categorization tools can help remove the time and cost constraints involved in extracting the context from complex content so that research collaborators can capitalize on all the valuable stores of data available to them – structured and unstructured, proprietary and public."

If text analytics and business intelligence can help with the information management process for companies whose job it is to better the world with drug manufacturing or scientific breakthroughs, what do you think it could do for your company?

Submitted by Christine Sierra on Tue, 2010-06-15 14:07

I came across an interesting blog post in Technology Review that showcased the Arizona Financial Text system (AZFin Text). According to the author, Christopher Mims: "...it works by ingesting large quantities of financial news stories (in initial tests, from Yahoo Finance) along with minute-by-minute stock price data, and then using the former to figure out how to predict the latter. Then it buys, or shorts, every stock it believes will move more than 1% of its current price in the next 20 minutes - and it never holds a stock for longer." What I find interesting is the range of discussion that anything related to trading and automated text analysis spawns - from ethical arguments to accuracy discussions to the human interaction factors. Regardless, text analysis in financial services is an area that has been progressing for years and seems to have found some traction when applied correctly.

Submitted by Christine Sierra on Mon, 2010-06-14 18:29

For over 7 years we've been known as Lexalytics. I'll be honest, I didn't pick the name so I can't take credit. However, I often get asked what it means or where it came from. To us, it's pretty simple. You take Lexical:

Main Entry: lexical
Function: adjective
Date: 1836
1 : of or relating to words or the vocabulary of a language as distinguished from its grammar and construction
2 : of or relating to a lexicon or to lexicography

and add Analytics and you get Lexalytics. Since we analyze text-based content, or words, to provide additional analysis to customers, it was the perfect name.

Lately, I've seen a little spike in other "lytics" popping up at conferences and online. For example, Social Analytics. Referred to as Socialytics. Good one. Today I saw Community Analytics - Communilytics. I'm certainly not claiming Lexalytics kicked off the world of "lytics" - that'd be silly - but what does encourage me is the fact that businesses and organizations are investigating different ways to analyze information.

There has been a major transition in the past 5 years of content from scanned, stored and printed to blogged, posted and shared. As more and more channels are used to create and disseminate content, enterprises will need to explore all the various ways analytics will play into their infrastructure and applications. Some analytics are based on pages and pages of documents published in very specialized industries while some are comments and posts on very public domains. Whether you are analyzing Pharmaceutical data like our friends at Pharmalytics or simply exploring tweets and social content (Twitterlytics, perhaps?), there is a lot of power and information to be found within content. Try it for yourself. You may find a new form of "lytics" you can share with us all.

Submitted by Seth Redmore on Mon, 2010-06-07 22:03

Text analytics workshow 2-modified

Hi all! I (Seth) presented this at the 2010 Text Analytics Summit. It's a tour de force (or maybe not, but it's at least a little interesting) presentation about what we think makes for "perfect" text analytics, as well as a discussion on what we're doing over the next 12 months to make our stuff even more perfect. There are a few notable tidbits in here, like the fact that we're going to be supporting foreign languages. (You have to watch the preso to find out which one we're doing first :) ). Sarcasm! Who hasn't run up against the "Well what do you do about Sarcasm?" question? We have an answer... There's other cool stuff in there too. Check it out!