LexaBlog

Our Sentiment about Text Analytics and Social Media

Submitted by Mekkin Bjarnadottir on Wed, 2014-04-23 20:14

April 23 marks the 450th anniversary of the birthday of one William Shakespeare of Stratford-upon-Avon, who not only wrote a multitude of plays and sonnets that are still being studied and performed, but also invented an astounding number of words and phrases that are now in common use. In the past two centuries, however, the legendary figure has come under scrutiny. Unable to reconcile Shakespeare’s supposedly rudimentary education with his wildly eloquent writing, some began to propose that Shakespeare was not the man everyone thought him to be. And thus began the great Shakespearean Authorship Question.

For a long time, the debate was waged in the trenches of history and subjective literary analysis, but some shrewd young data scientists have applied text analysis and data mining techniques in order to shed further light on this continuing debate. 

Researchers from Dartmouth used text analysis to compare Shakespeare with a few of his contemporaries: Sir Francis Bacon, Christopher Marlowe, and Edward de Vere, Earl of Oxford, all of whom have been proposed as alternative authors of Shakespeare’s work. By analyzing character usage, word length, and the percentage of unique words used, they were able to rule out both Bacon and Marlowe as probable authors. The study seems to suggest Edward de Vere as the most likely candidate, but unfortunately, his surviving corpus is so small that accuracy cannot be guaranteed.

In any case, the debate over Shakespeare’s authenticity will likely rage on, unless some new key evidence is discovered or Shakespeare himself rises from the grave. As for myself, I’m hoping that he was nothing more or less than what he seemed: a writer, an actor, and an upstart crow.

Submitted by Seth Redmore on Thu, 2014-04-10 12:32

Mullets. If you have one, it's awesome, and it's certainly visually gripping.

In recent weeks, we've been getting a lot of requests for "creating Tag Clouds with Salience." This has caused a lot of consternation and angst amongst us, because we feel that we're failing our customers. 

First and foremost, if you don't know what a word cloud is, go here:  http://www.wordle.net/

Now, try to forget that you've even seen them. Here's a blog post that's going to tell you, in gory detail, just why they are so bad:  http://www.niemanlab.org/2011/10/word-clouds-considered-harmful/

To summarize this article, and to put this in our own words, it comes down to this: you can't measure anything from them. You can't compare them. They are form without much in the way of function – the packing algorithms simply try to get all the words into the space, and if you remove one word, the cloud will look completely different.  

Here's a wonderful example of someone who really likes word clouds, and from Stanford, no less:  https://dhs.stanford.edu/algorithmic-literacy/using-word-clouds-for-topic-modeling-results/

Our experience is that text clouds are the visualization that people ask for when they don’t really have an idea of what they are trying to show and don't understand the questions they're going to have to ask.

Sort out your question first, and then figure out how you want to visualize it. Don't just jump to a word cloud because they're pretty and look "texty."

Submitted by Paul Barba on Tue, 2014-04-08 12:28

There are a lot of ways to tune Salience, from tweaking a few simple sentiment weights to developing custom patterns that capture complex grammatical relationships. The quickest and easiest way to tune the engine to your own needs is through our options. These are simple choices you can make to tell the engine how you want your results processed.

Options are set via the API, so a few simple lines of code change what the engine does. You can check here to see how options are set in the programming language you're using.
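To make that concrete, here's a rough sketch of the pattern. The class and option names below are hypothetical stand-ins, not the actual Salience API; the real calls depend on the language wrapper you're using.

```python
# Hypothetical sketch only: "SalienceSession", "set_option", and these option
# names are illustrative stand-ins, not the real Salience API.
class SalienceSession:
    """Toy stand-in for an engine session that remembers option settings."""
    def __init__(self):
        self.options = {}

    def set_option(self, name, value):
        self.options[name] = value

session = SalienceSession()
session.set_option("tagging_threshold", 80)       # e.g. option 1 below
session.set_option("process_html", True)          # e.g. option 2 below
session.set_option("anaphora_resolution", False)  # e.g. option 3 below
```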

Here are five options you might consider changing, based on your text analysis needs:

 1. Tagging Threshold

Some data sources are filled exclusively with high quality, interesting content. Other feeds...aren't. If you're worried about corrupted articles filled not with words but with arcane computer symbols throwing off your counts, consider changing our tagging threshold. This defines the minimum percentage of actual text characters the engine requires to do any processing. On the flip side, if your content does contain many odd characters for a legitimate reason, this option might need turning down.
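As a rough illustration of the idea (this is not Salience's internal logic, just a sketch of what a text-character threshold does), you can think of it as a filter like this:

```python
# Conceptual sketch only (not Salience internals): skip documents whose share
# of "real" text characters falls below a minimum percentage.
def text_character_ratio(text: str) -> float:
    """Percentage of characters that are letters or whitespace."""
    if not text:
        return 0.0
    ok = sum(1 for c in text if c.isalpha() or c.isspace())
    return 100.0 * ok / len(text)

TAGGING_THRESHOLD = 80  # example value: require 80% real text before processing

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "\x00\x1f{}[]|~~ %%% &&& <<<>>> 0x7f3a9c",   # a corrupted article
]

for doc in docs:
    action = "process" if text_character_ratio(doc) >= TAGGING_THRESHOLD else "skip"
    print(action, repr(doc[:30]))
```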

2. Process HTML Content

If you're passing HTML into Salience, this is an option you should definitely set. When it's turned on, we'll strip out all the tags and even do our best to differentiate the content of a page from the sidebar and inserted advertising. An important caveat! If your content is not actually HTML, setting this option will still activate the advertising-stripping algorithm and may cause you to lose some content. So make sure to turn it off when you're done processing HTML.
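For a sense of the tag-stripping half of the job, here's a minimal sketch using Python's standard library. It only removes markup; it does not attempt the sidebar and advertising detection that the real option also handles.

```python
# Minimal tag-stripping sketch using the standard library. This is NOT
# Salience's boilerplate-removal algorithm, just a bare-bones illustration.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Review</h1><p>Great hotel, <b>terrible</b> wifi.</p></body></html>")
print(" ".join(extractor.chunks))  # Review Great hotel, terrible wifi.
```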

3. Anaphora Resolution

What we call anaphora, you might call pronouns, and Anaphora Resolution means counting those pronouns as instances of the mentions they refer to. This is usually what you want, but you can turn the feature off if you prefer; sometimes you're only interested in how often a company or person was explicitly called out by name.

4. Neutral Upper and Lower Bound

When processing an individual document, Salience returns raw sentiment scores for you to process as you desire. What's the cutoff for a 'positive' mention versus a 'negative' mention? How many gradations are desired? It varies from use case to use case, so we leave that choice to you.

When processing a collection, though, Salience will group your results for you. These options let you specify the score cutoffs that determine what counts as a 'positive' mention and what counts as a 'negative' one. As with other options, when there's no right answer we leave the final decision in your hands.
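Conceptually, the bounds just carve the raw score range into three buckets. Here's a quick sketch; the cutoff values are arbitrary examples, not Salience defaults.

```python
# Conceptual sketch of neutral bounds: bucket raw sentiment scores into
# positive/neutral/negative. Cutoffs are arbitrary examples.
NEUTRAL_LOWER = -0.3
NEUTRAL_UPPER = 0.3

def bucket(score: float) -> str:
    if score > NEUTRAL_UPPER:
        return "positive"
    if score < NEUTRAL_LOWER:
        return "negative"
    return "neutral"

for score in (0.85, 0.1, -0.55):
    print(score, "->", bucket(score))  # positive, neutral, negative
```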

5. Theme Topics

We've talked about options for managing content quality, options for managing different file formats, and options for controlling how your results are calculated. One final class of options can be used to get a little more speed out of Salience. Theme Topics are an example of an interesting feature Salience calculates, but one you don't have to pay a performance cost for if you aren't using it.

You can write Boolean Queries or Concept Queries to discover which documents discuss which topics. We've also got themes, which are noun phrases that convey ideas discussed in an article. If you want to know which themes go with which topics, Theme Topics tell you just that. If you aren't using this feature, though, the calculations Salience performs to discover them aren't of any benefit to you, and turning the option off will give you a small speed boost.

This is just a sampling of the options available in Salience. Check out the full list if you want to see what else is available, and keep an eye on our release notes for new options in new releases.

Submitted by Tim Mohler on Thu, 2014-04-03 20:11

Text analytics is not a perfect art, and very occasionally it fails so spectacularly that we feel we really need to start sharing these rare moments with you. Thus we present to you our first installment in “Text Analytics Fails”. Enjoy!

Here we were, categorizing hotel reviews into lifestyle categories such as "romantic" and "trendy." Our categorization model for this particular set of data told us that we could expect "weekend away" and "romantic weekend" to be good clues that the review was about romance. Our favorite fail so far on this data set was the following review, categorized by us as "likely romantic":

"Okay, so the reviews on here may not be the best, however for you pay this is a decent place to lay your head. Its clean and decent . No frills and the location could not be better. Ideal for a lads weekend away. Take your partner for a romantic weekend and expect a divorce."

This is a good example of both negation (inferred from the divorce comment) and the importance of context ("lads weekend away" is not the same as "weekend away" by itself). Adding something like "NOT divorce" to the categorizer likely doesn't help much, since this is a one-off and there are plenty of legitimate mentions of divorce alongside romance - "I'm a divorced single and looking for a romantic place to take my new interest."

If you're interested in how consistency can matter more than perfect accuracy, check out Craig Golightly's talk from 2013's Lexalytics User Group.

Submitted by Mekkin Bjarnadottir on Tue, 2014-04-01 17:10

Lexalytics is pleased to announce the development of its newest Salience-driven program, Deliverance, a comprehensive text analytics interface designed to extract meaning from educational texts and compile thematic reports. This development follows on the heels of a radical shift in Lexalytics’ targeted consumer base, now primarily children in grades K-12. Deliverance will enable students to easily input required school reading materials and automatically create book reports.

“We just listened to what our new target demographic was saying, on Twitter, on Facebook, everywhere, and we found a near universal dislike of homework. So we thought, why not automate an arduous process? It takes seconds for Deliverance to read what would take a student days. This breakthrough innovation allows kids to get back to what really matters, like Mass Effect 2,” says VP of Product Placement Seth Redmore.

Lexalytics has built upon successful existing Salience features, such as theme extraction, summarization, and entity recognition, to develop Deliverance, which Redmore sees as the next logical step for the company. New features include the ability to input a list of vocabulary words that the student is required to use. Since book reports come down to analyzing text and deriving themes, the biggest leap here, he says, was simply choosing to explore new markets to deliver insights into.

Deliverance is still in early beta and will be released fully at a later date. “We’re very pleased with what we’ve been able to accomplish so far,” he continues, “but it’s still early days. Our biggest challenge so far has been modulating the level of discourse so that it is appropriate to each age group. Still, I find Deliverance’s analysis of ‘The Very Hungry Caterpillar’ as a capitalist manifesto expounding the transformative joys of consumerism quite convincing.”

Submitted by Rami Nuseir on Thu, 2014-03-27 16:28

My assignment was simple. Pick a city, run a Twitter search for #cityname, and see what Salience comes up with on the tweets I extract.
I live in the lovely city of Montreal, so I collected 10,000 tweets containing #Montreal, to see what's going on around my city.

You know what I found?

SPAM! SO MUCH SPAM! It seemed like every other tweet was selling something by Nissan or Duracell or Adidas or Diamond Supply Co. or some other brand.

After a very disappointing first pass, I ran a quick discovery analysis with no special settings to see if anything stood out.

Facets:
clouds: 64
morning: 30
wind: 23
missjuliexxx: 16
nicebabe111: 14
weather: 12

OK, lots of weather stuff. No surprise there; we had a brutal winter. Nicebabe111 and missjuliexxx were a bit of an anomaly, though.

Fearing the worst, I dug a bit deeper and looked at the tweets each of them appeared in. You'll never guess what I found: ESCORTS! Tweet after tweet with #escorts in it, 196 to be precise.

I kind of expected this. Montreal is well known as a city of vices, with our strip clubs, escorts and casinos being really popular. I guess I didn't expect them to advertise on Twitter is all :)

Entities:
habs: 311
toronto: 125
goharshahi: 101
quebec: 57

This section isn't too bad.

The Habs is the nickname given to the Montreal Canadiens, everyone's favorite hockey team! It's almost playoff time right now, so the games are on everyone's mind.

Most of the Toronto tweets were either spam, or combined with GoharShahi. Apparently, Gohar Shahi is the religious leader of RAGS International, a spiritual group. He was speaking in Toronto a week after I ran this data, and it was being advertised on Twitter as well. Sigh, more spam.

Quebec tweets were not so bad though. They were mostly related to Montreal's language police shenanigans. Basically, the OQLF is a government agency tasked with ensuring the French language laws are upheld around these parts. Unfortunately, they sometimes take it too far, like last year's infamous Pastagate incident.

And there you have it folks, one man's journey into a city's tweets. If I've learned one thing, it's this: there is absolutely no point in using #cityname in your tweets; it's just going to get buried amid all the spam.

Submitted by Mekkin Bjarnadottir on Tue, 2014-03-25 23:33

Better marketing starts with listening. In their article, "Marketers Want to Know What You Really Mean Online", Katherine Rosman and Elizabeth Dwoskin write about how sentiment analysis is helping marketers listen to and understand conversations on social media. From the red carpet to the Oval Office, sentiment analysis is becoming an integral part of understanding not only what people are saying, but what they really mean. The rise of social media makes it imperative for those who want to stay relevant to join the conversation, because at the end of the day, the worst thing people can say about you is nothing at all.

Submitted by Rami Nuseir on Thu, 2014-03-20 15:18

How do your customers communicate with you? Are you aware of the things they're saying?

Seth Redmore, VP of Product here at Lexalytics, wrote a simple and effective piece for TechRadar explaining the benefits of text mining in today's text-heavy world. He touches on social media, emails, blog posts, and more. He makes a great case, explaining why listening to your clients is important, especially when they may give you the idea for your next big thing.

Read up on how text mining can help your business dig for gold here.

Submitted by Mekkin Bjarnadottir on Tue, 2014-03-18 22:08

Lexalytics customer mBlast posted a great piece on their blog about the downfall of Klout. Put simply, popularity is out, influence is in, and influence requires context.

"I’ve said it before and I’ll say it again: you cannot determine influence without subject matter. You can’t get subject matter unless you break down an author’s posts and analyze them.  You can’t just rely on Twitter or any one Social Media stream because everyone uses them independently and even differently. And finally, unless you can put this all together and see the person’s voice across all streams, you have completely disconnected data that not only doesn’t tell you much, but can possibly even lead you in the wrong direction concerning what a person thinks or talks about." -An Obituary for Popularity

Check it out here.

Submitted by Tim Mohler on Wed, 2014-03-12 03:19

Classification is a key value proposition for text analytics - it allows users to quickly drill into articles of interest and look at trends over time. Setting up a classification scheme, however, can be a lot of work.

The common techniques are:

  1. Using queries to bucket documents
  2. Using a machine-learning model based on tagged document sets

Model-based categorization starts with humans marking each piece of content with all of the categories it satisfies. Something like "Buying kimonos while on holiday" should fit into Travel and Fashion. Generally, you need hundreds of examples per category. Then you set a machine-learning algorithm loose on the data set. The machine analyzes all of the text in each marked document and constructs some kind of signature for each category. This looks easy on the surface - deciding whether something should go in one category or another is a relatively easy decision to make. A big issue with this approach, though, is what's called "overfitting", where the algorithm performs really well on the exact content you fed it but sucks elsewhere. Changing its behavior is not easy, nor is it obvious WHY something got categorized the way it did.
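Here's a minimal sketch of what model-based categorization looks like in practice. scikit-learn is an assumption on my part (the post doesn't prescribe any library), and the training set is invented; with this little data, the overfitting problem described above is guaranteed.

```python
# Minimal model-based categorization sketch. scikit-learn and the tiny tagged
# set below are illustrative assumptions, not what Salience or the customer used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "Best beaches for a cheap week away",
    "Spring runway trends: silk and linen",
    "Cheap flights to Tokyo this autumn",
    "How to style a kimono jacket",
]
# One label per document for simplicity; real taxonomies often allow several.
train_labels = ["Travel", "Fashion", "Travel", "Fashion"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_docs, train_labels)

print(model.predict(["Buying a kimono while on holiday"]))
# The learned "signature" is opaque, and with this little data the model will
# overfit: great on the content it was fed, unreliable on anything else.
```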

Query-based categorization starts with humans trying out search terms that should define the category and seeing how well those terms retrieve documents. So, in the taxonomy above, some human gets to decide the keywords appropriate for Sports, Travel, Fashion, etc. Often this is done by people looking at articles and trying to figure out which words they should choose for their queries - they might pick "kimono" and "japan", for instance, in my example above. This is difficult work to do - skilled labor is required, plus a lot of time. However, queries are very transparent - you can see immediately why something matched - and they are easy to change. When you realize that "kimono" is not very predictive of fashion articles in general, you can delete that word and put in something else if you like.
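A toy version of query-based bucketing makes the transparency point obvious: the category definitions are just lists of hand-picked terms, and you can see exactly which term caused a match. The terms below are illustrative, borrowed from the kimono example; real query languages also support AND/OR/NEAR/NOT.

```python
# Toy query-based bucketing: a document lands in a category when it contains
# any of that category's hand-picked terms.
queries = {
    "Fashion": ["kimono", "runway", "outfit"],
    "Travel":  ["japan", "holiday", "flight"],
}

def categorize(doc: str):
    text = doc.lower()
    hits = {}
    for category, terms in queries.items():
        matched = [t for t in terms if t in text]
        if matched:
            hits[category] = matched   # transparent: you can see why it matched
    return hits

print(categorize("Buying kimonos while on holiday in Japan"))
# {'Fashion': ['kimono'], 'Travel': ['japan', 'holiday']}
```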

There are pros and cons to each. Building queries requires a fair amount of thought and a lot of iteration, but queries are easy to change if the data changes, and the results are transparent. On the other hand, models just require users to tag documents and let the machine do the heavy lifting, but the results are opaque, and adjusting a model for data changes requires a user to re-tag content - potentially quite a lot of content.

Salience supports query-based classification as well as concept queries, which are model-based, even though they use keywords. As you would expect, query-based categorization takes effort to create the proper queries, while concept queries are less predictable in what they return.

Recently we worked with a customer who wanted to classify content via queries, but found the amount of time required to create the queries unacceptable. The taxonomy was not particularly large, but the pieces of content were short, and too much time was being spent finding query terms that returned only a single document, or just a few. It wasn't obvious to the query creators that these terms were "onesies" just from looking at the content. We tried using concept queries, but the content had a lot of specialized terms that were not in the concept query model.

We decided that in order to keep the query features the customer liked (transparency, ability to quickly modify) and reduce the time and effort, we would try generating query terms automatically - in a way, building a model whose interior was exposed to the user. We started by having users tag a set of documents for each node in the taxonomy, and we used Salience to retrieve n-grams from the tagged text. Next, we looked at the predictive power of each n-gram - if we just used that n-gram as a query, what would the precision and recall for that term be? Then we looked at what the P/R numbers were when we combined terms with OR, AND, NEAR (essentially a subset of AND) and NOT. Since we had the P/R numbers for each n-gram individually, we could set a desired precision or recall threshold and only add terms that passed that threshold, or passed only in combination with another n-gram.
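Here's a simplified, self-contained sketch of that idea. It uses naive tokenization rather than Salience's n-gram extraction, invents a tiny tagged set, only shows OR-combination, and the threshold values are arbitrary examples.

```python
# Simplified sketch of the query-generation idea: pull n-grams from tagged
# documents, score each as a standalone query by precision/recall against the
# tags, and keep only the terms that clear a threshold.
from collections import Counter
import re

def ngrams(text, n_max=2):
    """Yield unigrams and bigrams from a naively tokenized string."""
    tokens = re.findall(r"[a-z']+", text.lower())
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def precision_recall(term, docs, labels, target):
    """Treat a single n-gram as a query and score it against the tagged set."""
    hits = [label for doc, label in zip(docs, labels) if term in set(ngrams(doc))]
    if not hits:
        return 0.0, 0.0
    true_positives = sum(1 for label in hits if label == target)
    precision = true_positives / len(hits)
    recall = true_positives / sum(1 for label in labels if label == target)
    return precision, recall

# Tiny hand-tagged set, invented for illustration.
docs = [
    "romantic weekend away with my partner",
    "perfect spot for a romantic dinner",
    "ideal for a lads weekend away",
    "business trip, clean rooms, fast wifi",
]
labels = ["Romance", "Romance", "Other", "Other"]

# Candidate terms: every n-gram that appears in a positively tagged document.
candidates = Counter(
    g for doc, label in zip(docs, labels) if label == "Romance" for g in ngrams(doc)
)

MIN_PRECISION, MIN_RECALL = 0.9, 0.25  # arbitrary example thresholds
kept = [
    term for term in candidates
    if (pr := precision_recall(term, docs, labels, "Romance"))[0] >= MIN_PRECISION
    and pr[1] >= MIN_RECALL
]

print("generated query:", " OR ".join(sorted(kept)))
```

Note how "weekend away" gets rejected here because it also shows up in the "lads weekend away" document, which is exactly the one-off context problem described above.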

The initial results from this were promising - instead of having to create queries by hand, which is hard to do, users only had to tag documents, and they received back a query that met their P/R thresholds and was user-refinable.

Their next question was - how do I know how many documents to tag? What we did was use an increasingly large set of the tagged documents to generate queries, and look at the maximum P/R numbers we could derive from just that set. We then plotted those to illustrate how much was gained from each additional step up in the number of documents. These numbers varied from category to category, and also revealed which categories were going to be problematic. If doubling the number of tagged documents doesn't move the needle much, there's probably something inherently problematic with the category. An example of this was a catch-all "all other issues" category. It's difficult to describe what something isn't - what isn't a chair? An elephant? Comfy socks?
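In rough Python, the tagging-budget check looks something like the sketch below. The `generate_query` and `score_query` callables are placeholders for the query-generation and evaluation steps sketched earlier; they are not real Salience calls.

```python
# Sketch of the "how many documents do I need to tag?" check: regenerate the
# query from ever-larger slices of the tagged set and watch where the
# precision/recall curve flattens out.
from typing import Callable, List, Tuple

def tagging_curve(
    docs: List[str],
    labels: List[str],
    target: str,
    generate_query: Callable[[List[str], List[str], str], str],   # placeholder
    score_query: Callable[[str, List[str], List[str], str], Tuple[float, float]],  # placeholder
    step: int = 50,
) -> List[Tuple[int, float, float]]:
    points = []
    for n in range(step, len(docs) + 1, step):
        query = generate_query(docs[:n], labels[:n], target)          # build from the slice
        precision, recall = score_query(query, docs, labels, target)  # evaluate on everything tagged
        points.append((n, precision, recall))
    # If doubling n barely moves precision/recall, the category itself is
    # probably the problem (e.g. a catch-all "all other issues" bucket).
    return points
```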

Obviously, these results are content-specific. Would this technique work as well for longer-form content, or for a more generalized taxonomy? We haven't tried that yet - a content set where single n-grams are not predictive of categories would make this difficult.

In sum, generating queries from a tagged data set combines the easier task of tagging documents with the maintainability and transparency of queries.