Salience is roughly split into two parts: the software itself, and the supporting data directories that contain dictionaries, models, and other resources. This post discusses work specific to the data directory.
With their heavy use of emoticons, abbreviations, and loose grammar, micro-blogging services like Twitter present a difficult challenge for natural language processing systems. We spent the summer teaching our software to handle such content better. It did reasonably well before, but the improvements we made make processing this content significantly more useful.
1) Acronyms
We parsed thousands of tweets to collect the most common acronyms, then decided for each whether it is sentiment-bearing, should be expanded into its component words, or should simply be treated (from a part-of-speech perspective) as an interjection. For example:
LOL (laugh out loud): does not carry sentiment, nor does expanding it add any value to the resulting lexical processing, so we treat it as an interjection
FTW (for the win): carries positive sentiment
IDK (I don't know): is most useful when expanded out to its individual words
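The three treatments above can be sketched as a small lookup table. This is a hypothetical illustration, not Salience's actual data format; the table name and fields are invented for the example:

```python
# Hypothetical acronym-handling table: each entry says whether the acronym
# bears sentiment, should be expanded, or is a plain interjection.
ACRONYMS = {
    "LOL": {"handling": "interjection"},                        # no sentiment, no useful expansion
    "FTW": {"handling": "sentiment", "polarity": +1},           # "for the win" -> positive
    "IDK": {"handling": "expand", "expansion": "I don't know"}, # expand to individual words
}

def classify_token(token):
    """Return how a token should be treated, defaulting to a normal word."""
    entry = ACRONYMS.get(token.upper())
    return entry["handling"] if entry else "word"

print(classify_token("lol"))  # interjection
print(classify_token("idk"))  # expand
```

Anything not in the table falls through to normal word processing, so the table only has to cover the acronyms that actually showed up in the parsed tweets.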
2) Emoticons
Some emoticons are obviously positive or negative. Others, such as faces like :P, we use to push the sentiment towards neutral: not because they are inherently neutral, but because they tend to appear in contexts that would otherwise mislead sentiment analysis.
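One way to picture the neutral push is as blending the sentence score toward the emoticon's polarity. This is a minimal sketch of the idea, not Salience's actual implementation; the polarity values and blend weight are invented for illustration:

```python
# Illustrative emoticon polarities: :) and :( carry sentiment, while
# :P pulls the score toward neutral (polarity 0.0).
EMOTICON_POLARITY = {":)": 1.0, ":(": -1.0, ":P": 0.0}

def adjust_sentiment(base_score, emoticons, pull=0.5):
    """Blend the base score toward each emoticon's polarity."""
    score = base_score
    for face in emoticons:
        if face in EMOTICON_POLARITY:
            target = EMOTICON_POLARITY[face]
            score = score + pull * (target - score)
    return score

print(adjust_sentiment(0.8, [":P"]))  # pulled from 0.8 toward 0.0
```

A :P face thus dampens a strongly positive or negative base score rather than flipping it, which matches the intent of pushing content toward neutral.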
3) @ sign
We part-of-speech tag @ symbols as /MENTION, so that you can use this tag for further processing and reporting. In particular, calls for Entities will return @-tagged strings as people entities, with the associated sentiment, themes, etc.
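The idea of surfacing @-tagged strings as people entities can be sketched with a simple regex pass. This is a hypothetical illustration, not the Salience API; the function name and "Person" label are assumptions:

```python
import re

def extract_mention_entities(text):
    """Return @handles found in the text as (name, entity_type) pairs."""
    return [(m.group(), "Person") for m in re.finditer(r"@\w+", text)]

print(extract_mention_entities("hey @alice, have you met @bob?"))
# [('@alice', 'Person'), ('@bob', 'Person')]
```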
4) # sign
We part-of-speech tag # symbols as /HASHTAG.
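Taken together, the special tagging of both symbols might look like the following sketch. The tag names mirror the /MENTION and /HASHTAG tags described above, but the tokenizer itself is an invented simplification:

```python
import re

def tag_tweet(text):
    """Return (token, tag) pairs, tagging @handles and #hashtags specially."""
    tagged = []
    for token in text.split():
        if re.fullmatch(r"@\w+", token):
            tagged.append((token, "MENTION"))
        elif re.fullmatch(r"#\w+", token):
            tagged.append((token, "HASHTAG"))
        else:
            tagged.append((token, "WORD"))
    return tagged

print(tag_tweet("@user loves #nlp"))
# [('@user', 'MENTION'), ('loves', 'WORD'), ('#nlp', 'HASHTAG')]
```

Because mentions and hashtags get their own tags instead of being lumped in with ordinary nouns, downstream reporting can slice conversations by account or by topic.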
The combination of all of these allows for much richer reporting on the conversations occurring around, about, and between different accounts on Twitter.
We did some other data cleanup work as well, most significantly in the "famous people" data file, adding more people and a Wikipedia link for each person.