Yes, it's a relatively boring topic.
However, if you're serious about your sentiment engine, it's important to have a rich model for shaping and forming the text that is fed to the various extraction and measurement functions in the sentiment analysis and text mining software. In other words, if this is wrong, nothing else will be right.
Having said that, most of our customers just leave everything in Text Preparation set to default parameters, and it works just fine.
The following are the important functions managed by the text preparation sections of the Lexalytics Salience Engine:
- Text Validation
- String Substitution
- Tokenization
- Part of Speech Tagging
- Stemming
A few important notes on each section:
- Text Validation: these are basically safety functions
Too big? If you feed it a terabyte of text, it will valiantly do its best to parse the content, but you're going to want to know how long that single, terabyte-sized document will take.
Is at least 70% of the content alphanumeric? This threshold is programmer-modifiable. We've found that 70% is a good ratio for long content, but if you're dealing with a lot of ugly Twitter content, you may want to lower it a bit so that "OMG!!!!! I L0VE!!!!?? FRUITY!!!! LOOPS!!!!!" has a chance of parsing correctly. Then again, you may not. In general, 70% works just fine.
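As a rough sketch of what such a ratio check might look like (the function names and the exact counting rules here are our own illustration, not Salience's implementation):

```python
def alnum_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)

def passes_validation(text: str, threshold: float = 0.70) -> bool:
    """Reject text whose alphanumeric ratio falls below the threshold."""
    return alnum_ratio(text) >= threshold

clean = "The quick brown fox jumps over the lazy dog."
noisy = "OMG!!!!! I L0VE!!!!?? FRUITY!!!! LOOPS!!!!!"
```

With the default 70% threshold, `clean` passes and `noisy` does not; drop the threshold to 0.45 and the noisy tweet squeaks through.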
- String Substitution: generally used for dealing with consistent errors in pre-processing that generate dirty content
- Can be used to replace acronyms with fully expanded text
- Replace any tags that your HTML scraper missed
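A minimal sketch of such a substitution pass, assuming a simple regex-based table (the patterns and the table here are hypothetical examples, not Salience configuration):

```python
import re

# Hypothetical substitution table: (pattern, replacement) pairs,
# applied in order to every incoming document.
SUBSTITUTIONS = [
    (re.compile(r"<[^>]+>"), " "),                            # strip leftover HTML tags
    (re.compile(r"\bIMF\b"), "International Monetary Fund"),  # expand an acronym
]

def substitute(text: str) -> str:
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    # Collapse any whitespace the replacements left behind.
    return re.sub(r"\s+", " ", text).strip()
```

So `substitute("The <b>IMF</b> reported growth.")` comes out as clean, fully expanded text ready for the later stages.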
- Tokenization: Used to break the document into meaningful chunks. It's important for us to work along boundaries like sentences.
- Sentence boundaries: You could use the output of the tokenizer to break your document into smaller chunks for your own use, if you desire - otherwise just let it do its thing. (So, if you wanted to do sentence-by-sentence sentiment analysis, and treat each sentence as its own document, you could do that. You'd lose the value of the surrounding sentences/lexical chains, but you could do it.)
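If you did want to go sentence-by-sentence, the idea looks roughly like this (a naive splitter for illustration only; Salience's tokenizer handles abbreviations and other edge cases this sketch does not):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "The cereal was great. The box art was ugly! Would I buy it again?"
sentences = split_sentences(doc)
# Each sentence could now be submitted as its own mini-document.
```

Again, scoring each sentence in isolation throws away the surrounding lexical chains, so do this only when per-sentence results are genuinely what you need.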
- Part-of-Speech Tagging: Reliable, robust tagging of parts-of-speech.
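To show the shape of a tagger's output, here's a toy lookup-based version (purely illustrative; real taggers, Salience's included, use statistical models rather than a tiny hand-built lexicon):

```python
# Hypothetical mini-lexicon mapping words to Penn Treebank-style tags.
LEXICON = {
    "the": "DT", "cereal": "NN", "tastes": "VBZ", "great": "JJ",
}

def tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Tag each token; unknown words default to NN (noun)."""
    return [(t, LEXICON.get(t.lower(), "NN")) for t in tokens]
```

Downstream sentiment phases lean on these tags, e.g. to find the adjectives and adverbs that carry most of the sentiment signal.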
- Stemming: We use a relatively light stemming algorithm based on the Krovetz stemmer.
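To give a flavor of light inflectional stemming (this is a crude sketch, not the actual Krovetz algorithm, which consults a dictionary before stripping a suffix):

```python
def light_stem(word: str) -> str:
    """Strip common inflectional endings, but only when a plausibly
    long stem (3+ characters) remains. A real Krovetz-style stemmer
    would also check the candidate stem against a dictionary."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

The point of a *light* stemmer is conservatism: "boxes" and "box" conflate, but you avoid the aggressive over-stemming of heavier algorithms that can merge unrelated words.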