Text analytics is not a perfect art, and very occasionally it fails so spectacularly that we feel we need to share these rare moments with you. So here is the first installment of “Text Analytics Fails.” Enjoy!
Here we were, categorizing hotel reviews into lifestyle categories such as “romantic” and “trendy.” Our categorization model for this particular set of data told us that we could expect “weekend away” and “romantic weekend” to be good clues that the review was about romance. Our favorite fail so far on this data set was the following review, categorized by us as “likely romantic”:
“Okay, so the reviews on here may not be the best, however for you pay this is a decent place to lay your head. Its clean and decent . No frills and the location could not be better. Ideal for a lads weekend away. Take your partner for a romantic weekend and expect a divorce.”
This is a good example of both negation (implied by the divorce comment) and the importance of context (“lads weekend away” is not the same as “weekend away” by itself). Adding something like “NOT divorce” to the categorizer likely wouldn’t help much, both because this review is a one-off and because plenty of genuinely romantic reviews mention divorce, such as “I’m a divorced single and looking for a romantic place to take my new interest.”
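To make the failure concrete, here is a minimal Python sketch of a phrase-matching categorizer. It is not our actual model: the function names, the `blockers` parameter, and the rule logic are all hypothetical, and only the two clue phrases come from the example above. It shows how the bare clues misfire, how a simple context check handles “lads weekend away,” and why a blanket “NOT divorce” rule would throw out the divorced-single review.

```python
import re

# The two clue phrases mentioned in the post; everything else here is illustrative.
ROMANCE_CLUES = ["weekend away", "romantic weekend"]


def naive_romance_hits(text):
    """Return the clue phrases found in the text, ignoring all context."""
    lowered = text.lower()
    return [clue for clue in ROMANCE_CLUES if clue in lowered]


def context_aware_hits(text, blockers=("lads",)):
    """Like naive_romance_hits, but skip a clue when the word right before it is a blocker."""
    lowered = text.lower()
    hits = []
    for clue in ROMANCE_CLUES:
        for match in re.finditer(re.escape(clue), lowered):
            preceding = lowered[:match.start()].split()
            if preceding and preceding[-1] in blockers:
                continue  # e.g. "lads weekend away" is not a romance clue
            hits.append(clue)
    return hits


def not_divorce_rule(text):
    """Hypothetical blanket rule: veto 'romantic' whenever divorce is mentioned."""
    return "divorce" not in text.lower()


review = ("Ideal for a lads weekend away. Take your partner for a "
          "romantic weekend and expect a divorce.")
divorced = ("I'm a divorced single and looking for a romantic place "
            "to take my new interest.")

print(naive_romance_hits(review))   # both clues fire
print(context_aware_hits(review))   # only "romantic weekend" remains
print(not_divorce_rule(divorced))   # False: the rule vetoes this review too
```

Even with the context check, “romantic weekend” still fires on the sarcastic sentence, and the divorce veto that would catch it also rejects the divorced-single review. That is the trade-off described above.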
If you’re interested in how consistency can matter more than perfect accuracy, check out Craig Golightly’s talk from 2013’s Lexalytics User Group.