Determining Context: Clustering, N-grams, Part-of-Speech-Based Extraction, Themes, and Facets
The “bread and butter” text analytics operations answer three questions: “who is being discussed”, “what is the context of the conversation”, and “what is the tone/sentiment of that conversation”. This paper is specifically about the text analysis techniques available for answering the second question: determining context.
It turns out that nouns and noun phrases are the most generically useful parts of speech with which to determine context. To be even more specific, it’s those nouns that you’re not reaching through named entity recognition and extraction. Named entity extraction is a text analytics process that deals (roughly speaking) with proper nouns. Here we’re considering the nouns and noun phrases that, largely, are not proper nouns. Some proper nouns may still be picked up as part of the “noun analysis” that you’re doing, but only because they were not identified as named entities.
Consider the following sentence. It’s politically controversial, but gives a good example of how important it is to separately recognize the named entities and the context.
“President Barack Obama did a great job with that awful oil spill.”
Named entity recognition and extraction will give you “President Barack Obama” as a person. Sentiment analysis will note a positive sentiment pointed back towards the person “President Barack Obama”. However, without understanding the additional nouns, you’ll have no idea of the context in which President Barack Obama is receiving praise.
And so, other than a vague positive sentiment, you don’t really know anything. Compare that with knowing that some author (or someone being quoted by some author) is giving a thumbs up to President Barack Obama’s mad oil-spill-handling skillz.
Extracting non-entity phrases is an excellent next step to greater understanding of the content.
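To make the idea concrete, here is a minimal Python sketch of pulling non-entity phrases out of the example sentence. It is an illustration under stated assumptions, not a description of any particular product's method: the part-of-speech tags are assigned by hand, the entity span stands in for the output of a named entity recognizer, and the function name `non_entity_noun_phrases` is made up for this example. A real pipeline would get the tags and entity spans from a tagger and an entity extractor.

```python
# Hand-assigned Penn Treebank-style tags for the example sentence.
# ASSUMPTION: a real system would produce these with a POS tagger.
TAGGED = [
    ("President", "NNP"), ("Barack", "NNP"), ("Obama", "NNP"),
    ("did", "VBD"), ("a", "DT"), ("great", "JJ"), ("job", "NN"),
    ("with", "IN"), ("that", "DT"), ("awful", "JJ"), ("oil", "NN"),
    ("spill", "NN"), (".", "."),
]

# Token spans (start, end) already claimed by named entity recognition.
# ASSUMPTION: stands in for the output of an entity extractor.
ENTITY_SPANS = [(0, 3)]  # "President Barack Obama"


def non_entity_noun_phrases(tagged, entity_spans):
    """Collect runs of adjectives and nouns that end in a noun
    and do not overlap a recognized named entity."""
    phrases = []
    current = []

    def flush():
        # Drop trailing adjectives so every phrase ends in a noun.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for i, (word, tag) in enumerate(tagged):
        in_entity = any(start <= i < end for start, end in entity_spans)
        if not in_entity and (tag.startswith("NN") or tag == "JJ"):
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases


print(non_entity_noun_phrases(TAGGED, ENTITY_SPANS))
```

With the entity span excluded, what remains is “great job” and “awful oil spill”: the second phrase is exactly the context that the entity and the sentiment score alone do not supply.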
We’re going to talk about five computational techniques for extracting these contextual phrases: clustering, N-grams, noun-phrase extraction, themes, and facets.