Themes: Best of All Worlds







In the opinion mining process, themes are noun phrases with contextual relevance scores. Themes are extracted exactly as described in Noun Phrase Extraction. Once extracted, they are scored for contextual relevance using lexical chaining.











Theme Extraction and Scoring

First, potential themes are extracted based on part-of-speech patterns. Then the lexical chains (sentences chained together) are scored, and themes that belong to the highest-scoring chain get the highest scores. If there are fewer than four chains, the algorithm gracefully degrades to scoring purely by count.
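The scoring step can be sketched roughly as follows. This is a minimal illustration, not the Lexalytics implementation: the function name `score_themes`, the input shapes, and the rule that a theme inherits the score of the strongest chain it touches are all assumptions made for the example. Only the four-chain threshold and the count-based fallback come from the description above.

```python
MIN_CHAINS = 4  # from the text: with fewer chains, fall back to counting


def score_themes(theme_sentences, chains):
    """Score candidate themes (hypothetical sketch).

    theme_sentences: dict mapping a theme to the sentence indices where it occurs
    chains: list of (chain_score, set_of_sentence_indices) pairs (assumed shape)
    """
    if len(chains) < MIN_CHAINS:
        # Graceful degradation: score purely by occurrence count.
        return {theme: float(len(idxs)) for theme, idxs in theme_sentences.items()}

    scores = {}
    for theme, idxs in theme_sentences.items():
        # Assumption: a theme takes the score of the strongest chain that
        # contains any sentence in which the theme appears.
        best = 0.0
        for chain_score, members in chains:
            if members.intersection(idxs):
                best = max(best, chain_score)
        scores[theme] = best
    return scores
```

With four or more chains, a theme that sits in a sentence belonging to a strong chain outranks one that only appears in a weak, tangential chain. With fewer chains, the result is simply how often each theme occurs.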

“crazy good” and “stone cold crazy”

Noun phrase extraction would provide both phrases (assuming appropriate patterns). With theme extraction, their scores would differ depending on where they occurred in the text (e.g. whether they were associated with a central theme or with a tangential thread).

President Barack Obama did a great job with that awful oil spill.

The above sentence yields the same noun phrases as straight noun phrase extraction (great job, awful oil spill). However, the score for each would be highly dependent on where this sentence fits in the grand scheme of things. That is, if there were further sentences that referenced concepts relating to oil, they would boost the score of the “oil spill” theme.
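To make the boosting effect concrete, here is a toy sketch of chaining sentences by related words. Everything in it is an assumption for illustration: the hand-written `RELATED` word groups, the helper names, and the idea of scoring a chain by the number of sentences it links. A real lexical chainer would draw relatedness from a much richer resource.

```python
import re

# Hypothetical, hand-written groups of related words.
RELATED = {
    "oil": {"oil", "spill", "crude", "rig", "slick"},
    "job": {"job", "work", "performance"},
}


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def build_chains(sentences):
    """Map each concept to the indices of sentences mentioning a related word."""
    chains = {concept: [] for concept in RELATED}
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        for concept, vocab in RELATED.items():
            if words & vocab:
                chains[concept].append(i)
    return chains


def chain_scores(sentences):
    """Toy score: the number of sentences linked into each chain."""
    return {concept: len(idxs) for concept, idxs in build_chains(sentences).items()}
```

Fed the Obama sentence alone, the "oil" and "job" chains each link a single sentence. Adding further sentences that mention words such as "crude" or "rig" lengthens only the "oil" chain, so under this toy scoring the “oil spill” theme pulls ahead of “great job”.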

Consider the same article used in Noun Phrase Extraction:

Yahoo wants to make its Web e-mail service a place you never want to -- or more importantly -- have to leave to get your social fix.

The company on Wednesday is releasing an overhauled version of its Yahoo Mail Beta client that it says is twice as fast as the previous version, while managing to tack on new features like an integrated Twitter client, rich media previews and a more full-featured instant messaging client.

Yahoo says this speed boost should be especially noticeable to users outside the U.S. with latency issues, due mostly to the new version making use of the company's cloud computing technology. This means that if you're on a spotty connection, the app can adjust its behavior to keep pages from timing out, or becoming unresponsive.

Besides the speed and performance increase, which Yahoo says were the top users requests, the company has added a very robust Twitter client, which joins the existing social-sharing tools for Facebook and Yahoo. You can post to just Twitter, or any combination of the other two services, as well as see Twitter status updates in the update stream below. Yahoo has long had a way to slurp in Twitter feeds, but now you can do things like reply and retweet without leaving the page.

If asynchronous updates are not your thing, Yahoo has also tuned its integrated IM service to include some desktop software-like features, including window docking and tabbed conversations. This lets you keep a chat with several people running in one window while you go about with other e-mail tasks.

--Source: CNN

In this case, the top 5 themes are:

Theme                         Score
Cloud computing technology    4.11
Including window docking      2.976
Mail service                  2.672
Top users requests            2.66
Rich media previews           2.635

You can see that those themes do a reasonable job of conveying the actual context of the article. Contextual scoring helps determine what’s really important in the text, and the scores can be compared across many articles over time (to see what’s emerging, for example).

Specific to our text analytics software at Lexalytics, themes also carry the advantage of being scored for sentiment. This matters in a case like the President Obama sentence, where it’s important to distinguish between the positive perception of the President and the negative perception of the theme “oil spill”.
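As a rough illustration of why per-theme sentiment helps, the sketch below scores each theme phrase against a tiny polarity lexicon. The lexicon `POLARITY` and the function `theme_sentiment` are hypothetical and invented for this example. They do not reflect how Lexalytics scores sentiment.

```python
# Tiny hypothetical polarity lexicon; real sentiment scoring is far richer.
POLARITY = {"great": 1.0, "awful": -1.0}


def theme_sentiment(theme):
    """Sum the polarity of the words inside a theme phrase."""
    return sum(POLARITY.get(word, 0.0) for word in theme.lower().split())


themes = ["great job", "awful oil spill"]
sentiment = {theme: theme_sentiment(theme) for theme in themes}
```

Scoring the sentence as a whole would let the positive and negative words cancel out. Scoring each theme separately keeps “great job” positive and “awful oil spill” negative.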

Advantages to Theme Extraction and Scoring

  • Restricts results to phrases matching certain part-of-speech patterns, separating more wheat from the chaff
  • Scored based on contextual importance
  • Sentiment analysis scores for themes


Disadvantages to Theme Extraction and Scoring

  • Limited to words in the text (true for all algorithms)