N-Grams


N-grams are sequences of one or more consecutive words: one word is a mono-gram, two a bi-gram, three a tri-gram, and so on. It is rare to go beyond three unless you are looking for a specific slogan or turn of phrase. Words are not taken from any particular part-of-speech class, so you're going to get any and all strings. This matters because filtering out all words of a particular part-of-speech class (nouns, verbs, adjectives, adverbs, etc.) is often a simple way to improve your signal-to-noise ratio, and plain n-grams do not give you that filter.

Mono-grams vs. bi-grams vs. tri-grams

Consider these phrases: “crazy good” and “stone cold crazy”, as well as the original phrase “President Barack Obama did a great job with that awful oil spill”.

 

Phrases extracted from “crazy good” and “stone cold crazy”:

  • Mono-grams: crazy (2), cold, good, stone
  • Bi-grams: crazy good, cold crazy, stone cold
  • Tri-grams: stone cold crazy

Phrases extracted from the President Obama sentence:

  • Mono-grams: a, awful, barack, did, great, job, obama, oil, president, spill, that, with
  • Bi-grams: a great, awful oil, barack obama, did a, great job, job with, obama did,
    oil spill, president barack, that awful, with that
  • Tri-grams: a great job, awful oil spill, barack obama did, did a great, great job with,
    job with that, obama did a, president barack obama, that awful oil, with that awful

Results:

  • Mono-grams: not specific enough. Generally not used for “phrase extraction”, but good
    for other things.
  • Bi-grams: just right. Most often used.
  • Tri-grams: very specific, misses important phrase. Used, gives very specific phrases.
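The extraction above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: it assumes words are separated by whitespace and lower-cases everything, and the helper name `ngrams` is made up for this example.

```python
from collections import Counter

def ngrams(text, n):
    """Return every run of n consecutive words in text, lower-cased.

    Assumption: splitting on whitespace is a good enough tokenizer here.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrases = ["crazy good", "stone cold crazy"]

# Count mono-grams, bi-grams and tri-grams across both phrases.
counts_by_n = {
    n: Counter(gram for phrase in phrases for gram in ngrams(phrase, n))
    for n in (1, 2, 3)
}

for n, counts in counts_by_n.items():
    print(n, dict(counts))
```

A phrase shorter than n words simply contributes nothing, which is why “crazy good” produces no tri-grams.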

N-grams and stop words

The biggest problem with using n-grams for phrase extraction is that the algorithm is promiscuous: it returns every word sequence it sees, whether or not that sequence is interesting.

Stop words let you make a list of terms to exclude from analysis.  Classic stop words are things like: a, an, the, of, for, and…  In addition to these very common examples, each domain has a set of words that are statistically too common to be interesting.
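As a rough sketch of the idea, a stop list can be applied as a simple set lookup. The list below is illustrative only: it starts from the classic examples above, and the extra words “did”, “that” and “with” are an assumption added for this example, not a recommended list.

```python
# Hypothetical stop list: the classic examples plus a few extra words
# chosen for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "did", "that", "with"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

sentence = "President Barack Obama did a great job with that awful oil spill"
kept = remove_stop_words(sentence)
print(kept)
```

A domain-specific stop list would be built the same way, by adding the terms that are too common in your own data to be interesting.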

With most stop lists, all of the words “crazy”, “good”, “stone”, and “cold” would probably make it through, unless, perhaps, you were working on data for Cold Stone Creamery (for those not in the USA, that’s an ice cream parlo(u)r) and you’d stopped the words in your own name.

Now, it’s important to note that stopping the phrase “cold stone creamery” is very different from stopping the individual words “cold”, “stone”, and “creamery”, as follows:

In the “cold stone creamery” case, if you got the phrase “cold as a fish”, that phrase would make it through and be decomposed into n-grams as appropriate.

In the “cold”, “stone”, and “creamery” case, if you got the phrase “cold as a fish”, that phrase would be chopped down to just “fish” (as most stop lists will also have the words “as” and “a” in them, along with “cold”, “stone”, and “creamery”).
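The difference between the two cases can be sketched as follows. Both helper names (`stop_phrase`, `stop_words`) are hypothetical, and the sketch assumes whitespace tokenization and a stop list that contains “as” and “a”.

```python
def stop_phrase(tokens, phrase):
    """Drop each full occurrence of a multi-word phrase; leave other tokens alone."""
    target = phrase.lower().split()
    kept, i = [], 0
    while i < len(tokens):
        if tokens[i:i + len(target)] == target:
            i += len(target)        # skip the whole phrase
        else:
            kept.append(tokens[i])  # keep this token
            i += 1
    return kept

def stop_words(tokens, words):
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in words]

tokens = "cold as a fish".split()

# Case 1: only the full phrase is stopped, so nothing here is removed.
phrase_case = stop_phrase(tokens, "cold stone creamery")

# Case 2: the individual words are stopped, along with "as" and "a".
word_case = stop_words(tokens, {"as", "a", "cold", "stone", "creamery"})

print(phrase_case)
print(word_case)
```

In the first case all four words survive and go on to n-gram extraction; in the second only “fish” is left.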

N-gram stop words generally stop every n-gram in which they appear. For example, the phrase “for example” would be stopped if the word “for” were in the stop list (which it generally would be). In the case of “cold as a fish”, the phrase would be stopped out completely rather than reduced to “cold fish”, as “cold fish” is not the relevant phrase.
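One way to sketch this behavior is to discard any n-gram that contains at least one stop word, instead of deleting the stop words first and joining what is left. The stop list and the function name `filtered_ngrams` are assumptions made for this illustration.

```python
# Hypothetical stop list for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "as"}

def filtered_ngrams(text, n, stop_words=STOP_WORDS):
    """Return the n-grams of text, dropping any that contain a stop word."""
    words = text.lower().split()
    grams = [words[i:i + n] for i in range(len(words) - n + 1)]
    return [
        " ".join(gram)
        for gram in grams
        if not any(word in stop_words for word in gram)
    ]

print(filtered_ngrams("for example", 2))       # "for" stops the only bi-gram
print(filtered_ngrams("cold as a fish", 2))    # every bi-gram contains "as" or "a"
print(filtered_ngrams("stone cold crazy", 2))  # no stop words, nothing dropped
```

Because the stop words are never removed before the n-grams are formed, “cold” and “fish” are never adjacent, so the misleading bi-gram “cold fish” cannot appear.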

Advantages of N-grams

  • You’ll catch everything that you don’t stop out, without any regard to parts of speech or
    anything else
  • Computationally simple, conceptually easy to understand

Disadvantages

  • Promiscuous: requires a long list of stop words to be interesting
  • Simple count does not necessarily give an indication of “importance” to text or of
    its importance to an entity
  • Limited to words that appear in the text
