N-grams are sequences of one or more consecutive words: a 1-gram is a monogram (also called a unigram), a 2-gram is a bi-gram, a 3-gram is a tri-gram, and so on. N is rarely more than 3, unless you are looking for a specific slogan or turn of phrase. N-grams are not restricted to any part-of-speech class, so you’re going to get any and all strings. This matters because you can often improve your signal-to-noise ratio simply by filtering out all words of a particular part-of-speech class (nouns, verbs, adjectives, adverbs, etc.), and plain n-grams do no such filtering.
Monograms vs. bi-grams vs. tri-grams
Consider these phrases: “crazy good” and “stone cold crazy” (as well as the original phrase “President Barack Obama did a great job with that awful oil spill”).
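To make the difference concrete, here is a minimal sketch of what each n-gram size yields for the two short phrases. The `ngrams` helper is a hypothetical name for illustration, and splitting on whitespace is a simplifying assumption; a real tokenizer would also handle punctuation.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for phrase in ("crazy good", "stone cold crazy"):
    for n in (1, 2, 3):
        print(n, ngrams(phrase, n))

# 1 ['crazy', 'good']
# 2 ['crazy good']
# 3 []
# 1 ['stone', 'cold', 'crazy']
# 2 ['stone cold', 'cold crazy']
# 3 ['stone cold crazy']
```

Note that “crazy good” has no tri-gram at all, while “stone cold crazy” only appears intact as a tri-gram; its bi-grams split the phrase into two pieces.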
N-grams and stop words
The biggest problem with n-grams as a phrase extraction technique is that the algorithm is promiscuous: it returns every word sequence it finds, relevant or not.
Stop words let you make a list of terms to exclude from analysis. Classic stop words are things like: a, an, the, of, for, and… In addition to these very common examples, each domain has a set of words that are statistically too common to be interesting.
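As a rough sketch of how a domain-specific stop list might be built, the snippet below flags words that show up in most documents of a collection and adds them to the classic list. The function names, the 0.8 threshold, and the sample reviews are all made up for illustration; "statistically too common" could be measured in other ways.

```python
from collections import Counter

CLASSIC_STOP_WORDS = {"a", "an", "the", "of", "for", "and"}

def domain_stop_words(documents, max_doc_fraction=0.8):
    """Return words that occur in more than max_doc_fraction of the documents."""
    doc_counts = Counter()
    for doc in documents:
        # Count each word once per document, however often it repeats.
        doc_counts.update(set(doc.lower().split()))
    cutoff = max_doc_fraction * len(documents)
    return {word for word, count in doc_counts.items() if count > cutoff}

def remove_stop_words(text, stop_list):
    """Drop every token that appears in the stop list."""
    return [t for t in text.lower().split() if t not in stop_list]

# Hypothetical review snippets for a pizza business.
reviews = [
    "the pizza was great",
    "pizza delivery was slow",
    "best pizza in town",
    "the pizza crust was soggy",
]
stop_list = CLASSIC_STOP_WORDS | domain_stop_words(reviews)
print(remove_stop_words("the pizza crust was soggy", stop_list))
```

In this toy collection “pizza” appears in every review, so it carries no signal and ends up on the stop list alongside the classic entries.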
With most stop lists, all of the words “crazy”, “good”, “stone”, and “cold” would probably make it through. Unless, perhaps, you were working on data for “Cold Stone Creamery” (for those not in the USA, that’s an ice cream parlo(u)r), and you’d stopped the words in your name.
Now, it’s important to note that stopping the phrase “cold stone creamery” is very different from stopping the individual words “cold”, “stone”, and “creamery”, as follows:
In the “cold stone creamery” case, if you got the phrase “cold as a fish”, that phrase would make it through and be decomposed into n-grams as appropriate.
In the “cold”, “stone”, and “creamery” case, if you got the phrase “cold as a fish”, that phrase would be chopped down to just “fish” (as most stop lists will also have the words “as” and “a” in them, along with “cold”, “stone”, and “creamery”).
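The two cases can be sketched as follows. This is an illustrative approximation, not a specific product's behavior: `stop_phrase` and `stop_words` are hypothetical helpers, and the small common stop list is an assumption.

```python
COMMON_STOP_WORDS = {"a", "an", "the", "of", "for", "and", "as"}

def stop_phrase(text, phrase):
    """Remove whole occurrences of phrase; its words survive elsewhere."""
    tokens = text.lower().split()
    target = phrase.lower().split()
    if not target:
        return tokens
    kept, i = [], 0
    while i < len(tokens):
        if tokens[i:i + len(target)] == target:
            i += len(target)  # skip the entire matched phrase
        else:
            kept.append(tokens[i])
            i += 1
    return kept

def stop_words(text, stop_list):
    """Remove every individual token that is in stop_list."""
    return [t for t in text.lower().split() if t not in stop_list]

# Case 1: only the full phrase is stopped, so "cold" on its own survives.
print(stop_phrase("cold as a fish", "cold stone creamery"))
# ['cold', 'as', 'a', 'fish']

# Case 2: each word is stopped, along with the common stop words.
word_list = COMMON_STOP_WORDS | {"cold", "stone", "creamery"}
print(stop_words("cold as a fish", word_list))
# ['fish']
```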
N-gram stop words generally stop entire phrases in which they appear. For example, the phrase “for example” would be stopped if the word “for” were in the stop list (which it generally would be). Likewise, the phrase “cold as a fish” would be completely stopped out rather than reduced to “cold fish”, as “cold fish” is not the relevant phrase.
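One way to read this rule, sketched below for bi-grams and longer, is to build the n-grams first and then discard any n-gram that contains a stop word, instead of deleting stop words from the text and joining what is left. The helper name and the stop list are assumptions for illustration.

```python
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "as"}

def ngrams_without_stop_words(text, n, stop_words=STOP_WORDS):
    """Build n-grams, then drop every n-gram that contains a stop word."""
    tokens = text.lower().split()
    grams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return [" ".join(g) for g in grams if not any(t in stop_words for t in g)]

print(ngrams_without_stop_words("for example", 2))       # []
print(ngrams_without_stop_words("cold as a fish", 2))    # []
print(ngrams_without_stop_words("stone cold crazy", 2))  # ['stone cold', 'cold crazy']
```

Because filtering happens after the n-grams are formed, “cold” and “fish” are never adjacent, so the misleading bi-gram “cold fish” is never produced.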
Advantages of N-grams