Extracting N-grams
In standard quantitative analysis of text, N-grams are sequences of N consecutive tokens (for example, words or characters). For instance, given the text "The quick brown fox jumped over the lazy dog", if our tokens are words, then the 1-grams are the, quick, brown, fox, jumped, over, the, lazy, and dog. The 2-grams are the quick, quick brown, brown fox, and so on. The 3-grams are the quick brown, quick brown fox, brown fox jumped, and so on. Just as the local statistics of a text allowed us to build a Markov chain for statistical prediction and text generation from a corpus, N-grams allow us to model the local statistical properties of our corpus. Our ultimate goal is to use the counts of N-grams to help us predict whether ...
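The word-level example above can be reproduced with a few lines of Python. This is a minimal sketch, not necessarily the approach the rest of the chapter takes: the helper name word_ngrams is hypothetical, and it assumes that tokens are obtained by lowercasing the text and splitting on whitespace.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return all word-level n-grams of `text`, in order of appearance.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The quick brown fox jumped over the lazy dog"

unigrams = word_ngrams(text, 1)
bigrams = word_ngrams(text, 2)
trigrams = word_ngrams(text, 3)

# Counting occurrences gives the N-gram counts the text refers to.
unigram_counts = Counter(unigrams)

print(bigrams[:3])
print(trigrams[:3])
print(unigram_counts["the"])
```

A text of T tokens yields T - N + 1 N-grams, so the nine-word sentence has nine 1-grams, eight 2-grams, and seven 3-grams. A Counter is a convenient container here because the counts, rather than the raw sequences, are what a downstream model would typically consume.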