Bigram Analysis
As previously mentioned, one issue that is frequently overlooked in unstructured text processing is the tremendous amount of information gained when you're able to look at more than one token at a time, because so many of the concepts we express are phrases and not just single words. For example, if someone were to tell you that a few of the most common terms in a post are "open", "source", and "government", could you necessarily say that the text is probably about "open source", "open government", both, or neither? If you had a priori knowledge of the author or content, you could probably make a good guess, but if you were relying totally on a machine to classify the nature of a document as being about collaborative software development or transformational government, you'd need to go back to the text and somehow determine which of the words most frequently occurs after "open"; in other words, you'd want to find the collocations that start with the token "open".
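The idea of counting what follows a given token can be sketched with nothing but the standard library; the token list below is a hypothetical example, not data from the book:

```python
from collections import Counter

# Hypothetical token stream illustrating the "open source" vs. "open government" ambiguity.
tokens = ["open", "source", "software", "and", "open", "government",
          "data", "support", "open", "source", "communities"]

# Pair each token with its successor and count the successors of "open".
followers = Counter(b for a, b in zip(tokens, tokens[1:]) if a == "open")

print(followers.most_common())
# → [('source', 2), ('government', 1)]
```

Here the counts alone suggest the text leans toward "open source", which is exactly the kind of signal single-token frequencies cannot provide.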
Recall from Chapter 6 that an n-gram is just a terse way of expressing each possible consecutive sequence of n tokens from a text, and it provides the foundational data structure for computing collocations. There are always n-1 fewer n-grams than there are tokens in a text; if you were to consider all of the bigrams (2-grams) for the sequence of tokens ["Mr.", "Green", "killed", "Colonel", "Mustard"], you'd have four possibilities: [("Mr.", "Green"), ("Green", "killed"), ("killed", "Colonel"), ("Colonel", "Mustard")].
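Enumerating bigrams can be done with a one-line sliding window; this is a minimal standard-library sketch (NLTK provides an equivalent `nltk.bigrams` helper), and the `bigrams` function name is our own:

```python
def bigrams(tokens):
    """Return all consecutive token pairs (2-grams) as a list of tuples."""
    return list(zip(tokens, tokens[1:]))

tokens = ["Mr.", "Green", "killed", "Colonel", "Mustard"]
print(bigrams(tokens))
# → [('Mr.', 'Green'), ('Green', 'killed'), ('killed', 'Colonel'), ('Colonel', 'Mustard')]
```

Note that the result has len(tokens) - 1 entries, consistent with there always being n-1 fewer n-grams than tokens.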