Bigram Analysis

As previously mentioned, one issue that is frequently overlooked in unstructured text processing is the tremendous amount of information gained when you’re able to look at more than one token at a time, because so many concepts we express are phrases and not just single words. For example, if someone were to tell you that a few of the most common terms in a post are “open”, “source”, and “government”, could you necessarily say that the text is probably about “open source”, “open government”, both, or neither? If you had a priori knowledge of the author or content, you could probably make a good guess, but if you were relying totally on a machine to try to classify the nature of a document as being about collaborative software development or transformational government, you’d need to go back to the text and somehow determine which of the words most frequently occur after “open”—i.e., you’d like to find the collocations that start with the token “open”.
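To make that idea concrete, here's a minimal sketch of counting what follows a given token, using nothing but Python's standard library; the tokens list is a hypothetical stand-in for a tokenized post:

from collections import Counter

# Hypothetical tokens standing in for a tokenized post
tokens = ["open", "source", "software", "and", "open", "government",
          "are", "not", "the", "same", "as", "open", "source", "government"]

# Pair each token with its successor and tally what follows "open"
followers = Counter(nxt for cur, nxt in zip(tokens, tokens[1:])
                    if cur == "open")

print(followers.most_common())  # [('source', 2), ('government', 1)]

The pairs produced by zip(tokens, tokens[1:]) are exactly the bigrams of the text, which is the data structure formalized in the next paragraph.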

Recall from Chapter 6 that an n-gram is just a terse way of expressing each possible consecutive sequence of n tokens from a text, and it provides the foundational data structure for computing collocations. A text always contains n-1 fewer n-grams than it has tokens, so if you were to consider all of the bigrams (2-grams) for the sequence of tokens ["Mr.", "Green", "killed", "Colonel", "Mustard"], you'd have four possibilities: [("Mr.", "Green"), ("Green", "killed"), ("killed", "Colonel"), ("Colonel", "Mustard")].
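If you're working with NLTK (assumed installed here), computing those bigrams is a one-liner via its ngrams utility; the tokens are the example sequence above:

import nltk

tokens = ["Mr.", "Green", "killed", "Colonel", "Mustard"]

# nltk.ngrams yields each consecutive sequence of n tokens;
# for n=2 these are the four bigrams of the five-token sequence
print(list(nltk.ngrams(tokens, 2)))
# [('Mr.', 'Green'), ('Green', 'killed'),
#  ('killed', 'Colonel'), ('Colonel', 'Mustard')]

NLTK's nltk.collocations.BigramCollocationFinder builds on exactly this structure to score and rank candidate collocations across a larger text.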
