Chapter 4. Relationships Between Words: N-grams and Correlations
So far we’ve considered words as individual units, and examined their relationships to sentiments or to documents. However, many interesting text analyses are based on the relationships between words, whether that means examining which words tend to immediately follow others, or which words tend to co-occur within the same documents.
In this chapter, we’ll explore some of the methods tidytext offers for
calculating and visualizing relationships between words in your text
dataset. This includes the token = "ngrams" argument, which tokenizes
by pairs of adjacent words rather than by individual ones. We’ll also
introduce two new packages: ggraph, by Thomas Pedersen,
which extends ggplot2 to construct network plots, and
widyr, which calculates pairwise
correlations and distances within a tidy data frame. Together these
expand our toolbox for exploring text within the tidy data framework.
Tokenizing by N-gram
We’ve been using the unnest_tokens function to tokenize by word, or
sometimes by sentence, which is useful for the kinds of sentiment and
frequency analyses we’ve been doing so far. But we can also use the
function to tokenize into consecutive sequences of words, called
n-grams. By seeing how often word X is followed by word Y, we can then
build a model of the relationships between them.
We do this by adding the token = "ngrams" option to unnest_tokens(),
and setting n to the number of words we wish to capture in each n-gram. ...
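As a minimal sketch of this idea, the snippet below tokenizes a tiny made-up data frame into bigrams (n = 2) and then counts them. The data and the names toy_text, toy_bigrams, and bigram_counts are illustrative assumptions, not taken from the book's own examples; only unnest_tokens() with token = "ngrams" and n comes from the text above.

```r
library(dplyr)
library(tidytext)

# Toy data (hypothetical, for illustration only): one row per line of text
toy_text <- tibble(
  line = 1:2,
  text = c("the quick brown fox", "the quick red fox")
)

# Tokenize each line into overlapping pairs of adjacent words (n = 2)
toy_bigrams <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Count how often each bigram occurs, most frequent first
bigram_counts <- toy_bigrams %>%
  count(bigram, sort = TRUE)

bigram_counts
```

Note that the bigrams overlap: "quick brown" and "brown fox" are both tokens, so each word other than the first and last in a line appears in two bigrams. In this toy data, "the quick" is the only bigram that occurs more than once.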