May 2019
Intermediate to advanced
664 pages
15h 41m
English
To remove stop words from data in tidy format, you can use the stop_words data frame provided in the tidytext package. Load that tibble into the environment, then anti-join on word, which keeps only the rows whose word does not appear in stop_words:
> library(tidytext)
> data(stop_words)
> sotu_tidy <- sotu_unnest %>% dplyr::anti_join(stop_words, by = "word")
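To see what the anti-join does on a small scale, here is a minimal sketch on made-up data. The toy_tokens tibble and its contents are hypothetical and stand in for sotu_unnest; only stop_words and anti_join() come from the text above.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one token per row, mimicking the tidy format
toy_tokens <- tibble::tibble(
  word = c("the", "government", "of", "the", "people", "and", "congress")
)

# Load the stop word lexicon shipped with tidytext (columns: word, lexicon)
data(stop_words)

# Keep only the rows of toy_tokens whose word has no match in stop_words
toy_tidy <- toy_tokens %>% anti_join(stop_words, by = "word")

toy_tidy$word
```

The rows for "the", "of", and "and" are dropped, while "government", "people", and "congress" survive, in their original order.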
Notice that the data shrank from 1.97 million observations to 778,161. Now you can look at the top words. The following call only prints the counts, but you can assign the result to a data frame if you prefer:
> sotu_tidy %>% dplyr::count(word, sort = TRUE)
# A tibble: 29,558 x 2
   word           n
   <chr>      <int>
 1 government  7573
 2 congress    5759
 3 united      5102
 4 people      4219
 5 country     3564
 6 public      3413
 7 time        3138
 8 war ...
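If you do want to keep the counts rather than just print them, one option is to assign the result of count() to an object. The sketch below uses a hypothetical toy_tidy tibble in place of sotu_tidy, and the names word_counts and top_two are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for sotu_tidy: one token per row, stop words removed
toy_tidy <- tibble::tibble(
  word = c("war", "government", "congress", "government", "war", "government")
)

# Store the sorted counts instead of printing them
word_counts <- toy_tidy %>% count(word, sort = TRUE)

# The stored tibble can be reused, for example to keep the most frequent words
top_two <- head(word_counts, 2)

# Convert to a base data frame if downstream code expects one
word_counts_df <- as.data.frame(word_counts)
```

Because count() returns a tibble, word_counts can be filtered, joined, or plotted like any other data frame.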