Chapter 9. Case Study: Analyzing Usenet Text

In our final chapter, we’ll use what we’ve learned in this book to perform a start-to-finish analysis of a set of 20,000 messages sent to 20 Usenet bulletin boards in 1993. The Usenet bulletin boards in this dataset include newsgroups for topics like politics, religion, cars, sports, and cryptography, and offer a rich set of text written by many users. This data set is publicly available at http://qwone.com/~jason/20Newsgroups/ (the 20news-bydate.tar.gz file) and has become popular for exercises in text analysis and machine learning.

Preprocessing

We’ll start by reading in all the messages from the 20news-bydate folder, which are organized in subfolders with one file for each message. We can read in files like these with a combination of read_lines(), map(), and unnest().

Warning

Note that this step may take several minutes to read all the documents.

library(dplyr)
library(tidyr)
library(purrr)
library(readr)
training_folder <- "data/20news-bydate/20news-bydate-train/"

# Define a function to read all files from a folder into a data frame
read_folder <- function(infolder) {
  data_frame(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# Use unnest() and map() to apply read_folder to each subfolder
raw_text <- data_frame(folder = dir(training_folder, full.names = TRUE)) %>%
  unnest(map(folder, read_folder)) %>%
  transmute(newsgroup = basename(

Get Text Mining with R now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.