Chapter 9. Case Study: Analyzing Usenet Text
In our final chapter, we’ll use what we’ve learned in this book to perform a start-to-finish analysis of a set of 20,000 messages sent to 20 Usenet bulletin boards in 1993. The Usenet bulletin boards in this dataset include newsgroups for topics like politics, religion, cars, sports, and cryptography, and offer a rich set of text written by many users. The dataset is publicly available at http://qwone.com/~jason/20Newsgroups/ (the 20news-bydate.tar.gz file) and has become popular for exercises in text analysis and machine learning.
Preprocessing
We’ll start by reading in all the messages from the 20news-bydate
folder, which are organized in subfolders with one file for each
message. We can read in files like these with a combination of
read_lines(), map(), and unnest().
Warning
Note that this step may take several minutes to read all the documents.
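Before running the pattern on the full corpus, it can help to try it on a tiny example that finishes instantly. The following is a minimal sketch, not the chapter's own code: the temporary folder, the two file names (1001 and 1002), and their contents are hypothetical stand-ins for one newsgroup subfolder, and it assumes a recent tidyr where unnest() takes the list column by name.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
library(tibble)

# Hypothetical toy corpus: two short "messages" in a fresh temporary folder
demo_folder <- tempfile("demo_newsgroup")
dir.create(demo_folder)
write_lines(c("From: alice", "Hello world"), file.path(demo_folder, "1001"))
write_lines(c("From: bob", "Re: Hello", "Hi Alice"), file.path(demo_folder, "1002"))

# Start with one row per file, read each file into a list column of lines,
# then unnest so that every line of every file becomes its own row
demo_text <- tibble(file = dir(demo_folder, full.names = TRUE)) %>%
  mutate(text = map(file, read_lines)) %>%
  transmute(id = basename(file), text) %>%
  unnest(text)

demo_text
```

The result has one row per line of text (five rows here), with the id column recording which file each line came from. Keeping the file name as an id is what later allows lines to be grouped back into messages.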
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
training_folder <- "data/20news-bydate/20news-bydate-train/"

# Define a function to read all files from a folder into a data frame
read_folder <- function(infolder) {
  data_frame(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# Use unnest() and map() to apply read_folder to each subfolder
raw_text <- data_frame(folder = dir(training_folder, full.names = TRUE)) %>%
  unnest(map(folder, read_folder)) %>%
  transmute(newsgroup = basename(