Data preparation

In this section, we will first preprocess the corpus for analysis and inspect it. We will then build the training and testing data frames.

Preprocessing and inspecting the corpus

We can see that the joint corpus contains 2,000 documents, as requested. We can now apply the preprocessing steps discussed in the preceding section. To do so, we will build a function that performs them all at once (we will reuse this function later in the chapter):

    install.packages("SnowballC")
    preprocess = function(corpus, stopwrds = stopwords("english")){
      library(SnowballC)
      corpus = tm_map(corpus, content_transformer(tolower))
      corpus = tm_map(corpus, removePunctuation)
      corpus = tm_map(corpus, content_transformer(removeNumbers))
      ...
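To see what the first three transformations in the listing do to a document, the following minimal sketch reproduces their effect in base R on a small character vector. It is an illustration only, not the chapter's function: `clean_text` and the sample strings are hypothetical, and the regular expressions are assumed stand-ins for tm's `removePunctuation` and `removeNumbers`.

```r
# Hypothetical helper: base-R stand-ins for the first three steps
# shown in preprocess(). Not part of tm or of the chapter's code.
clean_text <- function(x) {
  x <- tolower(x)                   # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)  # drop punctuation characters
  x <- gsub("[[:digit:]]+", "", x)  # drop digits
  x
}

# Made-up sample documents, for illustration only
docs <- c("Great movie, 10/10!", "Worst. Film. Ever... 2 stars")
cleaned <- clean_text(docs)
print(cleaned)
```

Note that removing punctuation and digits leaves stray spaces behind (for example, a double space where "2" used to be), so this sketch alone does not normalize whitespace.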
