Data preparation

In this section, we will start by preprocessing the corpus for analysis and then inspecting it. We will then build the training and testing data frames.

Preprocessing and inspecting the corpus

We can see that the joint corpus contains 2,000 documents, as we requested. We can now perform the steps we discussed in the preceding section. To do so, we will build a function that performs them all at once (we will reuse this function later in the chapter):

install.packages("SnowballC")
preprocess = function(corpus, stopwrds = stopwords("english")){
  library(SnowballC)
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, content_transformer(removeNumbers))
  ...
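
Because the listing above is cut off, the following is only a minimal sketch of how such a pipeline might be completed and tried on a toy corpus. The name preprocess_sketch, the two example sentences, and the last three steps (stop word removal, stemming, and whitespace stripping) are assumptions suggested by the stopwrds argument and the SnowballC dependency; they are not the book's actual code.

```r
# Assumes the tm and SnowballC packages are already installed.
library(tm)

# Hypothetical helper, not the book's preprocess(): the first three
# steps mirror the listing above, the rest are assumed.
preprocess_sketch <- function(corpus, stopwrds = stopwords("english")) {
  corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase
  corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
  corpus <- tm_map(corpus, removeNumbers)                 # drop digits
  # Assumed steps (elided in the original listing):
  corpus <- tm_map(corpus, removeWords, stopwrds)         # drop stop words
  corpus <- tm_map(corpus, stemDocument)                  # stem via SnowballC
  corpus <- tm_map(corpus, stripWhitespace)               # collapse spaces
  corpus
}

# Toy corpus standing in for the 2,000-document joint corpus.
toy <- VCorpus(VectorSource(c(
  "The 3 cats are running, quickly!",
  "Dogs barked at 10 loud cars."
)))

cleaned <- preprocess_sketch(toy)
length(cleaned)         # preprocessing does not change the document count
content(cleaned[[1]])   # inspect the first cleaned document
```

Checking length() before and after preprocessing is a quick way to confirm that no documents were dropped, and printing a document with content() shows whether the transformations behaved as intended.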
