How to do it...

Here is how we go about preprocessing:

  1. Load the required packages:
load_packages <- c("janeaustenr", "tidytext", "dplyr", "stringr", "ggplot2", "wordcloud", "reshape2", "igraph", "ggraph", "widyr", "tidyr")
# require() returns FALSE (with a warning) for any package that is not installed
lapply(load_packages, require, character.only = TRUE)
  2. Load the Pride and Prejudice dataset. The line_num column is analogous to the line number printed in the book:
Pride_Prejudice <- data.frame(text = prideprejudice,
                              book = "Pride and Prejudice",
                              line_num = seq_along(prideprejudice),
                              stringsAsFactors = FALSE)
  3. Now, perform tokenization to restructure the one-string-per-row format into a one-token-per-row format. Here, a token can be a single word, a group of characters, co-occurring words (n-grams), sentences, paragraphs, ...
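
The tokenization step described above can be sketched with `unnest_tokens()` from tidytext. This is a minimal illustration, not necessarily the book's own code: `sample_lines` is a hypothetical two-line stand-in with the same columns as `Pride_Prejudice`, and the output column names `word` and `bigram` are assumptions chosen for the example.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line sample with the same columns as Pride_Prejudice
sample_lines <- data.frame(
  text = c("It is a truth universally acknowledged,",
           "that a single man in possession of a good fortune"),
  book = "Pride and Prejudice",
  line_num = 1:2,
  stringsAsFactors = FALSE
)

# One word per row; by default unnest_tokens() lowercases the text
# and strips punctuation
word_tokens <- sample_lines %>%
  unnest_tokens(output = word, input = text)

# One two-word n-gram (bigram) per row
bigram_tokens <- sample_lines %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

head(word_tokens)
head(bigram_tokens)
```

Replacing `sample_lines` with `Pride_Prejudice` should apply the same restructuring to the full text. The `book` and `line_num` columns are carried along, so each token can still be traced back to its source line.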
