Here is how we go about preprocessing:
- Load the required packages:
load_packages <- c("janeaustenr", "tidytext", "dplyr", "stringr", "ggplot2", "wordcloud", "reshape2", "igraph", "ggraph", "widyr", "tidyr")
lapply(load_packages, require, character.only = TRUE)
- Load the Pride and Prejudice dataset. The line_num attribute is analogous to the line number printed in the book:
Pride_Prejudice <- data.frame(text = prideprejudice, book = "Pride and Prejudice", line_num = seq_along(prideprejudice), stringsAsFactors = FALSE)
- Now, perform tokenization to restructure the data from a one-string-per-row format to a one-token-per-row format. A token can be a single word, a group of characters, a run of co-occurring words (an n-gram), a sentence, a paragraph, and so on.
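To make the one-token-per-row idea concrete, here is a minimal sketch on a toy data frame rather than the full book. The `toy`, `toy_words`, and `toy_bigrams` names and the two sample sentences are made up for illustration; the only library call assumed is `unnest_tokens()` from tidytext, with its `token` and `n` arguments.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the book data: one string per row (illustrative text only)
toy <- data.frame(
  text = c("It is a truth universally acknowledged.",
           "However little known the feelings may be."),
  line_num = 1:2,
  stringsAsFactors = FALSE
)

# One word per row; by default unnest_tokens() lowercases the text
# and strips punctuation
toy_words <- toy %>%
  unnest_tokens(word, text)

# One bigram (two co-occurring words) per row
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

head(toy_words)
head(toy_bigrams)
```

The same call pattern should carry over to the `Pride_Prejudice` data frame, since it also has a `text` column; the other columns (`book`, `line_num`) are kept alongside each token.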