Before jumping to the model-building part, let's clean the input data:
- First, we create a custom function, clean_data(), that converts the messy raw text into a cleaned form. We will apply this function to both the reviews and their associated summaries, and then put the cleaned versions into a data frame for easy manipulation:
    # replace_contraction() and rm_stopwords() are assumed to come from the qdap package
    clean_data <- function(data, remove_stopwords = TRUE) {
      data <- tolower(data)
      data <- replace_contraction(data)                  # expand contractions
      data <- gsub('<br />', '', data)                   # drop HTML line breaks
      data <- gsub('[[:punct:] ]+', ' ', data)           # punctuation/space runs -> one space
      data <- gsub("[^[:alnum:]\\-\\.\\s]", " ", data)
      data <- gsub('&', '', data)
      data <- if (isTRUE(remove_stopwords)) {
        paste0(unlist(rm_stopwords(data, tm::stopwords("english"))), collapse = " ")
      } else {
        data
      }
      data
    ...
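To see what this kind of cleaning does to a single review, here is a minimal base-R sketch. It is not the clean_data() function itself: clean_text_sketch(), the tiny stop-word vector, and the two-row reviews data frame with its Text and Summary columns are all hypothetical stand-ins, and contraction expansion is left out so that the example runs without qdap or tm.

```r
# Hypothetical, simplified stand-in for clean_data(), using base R only.
# The stop-word list is a tiny illustrative sample, not tm::stopwords("english").
clean_text_sketch <- function(text, remove_stopwords = TRUE,
                              stopwords = c("the", "a", "is", "it", "and", "this")) {
  text <- tolower(text)
  text <- gsub("<br />", " ", text, fixed = TRUE)  # drop HTML line breaks
  text <- gsub("[^[:alnum:]]+", " ", text)         # non-alphanumeric runs -> one space
  words <- strsplit(trimws(text), " ", fixed = TRUE)[[1]]
  if (isTRUE(remove_stopwords)) {
    words <- words[!words %in% stopwords]
  }
  paste(words, collapse = " ")
}

# Hypothetical sample data; the column names are assumptions for illustration.
reviews <- data.frame(
  Text = c("This is GREAT!<br />Best coffee & tea, it is worth the price.",
           "Not good... the bag was torn."),
  Summary = c("Great coffee!", "Torn bag"),
  stringsAsFactors = FALSE
)

# Clean both columns and collect the results in a new data frame.
cleaned <- data.frame(
  text = vapply(reviews$Text, clean_text_sketch, character(1),
                USE.NAMES = FALSE),
  summary = vapply(reviews$Summary, clean_text_sketch, character(1),
                   remove_stopwords = FALSE, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(cleaned$text[1])
# [1] "great best coffee tea worth price"
```

Keeping stop words in the summaries here is only an illustrative choice that shows the remove_stopwords switch in use; whether the summaries should keep them depends on the model being built.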