Notice that complaint data is in unstructured format. Certain text mining algorithms treat unstructured text as a bag of words, which means that in analyzing documents, one disregards semantics and grammar, and ends up treating each word as its own feature or variable.
An important data structure in text mining is a term document matrix (TDM), which simply indicates which words appear in each document.
The create_matrix() function will do this for us. However, before we do this, we will want to clean the data and impose some restrictions on the creation of the term document matrix.
First, we do not want to include any words, such as the, an, or it, that would not add any value to the TDM and would ...