Converting unstructured to structured data

Notice that complaint data is in unstructured format. Certain text mining algorithms treat unstructured text as a bag of words, which means that in analyzing documents, one disregards semantics and grammar, and ends up treating each word as its own feature or variable.

An important data structure in text mining is a term document matrix (TDM), which simply indicates which words appear in each document.

The create_matrix() function will do this for us. However, before we do this, we will want to clean the data and impose some restrictions on the creation of the term document matrix.

First, we do not want to include any words, such as the, an, or it, that would not add any value to the TDM and would ...

Get Practical Predictive Analytics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.