Unstructured data, of which text data is a major part, is one of the three major sources of the data volume explosion that has occurred in the last dozen years.1 Nearly all of your communication is now in a digital format, from email to tweets and blogs.2 I was even able recently to change my phone plan, purchase a different high-speed Internet service, and correct a billing problem, all via an instant message chat session. Even when the communication happens via phone, it is likely being converted to a text format for storage and potential further analysis.
Text data, no matter the origin, is challenging to process: it must be converted from its raw form into a form suitable for modeling.
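To make the idea of converting raw text into a modeling-ready form concrete, here is a minimal sketch in Python. It assumes a simple bag-of-words representation, which is only one of many possible choices. The function names `tokenize` and `term_counts` and the sample messages are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_counts(documents):
    """Return one Counter of token frequencies per document."""
    return [Counter(tokenize(doc)) for doc in documents]


# Hypothetical raw messages, similar to a customer chat session.
messages = [
    "Please correct my billing problem.",
    "I want to change my phone plan!",
]

counts = term_counts(messages)

# Shared vocabulary across all documents, in a fixed (sorted) order.
vocabulary = sorted(set().union(*counts))

# Document-term matrix: one row per message, one column per vocabulary term.
matrix = [[c[term] for term in vocabulary] for c in counts]
```

Each row of `matrix` is a numeric vector that a modeling algorithm could consume. Real projects usually need more care than this, for example with punctuation, misspellings, and very common words.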
Working with text, or unstructured data, is one of the two major reasons that this area offers such a competitive advantage in many markets. It is sometimes very difficult to manipulate the data into a usable form. The recommended steps for getting data ready include:
Information retrieval is a needed action in almost ...