CHAPTER 10 Text Analytics

Unstructured data, of which text data is a major part, is one of the three major sources for the data volume explosion that has occurred in the last dozen years. Nearly all of your communication is now in a digital format, from email to tweets and blogs. I was even able recently to change my phone plan, purchase a different high-speed Internet service, and correct a billing problem, all via an instant message chat session. Even when the communication happens via phone, it is likely being converted to a text format for storage and potential further analysis.

Text data, no matter its origin, is challenging to process and to convert from its raw form into one suitable for modeling.

Working with text, or unstructured data, is one of the two major reasons that this area offers such a competitive advantage in many markets. It is sometimes very difficult to manipulate the data into a usable form. The recommended steps for getting data ready include:

  • Identify the problem you are trying to solve. This may sound simple, but many projects fail because they lack a clear scope and outcome, then drift out of control until they are never heard from again.
  • Identify the potential sources of your data. If the project is purely text analytics, the sources will be unstructured. If it is predictive modeling, the sources will likely include both structured and unstructured data.

INFORMATION RETRIEVAL

Information retrieval is a needed action in almost ...
