20 TEXT MINING

In this chapter, we introduce unstructured text as a form of data. First, we discuss a tabular representation of text data in which each column is a word, each row is a document, and each cell is a 0 or 1, indicating whether that column's word is present in that row's document. Then, we consider how to move from unstructured documents to this structured matrix. Finally, we illustrate how to integrate this process into the standard machine learning procedures covered in earlier parts of the book.
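To make the tabular representation concrete, here is a minimal sketch in plain Python. It is not how JMP's Text Explorer works. The function name `binary_term_matrix`, the simple lowercase-and-split tokenization rule, and the three sample documents are all assumptions made for illustration.

```python
import re

def binary_term_matrix(documents):
    """Build a binary document-term matrix (hypothetical helper).

    Rows are documents, columns are words, and each cell is 1 if the
    column's word occurs in the row's document, otherwise 0.
    """
    # Assumed tokenization: lowercase, keep runs of letters and digits.
    tokenized = [set(re.findall(r"[a-z0-9]+", doc.lower())) for doc in documents]
    # The vocabulary (column order) is every distinct word, sorted.
    vocabulary = sorted(set().union(*tokenized))
    matrix = [
        [1 if term in tokens else 0 for term in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, matrix

# Invented sample documents, loosely modeled on support tickets.
docs = [
    "The router keeps dropping the connection",
    "Please update my billing address",
    "Connection dropped again after the update",
]

vocab, matrix = binary_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each printed row is one document and each position in the row corresponds to one word in `vocab`. Note that "dropping" and "dropped" end up in separate columns under this naive tokenization.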

Text Mining in JMP: The Text Explorer platform in JMP is used for text mining. Some basic methods for exploring unstructured text data are available in the standard version of JMP. However, JMP Pro is required for most of the topics introduced in this chapter.

20.1 INTRODUCTION

Up to this point, as in machine learning generally, we have dealt primarily with three types of data: numerical, binary (true/false), and multicategory.

In some common predictive analytics applications, though, data come in text form. An Internet service provider, for example, might want to use an automated algorithm to classify support tickets as urgent or routine so that the urgent ones can receive immediate human review. A law firm facing a massive discovery process (review of large numbers of documents) would benefit from a document review algorithm that could classify documents as relevant or irrelevant. In both of these cases, the predictor attributes (features) are embedded ...
