CHAPTER 20 Text Mining

In this chapter, we introduce text as a form of data. First, we discuss a tabular representation of text data in which each column is a word, each row is a document, and each cell is a 0 or 1, indicating whether that column’s word is present in that row’s document. Then we consider how to move from unstructured documents to this structured matrix. Finally, we illustrate how to integrate this process into the standard data mining procedures covered in earlier parts of the book.
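As a minimal sketch of the tabular representation just described, the following builds a presence/absence matrix with scikit-learn's CountVectorizer. The three short documents and the row labels doc1 to doc3 are hypothetical, invented only for illustration; they are not data from the book.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (illustration only, not from the book)
docs = [
    "this is the first sentence",
    "this is a second sentence",
    "the third sentence is here",
]

# binary=True stores 1 if a word occurs in a document and 0 otherwise,
# instead of the number of occurrences
vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Rows are documents, columns are words
table = pd.DataFrame(matrix.toarray(), columns=words,
                     index=["doc1", "doc2", "doc3"])
print(table)
```

With default settings, CountVectorizer lowercases the text and ignores single-character tokens, so the word "a" in the second document does not get a column of its own.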

Python

In this chapter, we use pandas for data handling and scikit-learn for feature creation and model building. The Natural Language Toolkit (nltk: https://www.nltk.org) is used for more advanced text processing.

Import required functionality for this chapter:

from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
# ENGLISH_STOP_WORDS is imported from sklearn.feature_extraction.text;
# the sklearn.feature_extraction.stop_words module was removed in newer scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
import nltk
from nltk import word_tokenize
from nltk.stem.snowball import EnglishStemmer
import matplotlib.pylab as plt
from dmba import ...
