Learning from text – Naive Bayes for Natural Language Processing

In this recipe, we show how to handle text data with scikit-learn. Working with text requires careful preprocessing and feature extraction, and the resulting feature matrices are usually highly sparse.
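To make the link between feature extraction and sparsity concrete, here is a minimal sketch that turns a few comments into a TF-IDF document-term matrix. The three comments are hypothetical toy examples, not rows from the dataset used in this recipe, and TF-IDF is only one possible feature extraction scheme, assumed here for illustration.

```python
import scipy.sparse as sp
import sklearn.feature_extraction.text as text

# Hypothetical toy comments, not taken from the recipe's dataset.
comments = [
    "You are a complete idiot",
    "Thanks for the helpful answer",
    "I disagree with your argument",
]

# Lowercase and tokenize each comment, then weight every term by TF-IDF.
tf = text.TfidfVectorizer()
X = tf.fit_transform(comments)

# Each row only has nonzero entries for the words of that comment,
# so the matrix is stored in a SciPy sparse format.
density = X.nnz / (X.shape[0] * X.shape[1])
print(type(X).__name__, X.shape, f"density={density:.2f}")
```

With a realistic vocabulary of thousands of words, the density drops much further, which is why sparse storage matters for text.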

We will learn to recognize whether a comment posted during a public discussion is considered insulting to one of the participants. We will use a labeled dataset from Impermium, released during a Kaggle competition (see http://www.kaggle.com/c/detecting-insults-in-social-commentary).
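As a rough preview of the kind of classifier this task calls for, the sketch below trains a Naive Bayes model on a handful of hand-written comments. The comments and labels are hypothetical stand-ins for the real dataset, and the choice of `BernoulliNB` with TF-IDF features is an assumption made for illustration rather than something stated in the text above.

```python
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb

# Hypothetical toy data: 1 = insulting, 0 = not insulting.
comments = [
    "you are an idiot",
    "what a stupid idiot you are",
    "you are so stupid",
    "thanks for the great answer",
    "great point, thanks for sharing",
    "I agree with this answer",
]
labels = [1, 1, 1, 0, 0, 0]

# Turn the raw comments into a sparse feature matrix.
tf = text.TfidfVectorizer()
X = tf.fit_transform(comments)

# BernoulliNB binarizes the features, so it models word presence or absence.
bnb = nb.BernoulliNB()
bnb.fit(X, labels)

# New comments must go through the same fitted vectorizer.
X_new = tf.transform(["you stupid idiot", "thanks, great answer"])
pred = bnb.predict(X_new)
print(pred)
```

Six comments are far too few to say anything about accuracy; the point is only the shape of the pipeline: vectorize, fit, then transform and predict on unseen text.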

How to do it...

  1. Let's import our libraries:
    >>> import numpy as np
    >>> import pandas as pd
    >>> import sklearn
    >>> import sklearn.model_selection as ms
    >>> import sklearn.feature_extraction.text as text
    >>> import sklearn.naive_bayes ...

