Building a text classifier

The goal of text classification is to assign text documents to predefined classes. It is one of the most widely used analysis techniques in NLP. We will use a technique based on a statistic called tf-idf, which stands for term frequency-inverse document frequency. This statistic measures how important a word is to a document within a collection of documents. The tf-idf values of a document's words form the feature vector that is used to categorize it. You can learn more about it at http://www.tfidf.com.
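
To make the statistic concrete, here is a minimal sketch of one common textbook form of tf-idf, written in plain Python. It assumes whitespace tokenization and the unsmoothed weight tf * log(N / df); the function name tf_idf and the three sample sentences are made up for illustration and are not part of the recipe. Library implementations such as scikit-learn's use a smoothed variant, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Assumption: lowercase, whitespace-separated tokens are good enough here.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency is the term's share of the document's tokens.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus, only to show how the weights behave.
docs = [
    "the car engine is loud",
    "the rocket engine is powerful",
    "the hockey game is tonight",
]
scores = tf_idf(docs)
```

A word that appears in every document ("the") gets a weight of zero, while a word unique to one document ("car") outweighs one shared by two documents ("engine"). That is the sense in which tf-idf captures how important a word is to a particular document.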

How to do it…

  1. Create a new Python file, and import the following package:
    from sklearn.datasets import fetch_20newsgroups
  2. Let's select a list of categories and name them using a dictionary mapping. These categories are available ...
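
The two steps above might look roughly like the following sketch. The category names used as keys are real 20 newsgroups identifiers, but which categories the recipe actually selects is cut off in the text, so this particular selection, the display labels, and the helper name load_training_data are illustrative assumptions only.

```python
# Illustrative mapping: keys are 20 newsgroups category names, values are
# human-readable labels. The recipe's own selection is not shown in the text.
category_map = {
    'rec.autos': 'Autos',
    'rec.sport.hockey': 'Hockey',
    'sci.space': 'Space',
}


def load_training_data(categories):
    """Return the training split of 20 newsgroups for the given categories."""
    # Imported inside the function so the mapping can be inspected without
    # scikit-learn installed; the dataset is downloaded on first use.
    from sklearn.datasets import fetch_20newsgroups

    return fetch_20newsgroups(
        subset='train',
        categories=list(categories),
        shuffle=True,
        random_state=5,  # arbitrary seed, chosen only for repeatability
    )
```

Calling load_training_data(category_map) returns an object whose data attribute holds the raw documents and whose target_names attribute lists the selected categories, which can then be translated to readable labels through category_map.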
