A complete data science example – text classification

Here is a complete example that assigns each text to the correct category. We will use the 20 Newsgroups dataset, which was introduced in Chapter 1, First Steps. To make the task more realistic and to prevent the classifier from overfitting on message metadata rather than content, we'll remove email headers, footers (such as signatures), and quoted text. The goal in this case is to distinguish between two similar categories, sci.med and sci.space, and we will use accuracy to evaluate the classifier:

In: import nltk
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics ...
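
Since the listing above is cut off, the following is a minimal sketch of how the described workflow could look, not the book's original code. It assumes a plain TF-IDF representation fed into a linear `SGDClassifier` with default settings. The `nltk` preprocessing hinted at by the first import is not reproduced, and the helper names `train_and_score` and `main` are hypothetical, introduced here only for illustration.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def train_and_score(train_texts, train_labels, test_texts, test_labels):
    """Fit a TF-IDF + linear SGD pipeline and return (model, accuracy).

    Hypothetical helper: the hyperparameters are assumptions, not the
    values used in the book.
    """
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),  # drop common English words
        SGDClassifier(random_state=0),          # linear model trained with SGD
    )
    model.fit(train_texts, train_labels)
    accuracy = accuracy_score(test_labels, model.predict(test_texts))
    return model, accuracy


def main():
    """Run the sketch on the two newsgroups named in the text."""
    categories = ["sci.med", "sci.space"]
    # Strip headers, signatures, and quoted replies, as described above.
    remove = ("headers", "footers", "quotes")
    train = fetch_20newsgroups(subset="train", categories=categories,
                               remove=remove, shuffle=True, random_state=0)
    test = fetch_20newsgroups(subset="test", categories=categories,
                              remove=remove, shuffle=True, random_state=0)
    _, accuracy = train_and_score(train.data, train.target,
                                  test.data, test.target)
    print(f"Accuracy: {accuracy:.3f}")
```

Calling `main()` downloads the dataset on first use, so it needs network access. Keeping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training split only, so the test accuracy is not inflated by information from the test documents.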

