May 2017
scikit-learn provides a number of sample datasets that we can use to train our model. In this case, we will use the 20 newsgroups posts. To load the posts, we will use the following lines of code:
from sklearn.datasets import fetch_20newsgroups
training_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
This call relies on a categories list, which must be defined before the posts are loaded. After we have trained our model, every prediction will belong to one of these categories:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
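The number of records we are going to use as training data is obtained as follows. The posts themselves are stored in the data attribute of the returned object, so that is the list we measure:
print(len(training_data.data))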
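The object returned by fetch_20newsgroups is dictionary-like (a scikit-learn Bunch), so it matters which length is measured. The sketch below uses a hypothetical stand-in class, FakeBunch, with made-up contents. Only the field names data, target, and target_names mirror the real object. It shows the difference between the length of the container and the length of its data field.

```python
# Hypothetical stand-in for the dictionary-like object returned by
# fetch_20newsgroups. The contents are invented for illustration.
class FakeBunch(dict):
    """Dictionary whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


sample = FakeBunch(
    data=["post one", "post two", "post three", "post four"],
    target=[0, 1, 0, 1],
    target_names=["alt.atheism", "sci.med"],
)

# len() on the container counts its keys (data, target, target_names).
number_of_fields = len(sample)

# len() on the data field counts the actual records.
number_of_posts = len(sample.data)

print(number_of_fields, number_of_posts)
```

In this stand-in the two values differ, which is why the record count should be read from the data field and not from the container itself.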
Machine learning algorithms do not mix well with textual attributes ...
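To make the point about textual attributes concrete, one common approach is to turn each text into a vector of word counts. The following is a minimal plain-Python sketch of that idea, not the scikit-learn API. The sample posts and the helper to_counts are assumptions invented for this example.

```python
from collections import Counter

# Made-up posts used only for illustration; they are not taken from the
# newsgroups data.
posts = [
    "the doctor prescribed medicine",
    "the image was rendered in 3d",
]

# Build a sorted vocabulary of every distinct word across all posts.
vocabulary = sorted({word for post in posts for word in post.lower().split()})


def to_counts(post):
    """Return one count per vocabulary word for the given post."""
    counts = Counter(post.lower().split())
    return [counts[word] for word in vocabulary]


# Each post becomes a list of integers with the same length as the vocabulary.
vectors = [to_counts(post) for post in posts]

print(vocabulary)
print(vectors)
```

Every post ends up as a fixed-length list of numbers, which is the kind of input a learning algorithm can work with.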