July 2018
406 pages
We have already noticed one thing: real data is noisy. The newsgroup dataset is no exception. It even contains invalid characters that will raise a UnicodeDecodeError when the posts are decoded.
We have to tell the vectorizer to ignore them:
>>> vectorizer = StemmedTfidfVectorizer(min_df=10, max_df=0.5,
...     stop_words='english', decode_error='ignore')
>>> vectorized = vectorizer.fit_transform(train_data.data)
>>> num_samples, num_features = vectorized.shape
>>> print("#samples: %d, #features: %d" % (num_samples, num_features))
#samples: 3529, #features: 4712
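To see what ignoring invalid characters means in practice, the following minimal sketch uses only Python's built-in bytes.decode, whose errors argument accepts the same 'strict', 'ignore', and 'replace' values as the vectorizer's decode_error parameter. The raw_post bytes and the try_strict_decode helper are made up for illustration and are not part of the newsgroup dataset or of scikit-learn.

```python
# Hypothetical raw post: plain ASCII text plus one byte (0xff)
# that can never appear in valid UTF-8.
raw_post = b"Disk drive problems \xff after reboot"


def try_strict_decode(data):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        # errors="strict" is the default and raises on invalid bytes.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Strict decoding fails because of the invalid byte.
strict_result = try_strict_decode(raw_post)

# With errors="ignore", the invalid byte is silently dropped.
ignored = raw_post.decode("utf-8", errors="ignore")

print(strict_result)
print(ignored)
```

The trade-off is that dropped bytes are lost without any warning, which is acceptable here because a handful of stray characters will not change which words dominate a post.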
We now have a pool of 3,529 posts and, for each of them, a feature vector of 4,712 dimensions. That is what K-means takes as input. We will fix the number of clusters to 50 for this chapter, ...