Text preprocessing

We start by retaining only words made up entirely of letters, so that numbers such as 00 and 000 and letter-number combinations such as b8f are removed. The filter function is defined as follows:

>>> def is_letter_only(word):
...     for char in word:
...         if not char.isalpha():
...             return False
...     return True
...
>>> data_cleaned = []
>>> for doc in groups.data:
...     doc_cleaned = ' '.join(word for word in doc.split()
...                            if is_letter_only(word))
...     data_cleaned.append(doc_cleaned)

This generates a cleaned version of the newsgroups data, stored in data_cleaned.
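
To see the effect of the filter without loading the full dataset, here is a minimal sketch on two made-up documents. The names keep_letter_only_words and sample_docs are hypothetical and used only for illustration. The sketch calls str.isalpha() on the whole word, which gives the same result as the character-by-character check above for any token produced by split().

```python
def keep_letter_only_words(doc):
    """Drop every whitespace-separated token that is not purely alphabetic."""
    # str.isalpha() is True only if all characters are letters
    # and the string is non-empty; split() never yields empty tokens.
    return ' '.join(word for word in doc.split() if word.isalpha())

# Hypothetical sample documents standing in for groups.data
sample_docs = [
    "call 000 or 00 now",
    "serial b8f shipped in 1993",
]

sample_cleaned = [keep_letter_only_words(doc) for doc in sample_docs]
print(sample_cleaned)
```

Note that the filter works on whitespace-separated tokens, so a word with punctuation attached, such as "hello," with its trailing comma, is also dropped because the comma is not a letter.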
