Text preprocessing

We start by keeping only the words made up entirely of letters, so that numbers such as 00 and 000 and letter-number combinations such as b8f are removed. The filter function is defined as follows:

>>> def is_letter_only(word):
...     for char in word:
...         if not char.isalpha():
...             return False
...     return True
...
>>> data_cleaned = []
>>> for doc in groups.data:
...     doc_cleaned = ' '.join(word for word in doc.split()
...                                      if is_letter_only(word))
...     data_cleaned.append(doc_cleaned)

This produces data_cleaned, a cleaned version of the newsgroups data in which each document keeps only its letter-only words.
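To see what the filter keeps and drops, here is a minimal sketch that applies the same logic to two made-up strings instead of the newsgroups data. The helper clean_doc and the sample_docs list are assumptions introduced only for this illustration. Because the text is split on whitespace, a word with attached punctuation (such as "Hello,") also fails the letter-only test and is dropped.

```python
def is_letter_only(word):
    """Return True if every character in word is alphabetic."""
    for char in word:
        if not char.isalpha():
            return False
    return True


def clean_doc(doc):
    # Hypothetical helper: keep only the whitespace-separated
    # tokens that consist entirely of letters.
    return ' '.join(word for word in doc.split() if is_letter_only(word))


# Made-up sample documents, standing in for groups.data
sample_docs = [
    "price is 000 for model b8f",
    "Hello, world again",
]

cleaned = [clean_doc(doc) for doc in sample_docs]
print(cleaned)
# ['price is for model', 'world again']
```

For non-empty tokens, the built-in word.isalpha() gives the same result as the explicit loop, so it could be used as a shorter equivalent.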
