We start by keeping only words made up entirely of letters, so that numbers such as 00 and 000 and letter-number combinations such as b8f are removed. The filter function is defined as follows:
>>> def is_letter_only(word):
...     for char in word:
...         if not char.isalpha():
...             return False
...     return True
...
>>> data_cleaned = []
>>> for doc in groups.data:
...     doc_cleaned = ' '.join(word for word in doc.split() if is_letter_only(word))
...     data_cleaned.append(doc_cleaned)
This produces a cleaned version of the newsgroups data, stored in data_cleaned.
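To see what the filter keeps and drops, it can help to run it on a couple of short strings before applying it to the full dataset. The following is a minimal sketch: `sample_docs` is a made-up stand-in for `groups.data`, and `is_letter_only` is rewritten with `all()`, which should behave the same as the loop version for the tokens produced by `split()`.

```python
def is_letter_only(word):
    """Return True only if every character in the word is a letter."""
    return all(char.isalpha() for char in word)


# Hypothetical sample documents standing in for groups.data
sample_docs = ["call 000 or b8f now", "hello world 00"]

cleaned = []
for doc in sample_docs:
    # Keep only the whitespace-separated tokens that consist purely of letters
    doc_cleaned = ' '.join(word for word in doc.split() if is_letter_only(word))
    cleaned.append(doc_cleaned)

print(cleaned)
```

Note that a token with attached punctuation, such as `end.`, also fails `isalpha()` and is dropped, because the text is split on whitespace only.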