Regular expressions can be a little challenging at first, but they are very powerful. They are a general-purpose abstraction, and largely the same pattern syntax works in many languages besides Python:
import re
re.split(r'\W+', 'Words, words, words.')
> ['Words', 'words', 'words', '']
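The regular expression \W+ matches one or more consecutive non-word characters, that is, anything other than letters, digits, and the underscore. Splitting on those runs leaves the words themselves; a trailing empty string appears whenever the text ends with a separator, as in the preceding result. We can apply the same pattern to our text: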
words_alphanumeric = re.split(r'\W+', text)
print(len(words_alphanumeric), len(words))
The output of the preceding code is (109111, 107431).
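To make the difference between \W (non-word characters) and \w (word characters) concrete, here is a minimal sketch on the same sample sentence. The variable names are illustrative only, and using re.findall with \w+ is an alternative shown for comparison, not the approach taken in the rest of this section:

```python
import re

sample = "Words, words, words."

# \W+ matches runs of NON-word characters, so split() cuts the string at them.
split_tokens = re.split(r"\W+", sample)

# \w+ matches runs of word characters, so findall() returns the words directly.
found_tokens = re.findall(r"\w+", sample)

# split() leaves an empty string when the text ends with a separator; drop it.
cleaned_tokens = [token for token in split_tokens if token]

print(split_tokens)
print(found_tokens)
print(cleaned_tokens)
```

Both approaches give the same words here. The only difference is the empty string that re.split leaves at the end, which is worth filtering out before counting tokens.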
Let’s preview the words we extracted:
print(words_alphanumeric[90:200])
The following is the output we got from the preceding code:
['BOHEMIA', 'I', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman', 'I', 'have', 'seldom', 'heard', 'him', ...