October 2018
Intermediate to advanced
472 pages
We will use the pretrained Natural Language Toolkit (NLTK) tokenizer (http://www.nltk.org/index.html) and its list of English stop words to clean our corpus and extract the relevant unique words. We will also write a small module that takes the collection of unprocessed sentences and outputs the corresponding lists of words:
"""**Download NLTK tokenizer models (only the first time)**"""nltk.download("punkt")nltk.download("stopwords")def sentence_to_wordlist(raw): clean = re.sub("[^a-zA-Z]"," ", raw) words = clean.split() return map(lambda x:x.lower(),words)
Since we haven't yet captured the data from the text responses in our hypothetical business use case, let's collect a good quality ...