January 2019
Intermediate to advanced
378 pages
8h 27m
English
Let's start by creating a function that we can use to examine the most common tuples (n-grams). We'll set it up so that we can reuse it later on the body text as well. We'll build it with the Python Natural Language Toolkit (NLTK) library, which can be installed with pip (pip install nltk) if you don't already have it:
    from nltk.util import ngrams
    from nltk.corpus import stopwords
    import re

    def get_word_stats(txt_series, n, rem_stops=False):
        txt_words = []
        txt_len = []
        for w in txt_series:
            if w is not None:
                if rem_stops == False:
                    word_list = [x for x in ngrams(re.findall('[a-z0-9\']+', w.lower()), n)]
                else:
                    word_list = [y for y in ngrams([x for x in re.findall('[a-z0-9\']+', w.lower())\
                                                    if x not in stopwords.words('english')], n)]
                word_list_len = len(list(word_list))
                ...
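Because the listing above is cut off, here is a minimal, standard-library-only sketch of the same idea: tokenize each string, slide a window of size n over the tokens, and count the resulting tuples. It is not the book's function. The names top_ngrams, DEMO_STOPS, and k are hypothetical, the tiny stop word set stands in for NLTK's English stop word list, and zip is used in place of nltk.util.ngrams so that the example runs without extra downloads.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word set used only for this sketch;
# the listing above uses NLTK's stopwords.words('english') instead.
DEMO_STOPS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def top_ngrams(texts, n, rem_stops=False, k=3):
    """Return the k most common n-grams across an iterable of strings."""
    counts = Counter()
    for text in texts:
        if text is None:
            continue  # skip missing values, as the original loop does
        # Same token pattern as the listing: lowercase letters, digits, apostrophes
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        if rem_stops:
            tokens = [t for t in tokens if t not in DEMO_STOPS]
        # Zipping n staggered views of the token list yields sliding n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

headlines = ["The cat sat on the mat", None, "The cat ate the fish"]
print(top_ngrams(headlines, 2))
print(top_ngrams(headlines, 1, rem_stops=True))
```

Building the stop word collection once, outside the loop, avoids reloading it for every token, which is worth considering if the NLTK version feels slow on a large series.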