Python Data Analysis by Ivan Idris

Analyzing word frequencies

The NLTK FreqDist class encapsulates a dictionary that maps each word in a given list of words to its count. Load the Gutenberg text of Julius Caesar by William Shakespeare, then filter out stopwords and punctuation. In the snippet below, words is the list of tokens from that text and sw is the set of stopwords; both are assumed to be defined already:

import string

punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]
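
To try the filtering step in isolation, here is a minimal sketch. The token list and the tiny stopword set are hypothetical stand-ins for words and sw, which this excerpt does not define; with NLTK they would presumably come from the Gutenberg and stopwords corpora. The helper name filter_tokens is also made up for illustration.

```python
import string

# Hypothetical stand-ins: the excerpt never shows how `words` and `sw` are built.
sample_words = ["Caesar", ",", "beware", "the", "Ides", "of", "March", "."]
sample_sw = {"the", "of"}

punctuation = set(string.punctuation)


def filter_tokens(tokens, stopwords, punct):
    """Lowercase tokens, then drop stopwords and single punctuation marks."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in stopwords and t not in punct]


filtered_sample = filter_tokens(sample_words, sample_sw, punctuation)
print(filtered_sample)  # ['caesar', 'beware', 'ides', 'march']
```

Note that set(string.punctuation) holds single characters only, so a multi-character token such as "--" would pass through this filter unchanged.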

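Create a FreqDist object and print the five most frequent words with their counts. In NLTK 3, keys() is no longer sorted by frequency, so most_common() is used here:

fd = nltk.FreqDist(filtered)
top = fd.most_common(5)
print("Words", [word for word, _ in top])
print("Counts", [count for _, count in top])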

The keys and values are printed as follows:

Words ['d', 'caesar', 'brutus', 'bru', 'haue']
Counts [215, 190, 161, 153, 148]

The first word in this list is of course not an English word, so we may need to add the heuristic ...
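
The counting step can also be tried without downloading any corpus. The sketch below uses collections.Counter from the standard library as a stand-in for FreqDist (in NLTK 3, FreqDist is a Counter subclass, so most_common() is available there as well). The token list is made up for illustration and is not taken from the play.

```python
from collections import Counter

# Made-up, already filtered tokens; not taken from the play.
sample_filtered = ["caesar", "brutus", "caesar", "d", "caesar",
                   "brutus", "d", "d", "d"]

counts = Counter(sample_filtered)

# most_common(n) returns (word, count) pairs sorted by descending count.
top = counts.most_common(2)
top_words = [word for word, _ in top]
top_counts = [count for _, count in top]

print("Words", top_words)    # Words ['d', 'caesar']
print("Counts", top_counts)  # Counts [4, 3]
```

As in the real output, a single-character token can end up at the top of the list simply because it occurs often.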
