
Hands-On Natural Language Processing with Python by Rajalingappaa Shanmugamani, Rajesh Arumugam


Exploratory analysis of text

Once the data is tokenized, a common first analysis is to count the words or tokens and examine their distribution in the document. These counts give a rough indication of the document's main topics. Let's start by analyzing the web text data that comes with NLTK:

>>> import nltk
>>> from nltk.corpus import webtext
>>> webtext_sentences = webtext.sents('firefox.txt')
>>> webtext_words = webtext.words('firefox.txt')
>>> len(webtext_sentences)
1142
>>> len(webtext_words)
102457

Note that we have loaded only the text from the Firefox discussion forum (firefox.txt), although the web text corpus also contains other data, such as advertisements and movie script text. The preceding code output gives ...
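The token counting described at the start of this section can be sketched with Python's standard library alone. This is a minimal illustration, not the book's own code: the helper name token_frequencies and the sample tokens are hypothetical, and the sample stands in for a real token list such as webtext_words.

```python
from collections import Counter


def token_frequencies(tokens, top_n=1):
    """Return the top_n most common alphabetic tokens (lowercased) with counts.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Keep only alphabetic tokens so punctuation does not dominate the counts,
    # and lowercase them so "The" and "the" are counted together.
    words = [token.lower() for token in tokens if token.isalpha()]
    return Counter(words).most_common(top_n)


# Made-up sample tokens standing in for a tokenized document.
sample_tokens = [
    "Firefox", "opens", "a", "new", "tab", ".",
    "The", "tab", "crashes", "Firefox", ",",
    "then", "the", "tab", "reloads", ".",
]

top_tokens = token_frequencies(sample_tokens)
print(top_tokens)  # [('tab', 3)]
```

The same function should work on any sequence of string tokens, so passing webtext_words in place of sample_tokens is one way to apply it to the Firefox text. NLTK's FreqDist class offers similar counting with a most_common method, if you prefer to stay within the library.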
