Hands-On Natural Language Processing with Python by Rajalingappaa Shanmugamani, Rajesh Arumugam

Tokenization

Word tokens are the basic units of text in any NLP task, so the first step in processing text is to split it into tokens. NLTK provides several tokenizers for doing this. We will look at how to tokenize tweets from the twitter_samples corpus, which is available in NLTK. From now on, all of the illustrated code can be run in the standard Python interpreter on the command line:

>>> import nltk
>>> from nltk.corpus import twitter_samples as ts
>>> ts.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
>>> samples_tw = ts.strings('tweets.20150430-223406.json')
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In ...
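To make the idea of different tokenizers concrete, the following is a minimal sketch that compares three ways of splitting a tweet into tokens. The tweet text is a made-up example, not a sample from the corpus, and the choice of tokenizers (plain whitespace splitting, wordpunct_tokenize, and TweetTokenizer) is an assumption made for illustration rather than the book's own selection. Both NLTK tokenizers used here are rule-based, so the sketch should run without downloading any corpus data:

```python
from nltk.tokenize import TweetTokenizer, wordpunct_tokenize

# Hypothetical tweet, invented for this illustration (not from twitter_samples).
tweet = "@some_user NLTK makes tokenizing tweets easy! :) #nlp"

# Naive baseline: splitting on whitespace leaves punctuation attached to words.
whitespace_tokens = tweet.split()

# wordpunct_tokenize separates runs of word characters from runs of punctuation,
# so the '@' and '#' markers are split away from the handle and the hashtag.
wordpunct_tokens = wordpunct_tokenize(tweet)

# TweetTokenizer is aware of Twitter conventions and keeps handles,
# hashtags, and emoticons together as single tokens.
tweet_tokens = TweetTokenizer().tokenize(tweet)

print("whitespace:", whitespace_tokens)
print("wordpunct: ", wordpunct_tokens)
print("tweet:     ", tweet_tokens)
```

Comparing the three printed lists shows why the tokenizer should match the kind of text being processed: a general-purpose tokenizer breaks apart handles and hashtags that usually carry meaning in tweets. TweetTokenizer also accepts options such as strip_handles and reduce_len, which may be worth exploring when cleaning up tweets.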
