July 2018
Word tokens are the basic units of text in any NLP task. The first step when processing text is to split it into tokens, and NLTK provides different types of tokenizers for doing this. We will look at how to tokenize Twitter comments from the twitter_samples corpus available in NLTK. From now on, all of the illustrated code can be run using the standard Python interpreter on the command line:
>>> import nltk
>>> from nltk.corpus import twitter_samples as ts
>>> ts.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
>>> samples_tw = ts.strings('tweets.20150430-223406.json')
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In ..."
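Once a tweet string has been retrieved, it can be split into tokens. A minimal sketch of this step, assuming NLTK is installed, uses NLTK's TweetTokenizer, which is designed for Twitter text and keeps handles, hashtags, and emoticons as single tokens rather than breaking them apart:

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer needs no downloaded corpora; it works on any string.
tokenizer = TweetTokenizer()

# An example string modeled on the tweet shown above.
tweet = "@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering."

tokens = tokenizer.tokenize(tweet)
print(tokens)
# The @handle survives as one token, and the final period is split off.
```

Note that a plain whitespace split would leave the trailing period attached to the last word and could mangle user mentions, which is why a Twitter-aware tokenizer is preferable for this corpus.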