Word and sentence tokenization
We have dealt with word tokenization previously, but NLTK can perform it as well, along with sentence tokenization. Sentence tokenization is quite tricky, because English uses the period symbol for abbreviations and other purposes, not only to end a sentence. Thankfully, the sentence tokenizer is an instance of PunktSentenceTokenizer from the tokenize.punkt module of nltk, which is designed to handle such cases.
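To see why periods make sentence tokenization tricky, here is a minimal sketch that uses only the standard library. It is not how PunktSentenceTokenizer works internally: the ABBREVIATIONS set, the two function names, and the sample sentence are all hypothetical and chosen for illustration, whereas Punkt learns abbreviations from text rather than relying on a hand-written list.

```python
import re

# Hypothetical, hand-picked abbreviation list (an assumption for this sketch).
# Punkt learns abbreviations from data instead of hard-coding them.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s.", "e.g.", "i.e."}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def abbreviation_aware_split(text):
    """Rejoin naive pieces whenever a piece ends in a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        if not piece:
            continue  # empty input yields one empty piece; skip it
        buffer = f"{buffer} {piece}".strip()
        last_token = buffer.split()[-1].lower()
        if last_token not in ABBREVIATIONS:
            # The piece ends with a real sentence boundary.
            sentences.append(buffer)
            buffer = ""
    if buffer:
        # Text ended on an abbreviation; keep what was collected.
        sentences.append(buffer)
    return sentences


sample = "Dr. Smith moved to the U.S. in 1999. He liked it."
print(naive_split(sample))
print(abbreviation_aware_split(sample))
```

The naive version breaks the sample into four fragments, cutting after "Dr." and "U.S.", while the abbreviation-aware version returns the two intended sentences. A fixed list like this one fails on any abbreviation it does not contain, and it also merges sentences that genuinely end in an abbreviation, which is why a trained tokenizer is preferable in practice.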
Let's look at word tokenization using this code:
>>> import nltk
>>> # Loading the forbes data
>>> data = open('./Data/madmax_review/forbes.txt', 'r').read()
>>> word_data = nltk.word_tokenize(data)
>>> word_data[:15]
['Pundits', 'and', 'critics', 'like', 'to', 'blame', 'the', 'twin', 'successes', 'of', 'Jaws', 'and', 'Star', 'Wars', 'for']
Now, let's ...