Normalization

In order to carry out processing on natural language text, we need to perform normalization that mainly involves eliminating punctuation, converting the entire text into lowercase or uppercase, converting numbers into words, expanding abbreviations, canonicalization of text, and so on.

Eliminating punctuation

Sometimes, while tokenizing, it is desirable to remove punctuation. Removal of punctuation is considered one of the primary tasks while doing normalization in NLTK.

Consider the following example:

>>> text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."] >>> from nltk.tokenize import word_tokenize >>> tokenized_docs=[word_tokenize(doc) for doc in text] >>> print(tokenized_docs) [['It', ...

Get Natural Language Processing: Python and NLTK now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.