Normalization

Before natural language text can be processed, it usually needs to be normalized. Normalization mainly involves eliminating punctuation, converting the entire text to lowercase or uppercase, converting numbers into words, expanding abbreviations, canonicalizing the text, and so on.
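
As a minimal sketch of two of these steps (case conversion and abbreviation expansion), the snippet below uses only the Python standard library. The `ABBREVIATIONS` table and the `normalize` function are hypothetical names introduced for illustration; a real abbreviation list would depend on the domain.

```python
# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "approx.": "approximately"}

def normalize(text):
    """Lowercase the text, expand known abbreviations, and collapse whitespace."""
    # lower() folds case; split() with no arguments also drops extra whitespace
    words = text.lower().split()
    # Replace a word only if it appears in the abbreviation table
    expanded = [ABBREVIATIONS.get(word, word) for word in words]
    return " ".join(expanded)

print(normalize("  Dr. Smith lives on Main St.  "))
# doctor smith lives on main street
```

The lookup is a plain dictionary match on whitespace-separated words, so an abbreviation attached to other punctuation (for example `"St.,"`) would not be expanded by this sketch.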

Eliminating punctuation

It is sometimes desirable to remove punctuation during tokenization. Punctuation removal is one of the primary normalization tasks when working with NLTK.

Consider the following example:

>>> text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
>>> from nltk.tokenize import word_tokenize
>>> tokenized_docs=[word_tokenize(doc) for doc in text]
>>> print(tokenized_docs)
[['It', ...
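
One possible way to drop punctuation from token lists like these is sketched below. It is an illustration under stated assumptions, not necessarily the approach the book takes: the token lists are written by hand to resemble tokenizer output, and `strip_punctuation` is a hypothetical helper built on the standard library's `re` and `string.punctuation`.

```python
import re
import string

# Hand-written token lists resembling tokenizer output (an assumption,
# not the actual output of word_tokenize).
tokenized_docs = [
    ["It", "is", "a", "pleasant", "evening", "."],
    ["Guests", ",", "who", "came", "from", "US", "arrived", "at", "the", "venue"],
    ["Food", "was", "tasty", "."],
]

# Character class matching any single ASCII punctuation character
punctuation = re.compile("[%s]" % re.escape(string.punctuation))

def strip_punctuation(docs):
    """Remove punctuation characters from every token and drop empty tokens."""
    cleaned = []
    for doc in docs:
        tokens = [punctuation.sub("", token) for token in doc]
        # Tokens that consisted only of punctuation are now empty strings
        cleaned.append([token for token in tokens if token])
    return cleaned

print(strip_punctuation(tokenized_docs))
```

Note that this also strips punctuation inside tokens, so a token such as `"U.S."` would become `"US"`; whether that is desirable depends on the task.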

Natural Language Processing: Python and NLTK by Nitin Hardeniya, Jacob Perkins, Deepti Chopra, Nisheeth Joshi, Iti Mathur