28 Implementation of Tokenization in Natural Language Processing Using NLTK Module of Python

Vikash Kumar Mishra¹, Abhimanyu Dhyani²*, Sushree Barik³ and Tanish Gupta³

¹Galgotias University, Gautam Buddha Nagar, Uttar Pradesh, India

²Indian Institute of Technology, Jodhpur, Rajasthan, India

³Christ (Deemed to be) University, Delhi NCR, India

Abstract

Advances in technology now make it possible to analyze the large volume of unstructured text circulated online, using various tools and methods to track changes and to infer meaningful insights from the text data. This work aims to show how Python and its libraries can be used for text analytics. Natural language processing (NLP) is used to analyze and synthesize natural language and speech in Python.

Keywords: Natural language processing (NLP), natural language toolkit (NLTK), tokenization

28.1 Introduction

According to one estimate, only a small fraction of today’s data is structured. The unstructured remainder includes everyday communication such as speaking, tweeting, and sending messages through platforms such as WhatsApp, email, Facebook, Instagram, and text messaging. The most common format for this data is text, which is highly unstructured. To gain meaningful insights, we must analyze this text. One person with an Excel spreadsheet can do this manually, but at large scale the manual approach is time-consuming, inefficient, and inaccurate.
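As a rough illustration of the first step in automating such analysis, the sketch below splits a raw sentence into tokens. It is a minimal example under stated assumptions, not the chapter's own code: the sample sentence is invented, `wordpunct_tokenize` is chosen only because it is rule-based and needs no additional NLTK data download, and the regular-expression fallback is an assumption added so that the sketch still runs when NLTK is not installed.

```python
import re

try:
    # Rule-based NLTK tokenizer; it does not require downloading extra data.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback (assumption): a regular expression intended to approximate
    # the same word/punctuation split when NLTK is unavailable.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

# Hypothetical sample message standing in for unstructured online text.
message = "Text data is unstructured, so we analyze it with Python."

# Split the raw string into word and punctuation tokens.
tokens = wordpunct_tokenize(message)
print(tokens)
```

Once the text is a list of tokens, it can be counted, filtered, or passed to later processing steps, which is far easier to scale than reading each message by hand.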

In order to be able ...

Mathematics and Computer Science, Volume 1 by Sharmistha Ghosh, M. Niranjanamurthy, Krishanu Deyasi, Biswadip Basu Mallik, Santanu Das
