28 Implementation of Tokenization in Natural Language Processing Using NLTK Module of Python
Vikash Kumar Mishra¹, Abhimanyu Dhyani²*, Sushree Barik³ and Tanish Gupta³
¹Galgotias University, Gautam Buddha Nagar, Uttar Pradesh, India
²Indian Institute of Technology, Jodhpur, Rajasthan, India
³Christ (Deemed to be) University, Delhi NCR, India
Abstract
With advances in technology, it is now possible to analyze the large volume of unstructured text circulating online, using various tools and methods, to understand changes and to infer meaningful insights from the text data. The aim of this work is to understand how Python can be used for text analytics with the help of its various libraries. Natural language processing (NLP) is used to analyze and synthesize natural language and speech in Python.
Keywords: Natural language processing (NLP), natural language toolkit (NLTK), tokenization
28.1 Introduction
According to one estimate, only a small fraction of today’s data is structured. The rest includes everyday communication such as speaking, tweeting, and sending messages through various platforms, such as WhatsApp, email, Facebook, Instagram, and text messaging. The most common format for this data is text, which is highly unstructured. We must analyze the text data in order to gain meaningful insights. This can be done manually by one person with an Excel spreadsheet, but at large scale that approach is time-consuming, inefficient, and inaccurate.
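As a minimal sketch of what automating this first step might look like, the snippet below splits a short message into tokens with NLTK's `wordpunct_tokenize`, a regex-based tokenizer that needs no extra data downloads. The sample sentence is hypothetical and is not taken from the chapter. The fallback branch is also an assumption: it is meant to mirror the same regular expression so the example still runs where NLTK is not installed.

```python
import re

try:
    # NLTK's regex-based word/punctuation tokenizer (no corpus download needed).
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Assumed fallback: the same pattern, applied with the standard library.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

# Hypothetical sample message, for illustration only.
message = "Text data is unstructured. We must analyze it!"

# Split the raw string into word and punctuation tokens.
tokens = wordpunct_tokenize(message)
print(tokens)
# ['Text', 'data', 'is', 'unstructured', '.', 'We', 'must', 'analyze', 'it', '!']
```

Once the text is a list of tokens, it can be counted, filtered, or otherwise processed programmatically, which is what makes analysis at scale practical compared with manual inspection.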
In order to be able ...