3 Text Generation & Classification in NLP: A Review

Kuldeep Vayadande¹*, Dattatray Raghunath Kale², Jagannath Nalavade², R. Kumar³ and Hanmant D. Magar⁴

¹ Vishwakarma Institute of Technology, Pune, Maharashtra, India

² MIT Art Design and Technology University, Maharashtra, Pune, India

³ VIT-AP University, Inavolu, Beside AP Secretariat, Amaravati AP, India

⁴ Vishwakarma Institute of Information Technology, Kondhawa, Pune, Maharashtra, India

Abstract

The first stage in natural language processing is to break text down into separate tokens. When the text corpus is huge, covering every word is inefficient because the vocabulary becomes too large. The effectiveness of a given tokenization method depends on several factors, such as the size of the dataset, the nature of the task, and the morphological complexity of the dataset. Comparing the algorithms leads to the conclusion that no single tokenization technique is the best choice in every setting. This survey reviews various applications and compares the algorithms by evaluating them on classification tasks such as sentiment analysis. Question answering and translation applications use the available datasets. The survey also covers tokenization of noisy text data, compares how the various tokenization algorithms perform on such data, and discusses the average number of segmented subwords and accuracy. Basically, sentiment analysis studies the information in an expression and classifies them ...

Get How Machine Learning is Innovating Today's World now with the O’Reilly learning platform.
