© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Testas, Distributed Machine Learning with PySpark, https://doi.org/10.1007/978-1-4842-9751-3_14

14. Natural Language Processing with Pandas, Scikit-Learn, and PySpark

Abdelaziz Testas
Fremont, CA, USA

In this chapter, we move to a new area of machine learning: processing text data and applying an algorithm to it. This area, known as natural language processing (NLP), is used in many business applications, including speech recognition, chatbots, language translation, and email spam detection (classifying messages as ham or spam).

The project of this chapter is to examine the key steps involved in processing text data using an open ...
