Machine learning with Spark

At this point in the chapter, we have arrived at the main task of the job: creating a model to predict one or more attributes that are missing from the dataset. For this task, we can use machine learning models, and Spark can give us a big hand here.

MLlib is the Spark machine learning library; although it is built in Scala and Java, its functions are also available in Python. It contains classification, regression, and recommendation algorithms, routines for dimensionality reduction and feature selection, and many utilities for text processing. All of them are able to cope with huge datasets, and they use the power of all the nodes in the cluster to achieve their goal.

As of now, it's ...
