Mastering Large Datasets with Python

Chapter 10. Faster decision-making with machine learning and PySpark

This chapter covers

An introduction to machine learning
Training and applying decision tree classifiers in parallel with PySpark
Matching problems and appropriate machine learning algorithms
Training and applying random forest regressors with PySpark

Chapter 9 showed how we can write Python that takes advantage of Spark, one of the most popular distributed computing frameworks. We saw some of Spark’s raw data transformation options, and we used Spark in the map and reduce style we’ve been exploring throughout the book. However, one of the main reasons Spark is so popular is its built-in machine learning capabilities.

Machine learning refers to the design, training, application, ...

Mastering Large Datasets with Python by John Wolohan
