Chapter 6. Introducing the ML Package

In the previous chapter, we worked with Spark's MLlib package, which operates strictly on RDDs. In this chapter, we move to the ML package, which operates strictly on DataFrames. According to the Spark documentation, the DataFrame-based API contained in the spark.ml package is now the primary machine learning API for Spark.

So, let's get to it!


In this chapter, we will reuse a portion of the dataset we played with in the previous chapter. The data can be downloaded from

In this chapter, you will learn how to do the following:

  • Prepare transformers, estimators, and pipelines
  • Predict the chances of infant survival using ...
