Chapter 11. Managing the Machine Learning Lifecycle with MLflow

As machine learning gains prominence across industries and moves into production environments, the collaboration it requires and the complexity surrounding it have increased as well. Thankfully, platforms and tools have emerged to help manage the machine learning lifecycle in a structured manner. One such platform that works well with PySpark is MLflow. In this chapter, we will show how MLflow can be used with PySpark. Along the way, we’ll introduce key practices that you can incorporate into your data science workflow.

Rather than starting from scratch, we’ll build upon the work that we did in Chapter 4. We will revisit our decision tree implementation using the Covtype dataset. Only this time, we’ll use MLflow for managing the machine learning lifecycle.

We’ll start by explaining the challenges and processes that make up the machine learning lifecycle. We will then introduce MLflow and its components, and cover MLflow’s support for PySpark. Next comes an introduction to tracking machine learning training runs using MLflow, followed by managing machine learning models with MLflow Models. After that, we’ll discuss deployment of our PySpark model and implement it. We’ll end the chapter by creating an MLflow Project, which will show how we can make our work so far reproducible for collaborators. Let’s get started by discussing the machine learning lifecycle.

Machine Learning Lifecycle ...
