Chapter 6. Training Models with Spark MLlib

Now that you’ve learned about managing machine learning experiments, getting a feel for the data, and feature engineering, it’s time to train some models.

What does that involve exactly? Training a model is the process of adjusting model parameters so that the model’s performance improves. The idea is to feed your machine learning model training data that teaches it how to solve a specific task, for example, classifying an object in a photo as a cat by identifying its “cat” properties.
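To make "adjusting parameters so that performance improves" concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a one-parameter model, y = w * x, trained by gradient descent on mean squared error; the data, learning rate, and step count are illustrative choices, not values from this book.

```python
# Toy training loop: fit y = w * x by gradient descent on mean squared error.
# The data, learning rate, and step count are illustrative assumptions.

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, lr=0.01, steps=200):
    """Return the learned weight and the loss recorded after each update."""
    w = 0.0  # start from an uninformed parameter value
    losses = []
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # adjust the parameter to reduce the error
        losses.append(mse(w, xs, ys))
    return w, losses

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so w should approach 2
w, losses = train(xs, ys)
print(f"learned w = {w:.3f}, final loss = {losses[-1]:.6f}")
```

Each pass nudges the single parameter in the direction that lowers the error on the training data, which is the same idea that real training procedures apply to many parameters at once.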

In this chapter, you will learn how machine learning algorithms work, when to use which tool, how to validate your model, and, most importantly, how to automate the process with the Spark MLlib Pipelines API.

At a high level, this chapter covers the following:

  • Basic Spark machine learning algorithms

  • Supervised machine learning with Spark machine learning

  • Unsupervised machine learning with Spark machine learning

  • Evaluating your model and testing it

  • Hyperparameters and tuning your model

  • Using Spark machine learning pipelines

  • Persisting models and pipelines to disk

Algorithms

Let’s start with algorithms, the essential part of your model training activities. The input of a machine learning algorithm is sample data, and its output is a model. The algorithm’s goal is to generalize the problem and extract logic for making predictions and decisions without being explicitly programmed to do so. Algorithms can be based on statistics, ...
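The "sample data in, model out" relationship can be sketched in a few lines of plain Python. The names `fit_threshold` and `ThresholdModel` are hypothetical, invented for this illustration, and the learning rule (the midpoint between the two class means) is a deliberately simple stand-in for a real algorithm. In Spark MLlib the same shape appears as an `Estimator` whose `fit()` method returns a `Model`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThresholdModel:
    """The trained artifact: stores the learned parameter and makes predictions."""
    threshold: float

    def predict(self, x: float) -> int:
        # Assumes class 1 tends to have larger values than class 0.
        return 1 if x >= self.threshold else 0

def fit_threshold(samples):
    """The 'algorithm': takes labeled sample data and returns a model.

    Learns the midpoint between the mean of class 0 and the mean of class 1.
    """
    zeros = [x for x, label in samples if label == 0]
    ones = [x for x, label in samples if label == 1]
    mean0 = sum(zeros) / len(zeros)
    mean1 = sum(ones) / len(ones)
    return ThresholdModel(threshold=(mean0 + mean1) / 2)

# Illustrative sample data: (feature value, label) pairs.
samples = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
model = fit_threshold(samples)
print(model.threshold, model.predict(3.0), model.predict(7.5))
```

Nothing in `fit_threshold` hard-codes the decision boundary; it is derived from the samples, which is what "without being explicitly programmed" means here.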
