Chapter 10. Machine Learning with MLlib

Up until this point, we have focused on data engineering workloads with Apache Spark. Data engineering is often a precursory step in preparing your data for machine learning (ML) tasks, which will be the focus of this chapter. We live in an era in which machine learning and artificial intelligence applications are an integral part of our lives. Chances are that whether we realize it or not, every day we come into contact with ML models for purposes such as online shopping recommendations and advertisements, fraud detection, classification, image recognition, pattern matching, and more. These ML models drive important business decisions for many companies. According to a McKinsey study, 35% of what consumers purchase on Amazon and 75% of what they watch on Netflix is driven by machine learning–based product recommendations. Building a model that performs well can make or break companies.

In this chapter we will get you started building ML models using MLlib, the de facto machine learning library in Apache Spark. We’ll begin with a brief introduction to machine learning, then cover best practices for distributed ML and feature engineering at scale (if you’re already familiar with machine learning fundamentals, you can skip straight to “Designing Machine Learning Pipelines”). Through the short code snippets presented here and the notebooks available in the book’s GitHub repo, you’ll learn how to build basic ML models and use MLlib.
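To give a sense of what working with MLlib looks like before we dive into the details, here is a minimal sketch of the kind of pipeline the chapter builds toward: assembling raw columns into a feature vector and fitting a simple regression model. The toy DataFrame, its column names, and the app name are illustrative assumptions, not data from the book's repo.

# A minimal MLlib pipeline sketch (hypothetical toy data, not the book's dataset)
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-preview").getOrCreate()

# Made-up housing-style data: (bedrooms, sqft, price)
df = spark.createDataFrame(
    [(1, 600.0, 100000.0), (2, 800.0, 150000.0),
     (3, 1000.0, 200000.0), (4, 1200.0, 250000.0)],
    ["bedrooms", "sqft", "price"])

# Combine the input columns into a single feature vector,
# then fit a linear regression on the assembled features
assembler = VectorAssembler(inputCols=["bedrooms", "sqft"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="price")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(df)
model.transform(df).select("features", "price", "prediction").show()

The chapter unpacks each of these pieces: transformers like VectorAssembler, estimators like LinearRegression, and how the Pipeline API ties them together.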
