Chapter 24. Advanced Analytics and Machine Learning Overview

Thus far, we have covered fairly general data flow APIs. This part of the book dives deeper into the more specialized advanced analytics APIs available in Spark. Beyond large-scale SQL analysis and streaming, Spark also provides support for statistics, machine learning, and graph analytics. We refer to this set of workloads collectively as advanced analytics. The tools covered in this part include:

  • Preprocessing your data (cleaning data and feature engineering)

  • Supervised learning

  • Recommendation engines

  • Unsupervised learning

  • Graph analytics

  • Deep learning

This chapter offers a brief overview of advanced analytics, some example use cases, and a basic advanced analytics workflow. The chapters that follow cover the analytics tools just listed and show you how to apply them.


This book is not intended to teach you everything you need to know about machine learning from scratch. We won’t go into strict mathematical definitions and formulations, not because they are unimportant but simply because there is too much material to include. This part of the book is not an algorithm guide: it does not teach the mathematical underpinnings of every available algorithm or the in-depth implementation strategies used. Instead, the chapters included here serve as a guide for users, outlining what you need to know to use Spark’s advanced analytics APIs.

A Short Primer ...
