Live online training

Apache Spark ML First Steps

How to build your own machine learning model at scale

Topic: Data
Adi Polak

In order to create good products that leverage AI, you need to run machine learning algorithms on massive amounts of data. To do so, you can leverage existing distributed machine learning frameworks, such as Spark ML, which help simplify the development and use of large-scale machine learning.

Join expert Adi Polak for an introduction to Apache Spark and Spark ML. You’ll learn how the tools work under the hood and get hands-on experience as you use Spark ML to build a machine learning model and apply data mining techniques to an example use case: detecting bot text patterns on Twitter. You’ll also work with a real dataset that’s been filtered and annotated.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • Apache Spark basic architecture
  • Spark ML framework architecture
  • How to conduct data science research with Apache Spark and Spark ML, including machine learning algorithms, featurization, pipelines, persistence (saving and loading algorithms, models, and more), and utilities like statistics

And you’ll be able to:

  • Work with Apache Spark
  • Build your own machine learning models at scale
  • Continue improving your Twitter bot detector
  • Understand when Apache Spark + Spark ML is the right tool to use for machine learning training (as opposed to just using Apache Spark for data prep)

This training course is for you because...

  • You’re an engineer interested in machine learning or a data scientist interested in machine learning at scale.
  • You work with Apache Spark and Spark ML.
  • You want to better understand how to implement machine learning at scale.

Prerequisites


  • A computer with Python 3.0 and Docker installed, and the Twitter bot dataset downloaded. (The course will use Docker Engine, Community Edition, version 19.03.5.)
  • A basic understanding of Python and Docker
  • Familiarity with the Jupyter Notebook

Recommended preparation: Before class, run the terminal command below for your operating system to start a PySpark notebook on your machine. The command downloads the jupyter/pyspark-notebook image and runs a Jupyter Notebook stack with Python, Spark, and Mesos.

Mac: docker run -it -p 8888:8888 jupyter/pyspark-notebook

Linux: sudo docker run -it -p 8888:8888 jupyter/pyspark-notebook
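
Once the container is running and the notebook is open in your browser, a quick sanity check can confirm that Spark is usable. The following is a minimal sketch, assuming it runs in a notebook cell inside the jupyter/pyspark-notebook container; the function name `check_spark` and the app name `setup-check` are illustrative and not part of the course material.

```python
def check_spark():
    """Start (or reuse) a local SparkSession and run a trivial job."""
    # Imported lazily so the function can be defined even where PySpark
    # is not installed; the import only runs when the check is called.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("setup-check").getOrCreate()

    # spark.range(5) builds a one-column DataFrame with the values 0..4;
    # counting its rows forces Spark to actually execute a job.
    row_count = spark.range(5).count()
    return spark.version, row_count
```

Calling `check_spark()` in a cell should return the Spark version string together with a row count of 5 if the stack started correctly.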

About your instructor

  • Adi Polak is a Senior Software Engineer and Developer Advocate in the Azure Engineering organization at Microsoft. Her work focuses on microservices architecture, distributed systems, real-time processing, big data analysis and machine learning. Her advocacy work focuses on bringing her vast industry research & engineering experience to bear in helping teams design, architect and build cost-effective software and infrastructure solutions that emphasize scalability, team expertise and business market fit.

Schedule


The timeframes are only estimates and may vary according to how the class is progressing.

Introduction to bot data (25 minutes)

  • Presentation: What is a bot?; introduction to classified real Twitter bot data
  • Q&A

Apache Spark basics (10 minutes)

  • Presentation: Apache Spark architecture

Intro to data cleaning and preparation (40 minutes)

  • Presentation: The ML lifecycle; how it’s done with Spark DataFrames
  • Hands-on exercises: Load the data into a Spark DataFrame; filter the DataFrame; query the DataFrame using SQL to get a feel for the data
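
As a rough preview of the load, filter, and query steps in this segment, a PySpark sketch might look like the following. The file name `bots.csv`, the column names `text` and `label`, and the view name `tweets` are assumptions made for illustration; the actual course dataset may be laid out differently.

```python
# Hypothetical query: how many rows carry each label (e.g. bot vs. not bot)?
LABEL_COUNT_QUERY = "SELECT label, COUNT(*) AS n FROM tweets GROUP BY label"


def explore_bot_data(spark, path="bots.csv"):
    """Load, filter, and query a tweet dataset.

    `spark` is an existing pyspark.sql.SparkSession. The default path and
    the `text` / `label` column names are placeholders, not course values.
    """
    # Load the CSV into a Spark DataFrame, inferring column types from the data.
    df = spark.read.csv(path, header=True, inferSchema=True)

    # Filter: keep only rows that actually contain tweet text.
    cleaned = df.filter("text IS NOT NULL")

    # Register a temporary view so the DataFrame can be queried with SQL.
    cleaned.createOrReplaceTempView("tweets")
    counts = spark.sql(LABEL_COUNT_QUERY)
    return cleaned, counts
```

In a notebook, `spark` would typically come from `SparkSession.builder.getOrCreate()`, and calling `counts.show()` on the second return value prints the per-label row counts.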

Break (5 minutes)

Apache Spark ML: Create a train and test set (15 minutes)

  • Presentation: Machine learning train and test sets
  • Hands-on exercise: Learn to divide the data
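
One common way to divide a Spark DataFrame into train and test sets is `DataFrame.randomSplit`. The sketch below assumes an 80/20 split and a fixed seed; both values, and the helper name `split_train_test`, are illustrative choices rather than the ones used in class.

```python
def split_train_test(df, train_fraction=0.8, seed=42):
    """Randomly split a Spark DataFrame into (train, test) DataFrames.

    `df` is expected to be a pyspark.sql.DataFrame. The default fraction
    and seed are arbitrary illustration values.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1 (exclusive)")

    # randomSplit takes a list of weights and returns one DataFrame per weight.
    # A fixed seed makes the split repeatable on the same input data.
    train_df, test_df = df.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )
    return train_df, test_df
```

Note that `randomSplit` assigns rows randomly, so the resulting sizes only approximate the requested fractions, which is most noticeable on small datasets.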

Run Spark ML and create machine learning models over Twitter data (35 minutes)

  • Presentation: Spark ML + code examples
  • Hands-on exercises: Build a simple ML model using Spark ML; run the classifier and get a feel for the classifier results
  • Q&A
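
To give a sense of what building and running a simple classifier can look like in Spark ML, here is a hedged sketch using logistic regression. The choice of algorithm, the `features` and `label` column names, and the `maxIter` value are assumptions; the course may use a different classifier or setup. The sketch also assumes the input DataFrames already contain a numeric `label` column and a vector `features` column.

```python
def train_and_predict(train_df, test_df):
    """Fit a logistic regression model and score the test set.

    Both arguments are pyspark.sql.DataFrames with a vector column named
    `features` and a numeric column named `label` (assumed names).
    """
    # Imported lazily so the function can be defined without PySpark installed.
    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

    # fit() trains on the training set and returns a fitted model.
    model = lr.fit(train_df)

    # transform() appends prediction columns to the test set.
    predictions = model.transform(test_df)
    return model, predictions.select("label", "prediction", "probability")
```

Calling `.show(5)` on the returned predictions is a simple way to compare the true label, the predicted label, and the predicted class probabilities side by side.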

Break (5 minutes)

Introduction to evaluating ML models and using pipelines (40 minutes)

  • Presentation: The machine learning cycle and Spark ML pipelines
  • Hands-on exercise: Create your Spark ML pipeline; create code to evaluate your ML model; connect everything together: bot data, your Spark ML pipeline, and evaluation
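
The sketch below shows one way a Spark ML pipeline and an evaluation step can be connected for text data. The stages chosen here (`Tokenizer`, `HashingTF`, `LogisticRegression`), the `text` and `label` column names, the feature count, and the area-under-ROC metric are all assumptions for illustration, not necessarily what the class builds.

```python
def build_text_pipeline():
    """Assemble an unfitted pipeline: tokenize -> hash to features -> classify."""
    # Imported lazily so the functions can be defined without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    # Split the raw tweet text into lowercase, whitespace-separated words.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    # Map the words to a fixed-length term-frequency vector.
    hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=4096)
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
    return Pipeline(stages=[tokenizer, hashing_tf, lr])


def fit_and_evaluate(train_df, test_df):
    """Fit the pipeline on train_df and return (model, area under ROC on test_df)."""
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Fitting the pipeline runs every stage in order and yields a PipelineModel.
    model = build_text_pipeline().fit(train_df)
    predictions = model.transform(test_df)

    evaluator = BinaryClassificationEvaluator(
        labelCol="label",
        rawPredictionCol="rawPrediction",
        metricName="areaUnderROC",
    )
    return model, evaluator.evaluate(predictions)
```

Keeping featurization and the classifier in a single pipeline means the same transformations are applied to the training and test data. A fitted pipeline can also be persisted with `model.write().overwrite().save(path)` and restored with `PipelineModel.load(path)`, where `path` is a location of your choosing.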

Wrap-up and final Q&A (5 minutes)