O'Reilly Live Online Training

Introduction to Apache Spark 2.x

Introduction to Spark for Data Processing, Analytics, and Machine Learning

Adam Breindel

Learn modern best practices, using the latest Spark features, for high-performance analytics, processing, and modeling on large-scale data sets. The course uses elementary Scala and is accessible to anyone with basic Scala or Python knowledge. It introduces the broad functionality of Spark 2.1 through examples and hands-on activities that you follow along with in a notebook environment.

What you'll learn, and how you can apply it

  • How Spark executes queries and jobs over heterogeneous, distributed data
  • How Spark applications and clusters operate
  • Parallel data processing
  • How Spark analyzes queries or computations and executes them in a distributed cluster
  • Using the newest Spark APIs, features, and best practices, which are missing from much of the online Spark material (most of which is based on earlier versions of Spark)

Participants will be able to:

  • Author data processing and transformation scripts
  • Query and analyze data
  • Train, evaluate, and deploy machine learning (predictive analytics) models

This training course is for you because...

  1. You are a data analyst with a SQL background and you need to implement reports or analytic queries over large, heterogeneous datasets.
  2. You are a data engineer with a programming or scripting background and you need to plan or operate data processing clusters and pipelines.
  3. You are a data scientist with a background in Python and you need to train models on large-scale datasets, or apply an existing model to large datasets.


Prerequisites

  • Elementary programming skill in Scala or Python
  • Basic familiarity with the Java Virtual Machine (JVM) is helpful but not required
  • Previous knowledge of Spark is not necessary

Materials and setup instructions

Attendees will need to create a free (optionally anonymous) Databricks account (sign up at tinyurl.com/databricks-ce). From the location where they plan to attend class (e.g., a work computer), they will need to be able to access that account, imgur.com, and at least one of box.com, dropbox.com, or Google Drive.

About your instructor

  • Adam Breindel consults and teaches widely on Apache Spark and other technologies. Adam's experience includes work with banks on neural-net fraud detection, streaming analytics, cluster management code, and web apps, as well as development at a variety of startup and established companies in the travel, productivity, and entertainment industries. He is excited by the way that Spark and other modern big-data tech remove so many old obstacles to system design and make it possible to explore new categories of interesting, fun, hard problems.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Day 1:

  • Welcome, Intro to Spark (15 minutes)
  • Spark, Distributed Data Basics (45 minutes)
  • Programming Spark with SQL, DataFrame, and Dataset APIs (2 hours)

Day 2:

  • Identifying some problems and fixes using the Spark Web UI (30 minutes)
  • RDDs vs. DataFrame/Dataset (30 minutes)
  • Spark Streaming Basics (1 hour 30 minutes)

Day 3:

  • Spark Machine Learning Intro (1 hour 30 minutes)
  • Spark clustering and deployment options, Q&A (1 hour)