Live Online Training

Spark 3.0 First Steps

Get started with analytics, ETL, streaming, machine learning, and graph compute with Apache Spark

Topic: Data
Adam Breindel

Apache Spark supports large-scale querying and processing of data for reporting, analysis, and ETL; stream processing for real-time applications; and machine learning, all through a single set of abstractions and APIs with no additional integration work.

Expert Adam Breindel walks you through the key use cases for Apache Spark, with a focus on performance and best practices appropriate to version 3.0. Join in to master the Spark core constructs and best practices that will help you plan and deliver effective solutions.

What you'll learn, and how you can apply it

By the end of this live online course, you’ll understand:

  • The components of Spark and how they work together
  • How Spark handles data-parallel computation, including partitioning and shuffling data
  • The latest APIs and new performance-enhancing infrastructure in Spark 3.0

And you’ll be able to:

  • Implement analytics, streaming, ETL, and ML jobs in Spark
  • Use the Spark UI to understand the parallelism and performance of your jobs
  • Plan Spark deployments, whether a single cluster for one task or a whole platform

This training course is for you because...

  • You’re a data engineer, data analyst, or data scientist.
  • Your company relies on Apache Spark and/or Hadoop for large-scale processing.
  • You want to learn the easiest, most performant way to get results from Spark.

Prerequisites

  • A Databricks account
  • A basic understanding of machine learning, Python, and SQL (knowledge of Scala or Java is useful but not required)
  • Spark experience not required

About your instructor

  • Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam's first full-time job in tech was neural-network-based fraud detection, deployed at North America's largest banks; since then, he's worked with numerous startups, building things like mobile check-in for two of America's five biggest airlines years before the iPhone came out. He's also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.

Schedule

The timeframes are estimates only and may vary according to how the class progresses.

Introduction (60 minutes)

  • Presentation: Apache Spark’s purpose and history; latest news; the meaning (and implications) of Spark as a distributed data-parallel compute engine; the main players in Spark—the driver, executors, and the cluster manager
  • Hands-on exercises: Set up an account with Databricks Community Edition, load the course notebooks, and create your first Spark cluster; query data with pure SQL, the easiest way to start processing data hands-on; examine data with Spark SQL (see the sketch after this list)
  • Q&A
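
To give a feel for the pure-SQL workflow covered in this segment, here is a minimal sketch in PySpark, assuming a Spark session (as in a Databricks notebook, where spark already exists); the file path, view name, and column names are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # Register a file-backed DataFrame as a temporary view so SQL can see it
    flights = spark.read.option("header", True).csv("/data/flights.csv")
    flights.createOrReplaceTempView("flights")

    # Pure SQL: Spark plans and runs the query in parallel across the cluster
    spark.sql("""
        SELECT origin, COUNT(*) AS departures
        FROM flights
        GROUP BY origin
        ORDER BY departures DESC
    """).show()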

Break (5 minutes)

Spark DataFrames (50 minutes)

  • Presentation: Visualizing data for analysis; temporary views; how Spark divides up big datasets; partitioning basics
  • Demo and hands-on exercises: Query data using the object-oriented (non-SQL) API; explore Spark’s idea of DataFrames and columns; read and write different data formats (CSV, Parquet, etc.); break and fix parallelism (see the sketch after this list)
  • Q&A
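
The following sketch shows the shape of this segment in the DataFrame API; the paths and the delay/origin columns are hypothetical, and the partition count is illustrative rather than a tuning recommendation.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    # Read a CSV with a header row and let Spark infer column types
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("/data/flights.csv"))

    # Column expressions replace SQL text in the object-oriented API
    delayed = (df.where(F.col("delay") > 15)
                 .groupBy("origin")
                 .agg(F.avg("delay").alias("avg_delay")))

    # Inspect and adjust partitioning: too few partitions limits parallelism
    print(delayed.rdd.getNumPartitions())
    delayed = delayed.repartition(8)

    # Parquet is columnar and keeps the schema, unlike CSV
    delayed.write.mode("overwrite").parquet("/data/delayed_by_origin")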

Break (10 minutes)

Understanding job execution and the Spark application monitoring web UI (60 minutes)

  • Presentation: Applications and jobs; stages and tasks; using the stage detail UI page to understand performance; tuning guidelines
  • Hands-on exercises: Examine jobs in the Spark web UI; navigate the Spark UI Storage, Stages, and Tasks pages (a sketch that triggers a two-stage job follows this list)
  • Q&A
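
As a starting point for exploring the UI, this sketch runs a small job whose groupBy forces a shuffle, so it shows up as two stages; the dataset size and bucket count are arbitrary.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ui-demo").getOrCreate()

    # The groupBy forces a shuffle, so this job splits into two stages
    df = spark.range(10_000_000)
    counts = df.groupBy((F.col("id") % 10).alias("bucket")).count()
    counts.collect()  # the action that actually launches the job

    # While (and after) the job runs, inspect it in the Spark web UI:
    # locally it usually serves at http://localhost:4040; on Databricks,
    # open the cluster's "Spark UI" tab and drill into Jobs > Stages > Tasks.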

Break (10 minutes)

Feature preparation and machine learning with Spark ML (60 minutes)

  • Presentation: Understanding the Spark ML approach; using the docs; visualizing data for ML; preparing data features for machine learning, using Spark ML Transformers and Estimators; training ML models; validation and tuning with Spark ML; model deployment tips
  • Hands-on exercises: Sample and visualize data; visualize distributions over categorical features and encode them; create a Spark ML model (see the pipeline sketch after this list)
  • Q&A
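
To illustrate the Transformer/Estimator pattern this segment covers, here is a minimal pipeline sketch; it assumes a hypothetical DataFrame df with a categorical column "origin", a numeric column "distance", and a label column "delay".

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # Transformers encode the categorical feature; the Estimator fits a model.
    # Assumes a DataFrame df with columns "origin", "distance", and "delay".
    indexer = StringIndexer(inputCol="origin", outputCol="origin_idx")
    encoder = OneHotEncoder(inputCols=["origin_idx"], outputCols=["origin_vec"])
    assembler = VectorAssembler(inputCols=["origin_vec", "distance"],
                                outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="delay")

    # A Pipeline chains the stages so fit/transform run them in order
    pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    model = pipeline.fit(train)          # fit() returns a PipelineModel
    predictions = model.transform(test)  # adds a "prediction" column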

Break (5 minutes)

Continuous applications with Spark Structured Streaming (40 minutes)

  • Presentation: Challenges and solutions with streaming data; stateful aggregation and performance
  • Hands-on exercise: Write streaming jobs with the DataFrame API; achieve exactly-once processing with data sinks (see the sketch after this section)
  • Group discussion: Achieving at-least-once processing with data sources
  • Q&A
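
Finally, a sketch of the streaming shape this segment works toward: the source directory, schema, and checkpoint path are hypothetical, and the in-memory sink is for demonstration only (end-to-end exactly-once requires a replayable source plus a transactional sink such as files or Kafka).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    # Treat files landing in a directory as an unbounded stream
    stream = (spark.readStream
              .schema("origin STRING, delay DOUBLE, ts TIMESTAMP")
              .json("/data/incoming"))

    # Stateful aggregation: Spark maintains a running average per key
    avg_delay = stream.groupBy("origin").agg(F.avg("delay").alias("avg_delay"))

    # A checkpoint plus a transactional sink is what yields end-to-end
    # exactly-once processing when the source is replayable
    query = (avg_delay.writeStream
             .outputMode("complete")
             .format("memory")  # demo sink; real jobs use files/Kafka/etc.
             .queryName("avg_delay")
             .option("checkpointLocation", "/chk/avg_delay")
             .start())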