Live online training

Spark 3.0 First Steps

Get Started with Analytics, ETL, Streaming, Machine Learning, and Graph Compute with Apache Spark

Topic: Data
Adam Breindel

Apache Spark allows large-scale querying and processing of data for reporting, analysis, and ETL purposes; stream processing for real-time applications; and machine learning, all using a single set of abstractions and APIs without further integration.

As Spark matures, it is essential to master its core constructs and best practices in order to plan and deliver effective solutions. For example, Spark is storage agnostic: it can process data from many storage locations and in many formats, but each specific choice has significant implications. This course is a comprehensive overview of the key use cases for Apache Spark, with a focus on performance and best practices appropriate to version 3.0.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • The components of Spark and how they work together
  • How Spark handles data-parallel computation, including partitioning and shuffling data
  • The latest APIs and new performance-enhancing infrastructure in Spark 3.0

And you’ll be able to:

  • Code analytics, streaming, ETL, and ML jobs for Spark
  • Use the Spark UI to understand the parallelism and performance of your jobs
  • Plan Spark deployments, whether a single cluster for one task, or a whole platform

This training course is for you because...

  • You are a data engineer, data analyst, or data scientist
  • Your company relies on Apache Spark and/or Hadoop for large-scale processing
  • You want to focus on the easiest, most performant way to get results from Spark


Useful, but not strictly required:

  • Familiarity with the basics of Python, SQL, and either Scala or Java
  • Basics of machine learning
  • Previous exposure to Spark is not necessary

Recommended preparation:

Recommended follow-up:

About your instructor

  • Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam's first full-time job in tech was neural net–based fraud detection, deployed at North America's largest banks; since then, he's worked with numerous startups, where he's enjoyed building things like mobile check-in for two of America's five biggest airlines years before the iPhone came out. He's also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.


The timeframes are only estimates and may vary according to how the class is progressing.