Stream processing with Apache Spark

Mastering structured streaming

Topic: Data
Gerard Maas

Stream processing allows you to analyze and extract value from data as soon as it’s available. Businesses can observe changes as they happen and react to them, turning data into actionable insights and, ultimately, a competitive advantage.

Apache Spark is a unified analytics engine that offers both batch and streaming capabilities and supports a polyglot approach to data analytics through APIs in Scala, Java, Python, and R. Join expert Gerard Maas to learn how to apply Apache Spark’s streaming capabilities to extract value from the streams of data available today.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How the streaming Dataset abstraction and API let you operate on streaming data
  • What a streaming source is and how it helps you obtain data from a streaming data producer
  • What a streaming sink is and how it helps you produce data to other systems
  • The importance of event time, how to use it in aggregations, and its requirements and limitations
  • How to use the Stateful Processing API to create arbitrary aggregations over a stream
  • How to use Spark ML to apply machine learning (ML) models to a data stream

And you’ll be able to:

  • Write structured streaming jobs that apply your business logic to streaming data by transforming and aggregating the data
  • Read and write to Kafka as a streaming backend
  • Load and apply a pretrained ML model to a data stream to score the data

This training course is for you because...

  • You’re a data engineer who wants to move workloads to a streaming model.
  • You work with Spark and want to better understand its streaming capabilities.
  • You want to develop streaming superpowers as a data engineer.

Prerequisites

  • A basic understanding of Scala
  • General knowledge of Spark SQL (e.g., Datasets and DataFrames)

About your instructor

  • Gerard Maas is a Principal Engineer at Lightbend, where he contributes to the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He is the author of Stream Processing with Apache Spark from O’Reilly. Gerard is a frequent speaker at conferences and meetups. He likes to contribute to small and large open-source projects.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction to stream processing with Apache Spark (25 minutes)

  • Lecture: Course overview; the general streaming model
  • Group discussion: Your background, programming experience, and exposure to data processing and streaming
  • Q&A

Building your intuition: Batch versus stream processing (30 minutes)

  • Lecture: Moving a batch process into a streaming application
  • Hands-on exercise: Complete the batch versus streaming analytics notebook
  • Q&A
  • Break (5 minutes)

Core components of the Structured Streaming API (35 minutes)

  • Lecture: The Structured Streaming API; sources; data processing, transformations, and joins; sinks
  • Exercise: Build an IoT stream processing pipeline using Kafka and structured streaming
  • Q&A

Stateful computations in structured streaming (40 minutes)

  • Lecture: Stateful operations, requirements, and challenges; support for event time, window functions, and other built-in stateful operations
  • Hands-on exercise: Implement stateful aggregations
  • Q&A
  • Break (5 minutes)

Applying ML models in structured streaming (40 minutes)

  • Lecture: Integrating Spark ML with structured streaming; using a pretrained ML model to score a data stream and predict conditions based on the new data
  • Hands-on exercise: Apply ML techniques to predict room occupancy
  • Q&A