Mastering Spark for Structured Streaming

Video description

Spark is one of today’s most popular distributed computation engines for processing and analyzing big data. This course gives data engineers, data scientists, and data analysts interested in data streaming practical experience with Spark. You’ll learn about the Spark Structured Streaming API, the powerful Catalyst query optimizer, the Tungsten execution engine, and more in this hands-on course, where you’ll build several small applications that draw on these aspects of Spark 2.0. Scala experience is not required, but the course works best for those who have some.

  • Understand the main features of Spark and its advantages over existing systems
  • Learn the basics of parallelism, streaming computation, and Spark streaming
  • Explore the distinctions between Spark Structured Streaming and legacy DStream APIs
  • Understand how to write applications that use the Spark Structured Streaming API
  • Learn about the new Catalyst query optimizer and the Tungsten execution engine
  • Discover how Scala and Spark Structured Streaming simplify distributed streaming tasks
  • Gain hands-on experience building applications using Spark 2.0

Michael Li is the founder of The Data Incubator, which provides big data corporate training and a selective eight-week fellowship for PhDs transitioning into industry. Previously, he worked as a data scientist, software engineer, and researcher at Foursquare, Google, Andreessen Horowitz, J.P. Morgan, and NASA. He is a regular contributor to VentureBeat, The Next Web, and Harvard Business Review. Michael earned his Ph.D. at Princeton and was a Marshall Scholar in Cambridge.

Table of contents

  1. Overview
    1. Overview 00:02:06
  2. Spark Datasets and Structured Streaming
    1. Spark Overview 00:02:11
    2. Spark Wordcount Using RDD Example 00:05:01
    3. Spark Wordcount Using Scala Example 00:02:37
    4. Spark and Datasets 00:01:56
    5. Spark Wordcount Using Datasets Example 00:03:06
    6. Joining Data Using Spark Datasets 00:03:32
    7. Structured Streaming Overview 00:03:18
    8. Spark Structured Streaming Wordcount Example 00:03:20
  3. Spark Structured Streaming
    1. Spark Structured Streaming 00:00:46
    2. Netcat Socket Structured Streaming Example 00:02:27
    3. Socket Structured Streaming Example 00:02:55
    4. Spark Structured Streaming Parsing Data 00:02:56
    5. Constructing Columns in Structured Streaming 00:02:47
    6. Selecting and Filtering Columns Using Structured Streaming 00:02:07
    7. GroupBy and Aggregation in Structured Streaming 00:03:33
    8. Joining Structured Stream with Datasets 00:03:39
    9. SQL Queries in Spark Structured Streaming 00:02:19
  4. DStream Comparison
    1. Comparing Structured Streaming with DStream 00:03:39
    2. Custom Receivers in Spark DStream 00:02:18
    3. Iterative Wordcount Using Spark DStream 00:03:30
    4. Cumulative Wordcount Using Spark DStream 00:06:31
    5. Benefits of Spark Tungsten 00:04:43
    6. Tungsten Performance Benefit Demonstration 00:02:58
    7. Benefits of Spark Catalyst 00:03:18
    8. Viewing Query Plans in Spark Shell 00:01:36
    9. Visualizing Query Stages in Spark UI Viewer 00:00:51
    10. Viewing Spark Catalyst-Optimized Physical Plans 00:02:56
  5. Standalone Spark Streaming Applications
    1. Writing Standalone Spark Streaming Applications 00:01:03
    2. Two Environments for Running Spark 00:01:57
    3. Spark Streaming Standalone Code - Meetup Events Example 00:07:37
    4. Scala Build Tool (SBT) and Spark 00:06:01
    5. Compiling and Building a Standalone Spark Application 00:04:29
    6. Spark Twitter Streaming Example 00:03:54

Product information

  • Title: Mastering Spark for Structured Streaming
  • Author(s): Tianhui Michael Li
  • Release date: November 2016
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491974438