Troubleshooting Apache Spark

Video Description

Quick, simple solutions to common development issues, plus debugging techniques for Apache Spark.

About This Video

  • Optimize resources and costs by utilizing Spark's speed
  • Troubleshoot the Spark execution DAG by exploring Spark logical and physical query plans to perform the same logic on fewer executors and machines
  • Fix slow-running jobs and speed up feedback loops by creating efficient transformations and joins using Spark APIs

In Detail

Apache Spark has been around for quite some time, but do you really know how to solve the development issues and problems you face with it? This course opens up new possibilities and covers many aspects of Apache Spark, some you may know and some you probably never knew existed. If learning and performing tasks on Spark takes you a lot of time, you cannot leverage Apache Spark's full capabilities and features, and you hit a roadblock in your development journey. Common problems and bugs keep you from optimizing your development process, and you end up looking for techniques that can save you from pitfalls and common errors during development. With this course you'll learn to implement practical, proven, and properly researched techniques to improve particular aspects of Apache Spark.

To troubleshoot effectively, you need to understand the common problems and issues Spark developers face, collate them, and build simple solutions for them. One way to find common issues is to look at Stack Overflow questions. This is a high-quality troubleshooting course that highlights issues developers face at different stages of application development and provides simple, practical solutions to them. Beyond supplying solutions to these problems and challenges, the course also focuses on discovering new possibilities with Apache Spark. By the end of this course, you will be able to solve your Spark problems without any hassle.

All the code and supporting files for this course are available on GitHub at

Table of Contents

  1. Chapter 1: Common Problems and Troubleshooting the Spark Distributed Engine
    1. The Course Overview 00:03:00
    2. Eager Computations: Lazy Evaluation 00:04:56
    3. Caching Values: In-Memory Persistence 00:06:39
    4. Unexpected API Behavior: Picking the Proper RDD API 00:04:24
    5. Wide Dependencies: Using Narrow Dependencies 00:04:54
  2. Chapter 2: Distributed DataFrames Optimization Pitfalls
    1. Making Computations Parallel: Using Partitions 00:05:42
    2. Defining Robust Custom Functions: Understanding User-Defined Functions 00:04:43
    3. Logical Plans Hiding the Truth: Examining the Physical Plans 00:05:59
    4. Slow Interpreted Lambdas: Code Generation Spark Optimization 00:04:18
  3. Chapter 3: Distributed Joins in Cluster
    1. Avoid Wrong Join Strategies: Using a Join Type Based on Data Volume 00:07:24
    2. Slow Joins: Choosing an Execution Plan for Join 00:05:37
    3. Distributed Joins Problem: DataFrame API 00:05:08
    4. TypeSafe Joins Problem: The Newest DataSet API 00:03:40
  4. Chapter 4: Solving Problems with Non-Efficient Transformations
    1. Minimizing Object Creation: Reusing Existing Objects 00:05:04
    2. Iterating Transformations – The mapPartitions() Method 00:03:49
    3. Slow Spark Application Start: Reducing Setup Overhead 00:04:38
    4. Performing Unnecessary Recomputation: Reusing RDDs 00:04:07
  5. Chapter 5: Troubleshooting Real-Time Processing Jobs in Spark Streaming
    1. Repeating the Same Code in Stream Pipeline: Using Sources and Sinks 00:06:34
    2. Long Latency of Jobs: Understanding Batch Internals 00:05:04
    3. Fault Tolerance: Using Data Checkpointing 00:03:24
    4. Maintaining Batch and Streaming: Using Structured Streaming Pros 00:04:03

Product Information

  • Title: Troubleshooting Apache Spark
  • Author(s): Tomasz Lelek
  • Release date: November 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781789805253