video

Debugging Apache Spark

by Holden Karau

November 2018

Intermediate

2h 26m

English

O'Reilly Media, Inc.

Closed Captioning available in German, English, Spanish, French, Japanese, Korean, Portuguese (Portugal, Brazil), Chinese (Simplified), Chinese (Traditional)

Overview

Apache Spark is an extremely powerful general-purpose distributed system that also happens to be extremely difficult to debug. This video, designed for intermediate-level Spark developers and data scientists, looks at some of the most common (and baffling) ways Spark can explode (e.g., out-of-memory exceptions, unbalanced partitioning, strange serialization errors, and errors inside your own code) and then provides a set of remedies for keeping those blow-ups under control. You'll pick up techniques for improving your own logging (and reducing your dependence on Spark's verbose logs); learn how to deal with fuzzy data; discover how to connect and use a debugger in a distributed environment; and learn to tell which Spark error messages are actually relevant.

Understand why Spark is difficult to debug, the types of Spark failures, and how to recognize them
Explore the differences between debugging single-node and distributed systems
Learn the best debugging techniques for Spark and a framework for debugging

Holden Karau is an open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. She is an in-demand speaker at O'Reilly Media's Strata + Hadoop conferences, a committer on the Apache Spark, SystemML, and Mahout projects, and the author of multiple O'Reilly titles including High Performance Spark and Learning Spark. She holds a bachelor's degree in math and computer science from the University of Waterloo.

Publisher Resources

ISBN: 9781492039174