Apache Spark is an extremely powerful general purpose distributed system that also happens to be extremely difficult to debug. This video, designed for intermediate-level Spark developers and data scientists, looks at some of the most common (and baffling) ways Spark can explode (e.g., out of memory exceptions, unbalanced partitioning, strange serialization errors, debugging errors inside your own code, etc. ) and then provides a set of remedies for keeping those blow-ups under control. You'll pick up techniques for improving your own logging (and reducing your dependence on Spark's verbose logs); learn how to deal with fuzzy data; discover how to connect and use a debugger in a distributed environment; and gain the ability to know which Spark error messages are actually relevant.
Holden Karau is an open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. She is an in-demand speaker at O'Reilly Media's Strata + Hadoop conferences, a committer on the Apache Spark, SystemML, and Mahout projects, and the author of multiple O'Reilly titles including High Performance Spark and Learning Spark. She holds a bachelor's degree in math and computer science from the University of Waterloo.