11 Faster PySpark: Understanding Spark’s query planning

This chapter covers

  • How Spark uses CPU, RAM, and hard drive resources
  • Using memory resources better to speed up (or avoid slowing down) computations
  • Using the Spark UI to review useful information about your Spark installation
  • How Spark splits a job into stages and how to profile and monitor those stages
  • Classifying transformations as narrow or wide, and how to reason about each
  • Using caching judiciously and avoiding the performance drops that improper caching can cause

Imagine the following scenario: you write a readable, well-thought-out PySpark program. You submit it to your Spark cluster, and it runs. You wait.

How can we peek under the hood and see the progression ...
