Appendix A. Deeper into Spark

Understanding Spark at the level of transformations, actions, and RDDs is vital for writing Spark programs. Understanding Spark’s underlying execution model is vital for writing good Spark programs—for making sense of their performance characteristics, for debugging failures and slowness, and for interpreting the user interface.

A Spark application consists of a driver process, which in spark-shell’s case, is the process that the user is interacting with, and a set of executor processes scattered across nodes on the cluster. The driver is in charge of the high-level control flow of work that needs to be done. The executor processes are responsible for executing this work, in the form of tasks, as well as for storing any data that the user chooses to cache. Both the driver and the executors typically stick around for the entire time the application is running. A single executor has a number of slots for running tasks, and will run many concurrently throughout its lifetime.

At the top of the execution model are jobs. Invoking an action inside a Spark application triggers the launch of a Spark job to fulfill it. To decide what this job looks like, Spark examines the graph of RDDs that the action depends on and formulates an execution plan that starts with computing the farthest back RDDs and culminates in the RDDs required to produce the action’s results. The execution plan consists of assembling the job’s transformations into stages. A stage ...

Get Advanced Analytics with Spark now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.