Spark architecture

A Spark cluster is a set of processes distributed over different machines. The Driver Program is the process, such as a Scala or Python interpreter, that the user employs to submit tasks for execution.

Using a dedicated API, the user can build task graphs (much like in Dask) and submit them to the Cluster Manager, which assigns the tasks to Executors: the processes that actually run them. In a multi-user system, the Cluster Manager is also responsible for allocating resources on a per-user basis.

The user interacts with the Cluster Manager through the Driver Program. The class that handles communication between the user and the Spark cluster is called SparkContext. This class is able to ...
