Spark architecture

A Spark cluster is a set of processes distributed over different machines. The Driver Program is the process, such as a Scala or Python interpreter, through which the user submits the tasks to be executed.

Using a dedicated API, the user can build task graphs, much as in Dask, and submit them to the Cluster Manager, which is responsible for assigning those tasks to Executors, the processes that actually run them. In a multi-user system, the Cluster Manager is also responsible for allocating resources on a per-user basis.
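As a rough analogy (this is not Spark's API), the driver/executor split can be sketched with Python's standard library, where a pool of worker processes stands in for the executors and the main process plays the role of the driver submitting tasks:

```python
from concurrent.futures import ProcessPoolExecutor

def task(x):
    # Work that an "executor" process runs on behalf of the driver.
    return x * x

def run_tasks(values):
    # The calling process acts as the driver: it submits tasks to the
    # pool (our stand-in for executors) and collects the results.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(task, values))

if __name__ == "__main__":
    print(run_tasks(range(5)))  # [0, 1, 4, 9, 16]
```

Unlike this single-machine sketch, Spark's Cluster Manager distributes tasks to Executor processes running on many machines, but the submit-and-collect pattern is the same.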

The user interacts with the Cluster Manager through the Driver Program. The class responsible for communication between the user and the Spark cluster is called SparkContext. This class is able to ...
