In the previous two chapters we followed the steps necessary to make a Spark application run on top of a stable production environment. But those steps alone are not enough—we still want to do more to get the best out of the application. Even though Spark, as a cluster computing framework, capably handles the major problems involved in distributed processing, such as scalability, fault tolerance, job scheduling, and load balancing, writing efficient code that runs distributed across a cluster is not trivial. We always risk encountering performance bottlenecks.
In this chapter we will explore several techniques for improving the performance of Spark jobs and avoiding potential bottlenecks. Understanding Spark's fundamentals is the first step toward writing efficient jobs.
With performance in mind, we start by tackling Spark's execution model, describing the way data is shuffled. We will see why data shuffling should be avoided and when it is appropriate to do it. We will also explain why partitioning is important and how efficiency is affected by the way we choose among Spark's operators.
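As a preview of the operator-choice point, the sketch below contrasts two ways of aggregating a pair RDD. The object and input data are hypothetical; the pattern itself—preferring `reduceByKey` over `groupByKey` because it combines values on each partition before the shuffle—is the standard Spark guidance this chapter elaborates on.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: aggregating counts per key in two ways.
// groupByKey ships every (key, value) pair across the network before
// summing; reduceByKey computes per-partition partial sums first, so
// far less data crosses the shuffle boundary.
object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("shuffle-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // Less efficient: all values for a key are shuffled, then summed.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // Preferred: partial sums are merged locally before the shuffle.
    val viaReduce = pairs.reduceByKey(_ + _)

    viaReduce.collect().foreach(println)
    spark.stop()
  }
}
```

Both versions produce the same result; the difference is how much data moves between executors, which the following sections examine in detail.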
This chapter also includes a section dedicated to data serialization, evaluating the two supported serializers: Java and Kryo.
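To anticipate that discussion, switching to the Kryo serializer is a configuration change rather than a code change. The following is a minimal sketch; the `SensorReading` case class and application name are placeholders, while the `spark.serializer` property and `registerKryoClasses` call are Spark's standard configuration API.

```scala
import org.apache.spark.SparkConf

// Hypothetical application class to be serialized across the cluster.
case class SensorReading(id: String, value: Double)

// Sketch: replacing the default Java serializer with Kryo, which is
// typically faster and produces more compact output.
val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front lets Kryo write a small numeric ID
  // instead of the full class name with every serialized record.
  .registerKryoClasses(Array(classOf[SensorReading]))
```

The trade-offs between the two serializers are evaluated later in the chapter.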
Another important contributor to a Spark application's performance is the caching mechanism. Saving intermediary results or tables in memory can preserve a ...