Applications commonly begin as a proof of concept, where job failures, extremely large datasets, and required service level agreements (SLAs) don’t come into play. This is especially true with Spark, since most developers first pick up the framework for tasks such as application migration, exploratory analytics, or testing machine learning ideas.
Regardless of how an application starts out, there comes a point in its lifecycle where it has scaled far enough that it no longer just “works out of the box.” The symptoms can include out-of-memory (OOM) exceptions, continuous job failures, or a crashing driver.
This chapter focuses on understanding those issues and on building fault tolerance into your application so that it is ready for the next stage of its lifecycle (i.e., production). Specifically, we’ll explore the how and why of job scheduling, then the concepts and configurations necessary for making your application fault tolerant, and finally the inherent hardware limitations and the optimization techniques present in the later versions of Spark (v1.5 at the time of this writing).
Nor will we neglect Spark’s various components. With the advent of the new DataFrame API, Spark SQL, and others, we need to be sure that fault tolerance and job execution can be sustained, whether using the core of Spark or any one ...