July 2017
Intermediate to advanced
796 pages
18h 55m
English
Resilient Distributed Datasets (RDDs) are the fundamental object used in Apache Spark. RDDs are immutable collections representing datasets and have the inbuilt capability of reliability and failure recovery. By nature, RDDs create new RDDs upon any operation such as transformation or action. They also store the lineage, which is used to recover from failures. We have also seen in the previous chapter some details about how RDDs can be created and what kind of operations can be applied to RDDs.
The following is a simply example of the RDD lineage:

Let's start looking at the simplest RDD again by creating a RDD from a sequence ...
Read now
Unlock full access