Partitioning and shuffling
We have seen how Apache Spark can handle distributed computing much better than Hadoop. We also saw the inner workings, mainly the fundamental data structure known as Resilient Distributed Dataset (RDD). RDDs are immutable collections representing datasets and have the inbuilt capability of reliability and failure recovery. RDDs operate on data not as a single blob of data, rather RDDs manage and operate data in partitions spread across the cluster. Hence, the concept of data partitioning is critical to the proper functioning of Apache Spark Jobs and can have a big effect on the performance as well as how the resources are utilized.
RDD consists of partitions of data and all operations are performed on the partitions ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access