Data structure-based transformations are functions that operate on the underlying data structure of an RDD, namely its partitions. With these functions, you work on the partitions directly rather than on the individual elements inside the RDD. They become essential in any Spark program beyond the simplest ones, where you need more control over the partitions and how they are distributed across the cluster. Typically, you can improve performance by redistributing the data partitions according to the cluster state, the size of the data, and the requirements of the use case.
Examples of such transformations are:
- partitionBy
- repartition
- zipWithIndex
- coalesce
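
As a rough illustration of how the four transformations listed above change the partitioning of an RDD, here is a minimal sketch. The object name `PartitionSketch`, the helper `partitionCounts`, the sample data, and the `local[2]` master are assumptions made for this example only; the partition counts you would choose in practice depend on your cluster and data size.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; the names and sample data are illustrative only.
object PartitionSketch {

  // Applies each transformation in turn and returns the partition count
  // observed after every step.
  def partitionCounts(sc: SparkContext): Seq[Int] = {
    // Start with 12 numbers spread over 3 partitions.
    val numbers = sc.parallelize(1 to 12, numSlices = 3)

    // repartition performs a full shuffle and can raise or lower the count.
    val widened = numbers.repartition(6)

    // coalesce merges partitions and, by default, avoids a full shuffle,
    // so it is typically used to reduce the partition count.
    val narrowed = widened.coalesce(2)

    // zipWithIndex pairs each element with its position in the RDD;
    // the number of partitions is left unchanged.
    val indexed = numbers.zipWithIndex()

    // partitionBy is defined on key-value RDDs and places records
    // according to the supplied partitioner.
    val byKey = numbers.map(n => (n % 4, n)).partitionBy(new HashPartitioner(4))

    // indexed still has 3 partitions, the same as numbers.
    assert(indexed.getNumPartitions == numbers.getNumPartitions)

    Seq(numbers.getNumPartitions, widened.getNumPartitions,
        narrowed.getNumPartitions, byKey.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Local master with two threads, chosen only so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

In this sketch, `repartition` and `coalesce` change only the number of partitions, whereas `partitionBy` also decides which partition each key lands in, which is why it needs a pair RDD and a partitioner.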
The following ...