Data structure-based transformations

Data structure-based transformations are transformation functions that operate on the underlying data structure of the RDD, that is, its partitions. With these functions, you work on the partitions directly rather than on the individual elements inside the RDD. They are essential in any Spark program beyond the simplest ones, where you need more control over the partitions and how they are distributed across the cluster. Typically, performance improvements can be realized by redistributing the data partitions according to the cluster state, the size of the data, and the exact use case requirements.

Examples of such transformations are:

  • partitionBy
  • repartition
  • zipWithIndex
  • coalesce
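
As a rough illustration of the four transformations listed above, the following minimal sketch applies each one to a small RDD and prints the resulting partition counts. The local master setting, the application name, the object name, and the sample data are assumptions made for this example only; they do not come from the text.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionTransformationsSketch {
  def main(args: Array[String]): Unit = {
    // Assumed setup: a local master with four threads, for illustration only.
    val conf = new SparkConf().setMaster("local[4]").setAppName("partition-sketch")
    val sc = new SparkContext(conf)

    // Hypothetical sample data: ten integers spread over four partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)
    println(numbers.getNumPartitions) // 4

    // repartition performs a full shuffle and can raise or lower the partition count.
    val wider = numbers.repartition(8)
    println(wider.getNumPartitions) // 8

    // coalesce lowers the partition count and, by default, avoids a shuffle.
    val narrower = numbers.coalesce(2)
    println(narrower.getNumPartitions) // 2

    // partitionBy is available only on key-value RDDs and takes a Partitioner.
    val pairs = numbers.map(n => (n % 3, n))
    val byKey = pairs.partitionBy(new HashPartitioner(3))
    println(byKey.getNumPartitions) // 3

    // zipWithIndex pairs each element with its Long position in the RDD.
    val indexed = numbers.zipWithIndex()
    indexed.collect().foreach(println)

    sc.stop()
  }
}
```

In this sketch, coalesce is used to reduce the number of partitions because it can merge existing partitions without a shuffle, whereas repartition always shuffles. Which one suits a given job depends on the data size and cluster state mentioned earlier.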

The following ...
