Shared variables

We touched upon shared variables in Chapter 2, Transformations and Actions with Spark RDDs, but did not go into detail because this is a slightly advanced topic, with many nuances around what can and cannot be shared. To briefly recap, we discussed two types of shared variables:

  • Broadcast variables
  • Accumulators

Broadcast variables

Spark is a massively parallel processing (MPP) architecture in which multiple nodes work in parallel to carry out operations efficiently. As the name indicates, a broadcast variable is useful when you want each node to have its own copy of an input or interim data set, and hence broadcast that data across the cluster. From previous knowledge we know that Spark does some internal broadcasting of data while executing various actions. ...
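To make the idea of one copy per node concrete, the following is a minimal PySpark sketch rather than an example taken from this chapter. It assumes a local Spark installation; the lookup table, the helper names resolve_country and run_with_spark, and the application name are all hypothetical and chosen only for illustration. The calls used (SparkContext.broadcast, the .value attribute, parallelize, map, and collect) are part of the standard PySpark API.

```python
def resolve_country(code, lookup):
    """Return the country name for a code, or "unknown" if the code is missing."""
    return lookup.get(code, "unknown")


def run_with_spark():
    # Imported inside the function so that resolve_country stays usable
    # on a machine without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")            # assumption: a local two-thread master
        .appName("broadcast-sketch")   # hypothetical application name
        .getOrCreate()
    )
    sc = spark.sparkContext

    # Hypothetical lookup table, small enough to fit in memory on every executor.
    country_names = {"US": "United States", "IN": "India", "DE": "Germany"}

    # Ship the table to each executor once, instead of with every task closure.
    names_bc = sc.broadcast(country_names)

    codes = sc.parallelize(["US", "IN", "DE", "XX"])

    # Each task reads its executor-local copy through .value.
    resolved = codes.map(lambda c: resolve_country(c, names_bc.value)).collect()

    spark.stop()
    return resolved
```

Calling run_with_spark() on a machine with PySpark available should return the country names in the same order as the input codes, with "unknown" for the code that is not in the table. The broadcast value is treated as read-only here; modifying it inside a task would only change that executor's local copy.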
