Summary

In this chapter, we delved into the Spark RDD parent-child chain and created a multiplier RDD that computes its results from the parent RDD and from the parent's partitioning scheme. We used RDDs in an immutable way: modifying a leaf RDD created from the parent did not modify the parent. We also looked at a higher-level abstraction, the DataFrame, and saw that transformations can be applied there as well. However, every transformation only adds another column; it does not modify anything in place. Next, we looked at immutability in a highly concurrent environment and saw why mutable state is dangerous when it is accessed by multiple threads. Finally, we saw that the Dataset API is also ...
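The immutability idea summarized above can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch, not Spark code: the names `parent`, `child`, and `multiply` are hypothetical stand-ins chosen for this illustration, and a tuple plays the role of an immutable parent dataset. It shows that deriving a child leaves the parent untouched, and that several threads can read the same immutable parent without locks.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a parent dataset: a tuple cannot be modified in place.
parent = (1, 2, 3, 4, 5)


def multiply(records, factor):
    """Return a new tuple derived from records; the input is left untouched."""
    return tuple(value * factor for value in records)


# Deriving a "child" from the parent produces a new object.
child = multiply(parent, 10)

# Because parent is immutable, several threads can read it at once without locks.
with ThreadPoolExecutor(max_workers=4) as pool:
    children = list(pool.map(lambda factor: multiply(parent, factor), range(1, 5)))

print(parent)       # (1, 2, 3, 4, 5)
print(child)        # (10, 20, 30, 40, 50)
print(children[1])  # (2, 4, 6, 8, 10)
```

In Spark the same principle applies at a larger scale: a transformation describes a new RDD or DataFrame rather than altering the one it was derived from.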
