This chapter covers the oldest foundational concept in Spark: the resilient distributed dataset (RDD). To truly understand how Spark works, you must understand the essence of RDDs; they provide the solid foundation on which Spark's other abstractions are built. The ideas behind RDDs are distinctive in the distributed data processing framework landscape, and they arrived at the right time to address the pressing complexity and efficiency needs of iterative and interactive data processing use cases. Starting with Spark 2.0, Spark users will have fewer ...
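As a minimal sketch of what working with this abstraction looks like (this example is illustrative and not taken from the chapter; the application and variable names are assumed), the following Scala snippet creates an RDD from a local collection, applies a lazy transformation, and triggers computation with an action:

    import org.apache.spark.sql.SparkSession

    object RddIntro {
      def main(args: Array[String]): Unit = {
        // Spark 2.x entry point; the underlying SparkContext is still used for RDD work
        val spark = SparkSession.builder()
          .appName("rdd-intro")
          .master("local[*]")
          .getOrCreate()

        // Create an RDD from a local collection; map is a lazy transformation
        val numbersRdd = spark.sparkContext.parallelize(1 to 10)
        val squaresRdd = numbersRdd.map(n => n * n)

        // Actions such as reduce trigger the actual distributed computation
        val sumOfSquares = squaresRdd.reduce(_ + _)
        println(s"Sum of squares: $sumOfSquares")

        spark.stop()
      }
    }

The split between lazy transformations (map) and actions (reduce) is central to how RDD computations are planned and executed, a theme the rest of the chapter develops.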