Chapter 13. Advanced RDDs
Chapter 12 explored the basics of single RDD manipulation. You learned how to create RDDs and why you might want to use them. In addition, we discussed map, filter, reduce, and how to create functions to transform single RDD data. This chapter covers advanced RDD operations and focuses on key–value RDDs, a powerful abstraction for manipulating data. We also touch on more advanced topics such as custom partitioning, one of the reasons you might want to use RDDs in the first place. With a custom partitioning function, you can control exactly how data is laid out across the cluster and manipulate each individual partition accordingly. Before we get there, let’s summarize the key topics we will cover:
- Aggregations and key–value RDDs
- Custom partitioning
- RDD joins
Note
These APIs have been around since essentially the beginning of Spark, and there are many examples of them across the web, so it is easy to search for and find examples that show how to use these operations.
Let’s use the same dataset we used in the last chapter:
// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
val words = spark.sparkContext.parallelize(myCollection, 2)

# in Python
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
  .split(" ")
words = spark.sparkContext.parallelize(myCollection, 2)
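As a preview of the custom partitioning mentioned in the chapter introduction, here is a minimal plain-Python sketch of how keyed records might be laid out by a partition function. It does not use Spark at all. The keying rule (lowercased first letter), the names `first_letter_partitioner` and `layout`, and the choice of which keys go to which partition are all hypothetical, chosen only for illustration.

```python
# Plain-Python sketch: no Spark required. It only mimics how key-value
# records could be assigned to partitions by a custom partition function.

my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# Key each word by its lowercased first character (a hypothetical keying choice).
pairs = [(word[0].lower(), word) for word in my_collection]


def first_letter_partitioner(key):
    """Hypothetical partition function: keys 's' and 'd' map to 0, all others to 1."""
    return 0 if key in ("s", "d") else 1


def layout(pairs, num_partitions, partition_func):
    """Group (key, value) pairs by partition index, taken modulo num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions


partitions = layout(pairs, 2, first_letter_partitioner)
for index, contents in enumerate(partitions):
    print(index, contents)
```

With a live SparkSession, a roughly comparable PySpark expression would be `words.keyBy(lambda w: w[0].lower()).partitionBy(2, first_letter_partitioner)`, assuming the `words` RDD defined above. The sketch only shows the assignment logic, not how Spark distributes the partitions across executors.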
Key-Value Basics (Key-Value RDDs)
There are many methods on RDDs that require ...