Manipulating your RDD in Python
Spark's Python API is more limited than its Java and Scala APIs, but it supports most of the core functionality.
The hallmark of a MapReduce system lies in two commands: map and reduce. You've seen the map function used in the earlier chapters. The map function takes a function that operates on each individual element of the input RDD and produces a new output element. For example, to produce a new RDD in which one has been added to every number, you would use rdd.map(lambda x: x + 1). It's important to understand that the map function and the other Spark functions do not transform the existing elements; instead, they return a new RDD with new elements. The reduce function takes a function that operates on pairs of elements, combining them repeatedly until the RDD has been reduced to a single value; for example, rdd.reduce(lambda a, b: a + b) sums every element.