Fast Data Processing with Spark by Holden Karau

Manipulating your RDD in Python

Spark's Python API is more limited than the Java and Scala APIs, but it supports most of the core functionality.

The hallmarks of a MapReduce system are its two core operations: map and reduce. You've seen the map function used in previous chapters. The map function takes a function that is applied to each individual element of the input RDD and produces a new output element. For example, to produce a new RDD where one has been added to every number, you would use rdd.map(lambda x: x+1). It's important to understand that the map function and the other Spark functions do not transform the existing elements; rather, they return a new RDD with the new elements. The reduce function takes a function that operates on pairs of elements, combining them repeatedly until the entire RDD has been reduced to a single value.
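To make this concrete, here is a minimal PySpark sketch that adds one to every element with map and then sums the original elements with reduce. It assumes a local SparkContext created in a standalone script (in the pyspark shell, sc is already provided), and the sample numbers are purely illustrative:

from pyspark import SparkContext

# Create a local SparkContext; skip this in the pyspark shell, where sc exists already.
sc = SparkContext("local", "map-reduce-example")

# A small, illustrative input RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# map: apply the lambda to every element, returning a new RDD; the original is unchanged.
incremented = numbers.map(lambda x: x + 1)
print(incremented.collect())   # [2, 3, 4, 5, 6]

# reduce: combine elements pairwise until a single value remains.
total = numbers.reduce(lambda a, b: a + b)
print(total)                   # 15

sc.stop()

Because map returns a new RDD rather than modifying numbers in place, you can keep using numbers afterwards, as the reduce call above does.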
