Manipulating your RDD in Python

Spark has a more limited Python API than Java and Scala, but it supports most of the core functionality.

The hallmarks of a MapReduce system are the two operations map and reduce. You've seen the map function used in the past chapters. The map function works by taking in a function that operates on each individual element in the input RDD and producing a new output element. For example, to produce a new RDD where you have added one to every number, you would use rdd.map(lambda x: x+1). It's important to understand that the map function, like the other Spark functions, does not transform the existing elements; rather, it returns a new RDD with new elements. The reduce function takes a function that operates on pairs to combine all of the elements into a single result.
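As a minimal sketch, the same map and reduce semantics can be demonstrated with plain Python; the comments show the assumed equivalent PySpark calls, where `sc` would be an existing SparkContext:

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# In Spark (assumed): rdd = sc.parallelize(numbers)
#                     plus_one = rdd.map(lambda x: x + 1)
# map produces a new collection; `numbers` is left untouched.
plus_one = list(map(lambda x: x + 1, numbers))

# In Spark (assumed): total = plus_one.reduce(lambda a, b: a + b)
# reduce repeatedly combines pairs of elements into a single value.
total = reduce(lambda a, b: a + b, plus_one)

print(plus_one)  # [2, 3, 4, 5]
print(total)     # 14
```

Note that the combining function passed to reduce should be associative and commutative, since Spark may apply it to partial results from different partitions in any order.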
