May 2017
Intermediate to advanced
270 pages
6h 18m
English
The easiest way to create an RDD in Python is with the SparkContext.parallelize method. We used this method earlier to parallelize a collection of the integers from 0 to 999:
rdd = sc.parallelize(range(1000))
# Result:
# PythonRDD[3] at RDD at PythonRDD.scala:48
The rdd collection is divided into a number of partitions, which in this case defaults to four (the default can be changed through configuration options). To specify the number of partitions explicitly, pass an extra argument to parallelize:
rdd = sc.parallelize(range(1000), 2)
rdd.getNumPartitions()  # This function will return the number of partitions
# Result:
# 2
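To make the idea of partitions more concrete, the following is a minimal pure-Python sketch of how a collection might be cut into contiguous, near-equal chunks. It does not use Spark at all, and the helper name split_into_partitions is hypothetical; it is meant only as an illustration of the concept, not as Spark's actual implementation.

```python
def split_into_partitions(data, num_partitions):
    """Split a sequence into num_partitions contiguous, near-equal chunks.

    Hypothetical helper for illustration only; not part of the Spark API.
    """
    items = list(data)
    size = len(items)
    chunks = []
    for i in range(num_partitions):
        # Integer boundaries spread any remainder across the chunks
        start = i * size // num_partitions
        end = (i + 1) * size // num_partitions
        chunks.append(items[start:end])
    return chunks


partitions = split_into_partitions(range(1000), 2)
print(len(partitions))               # 2
print([len(p) for p in partitions])  # [500, 500]
```

On a real RDD, the contents of each partition can be inspected with rdd.glom().collect(), which returns one list per partition.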
RDDs support a lot of functional ...