June 2018
Intermediate to advanced
330 pages
9h 47m
English
Under the covers, quite a few things happened when you created your RDD. Let's start with the RDD creation and break down this code snippet:
myRDD = sc.parallelize([('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)])
Focusing first on the argument passed to sc.parallelize(), we created a Python list (that is, [A, B, ..., E]) composed of tuples (that is, ('Mike', 19), ('June', 18), ..., ('Scott', 17)). sc.parallelize() is the SparkContext method that creates a parallelized collection from that list. This allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data.
Now that we have created ...