Experimenting with Resilient Distributed Datasets

Now let's create a Resilient Distributed Dataset containing integers from 0 to 9. To do so, we can use the parallelize method provided by the SparkContext object:

In: numbers = range(10)
    numbers_rdd = sc.parallelize(numbers)
    numbers_rdd

Out: PythonRDD[2672] at RDD at PythonRDD.scala:49

As you can see, you can't simply print the RDD content: it is split into multiple partitions (and distributed across the cluster). The default number of partitions is twice the number of CPUs (so it's four in the provided VM), but you can set it manually with the second argument of the parallelize method, numSlices.
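To make the idea of partitions more concrete, here is a minimal plain-Python sketch of how a sequence could be cut into a given number of contiguous chunks. The helper name split_into_partitions is hypothetical, and the even-split rule is an assumption chosen for illustration. It is a conceptual model and not Spark's actual implementation, which also handles serialization and placement on the cluster.

```python
def split_into_partitions(data, num_slices):
    """Divide `data` into `num_slices` contiguous, near-equal chunks.

    Hypothetical helper for illustration only; not part of the Spark API.
    """
    items = list(data)
    n = len(items)
    # Chunk i covers the half-open index range [i*n // k, (i+1)*n // k),
    # where k is num_slices, so chunk sizes differ by at most one element.
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


numbers = range(10)

# Four chunks, mirroring the four partitions mentioned for the provided VM.
four_parts = split_into_partitions(numbers, 4)

# Two chunks, analogous to passing 2 as the second argument of parallelize.
two_parts = split_into_partitions(numbers, 2)

print(four_parts)
print(two_parts)
```

In a live session, calling numbers_rdd.getNumPartitions() should report how many partitions Spark actually created for the RDD.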

To print out the data contained in the RDD, you should call the collect method. Note that this operation, ...
