Experimenting with Resilient Distributed Datasets

Now let's create a Resilient Distributed Dataset containing integers from 0 to 9. To do so, we can use the parallelize method provided by the SparkContext object:

In: numbers = range(10)
    numbers_rdd = sc.parallelize(numbers)
    numbers_rdd

Out: PythonRDD[2672] at RDD at PythonRDD.scala:49

As you can see, you can't simply print the RDD content, as it is split into multiple partitions (and distributed in the cluster). The default number of partitions is twice the number of CPUs (so, it's four in the provided VM), but it can be set manually using the second argument of the parallelize method.

To print out the data contained in the RDD, you should call the collect method. Note that this operation, ...
