Raju Kumar Mishra, PySpark Recipes, https://doi.org/10.1007/978-1-4842-3141-8_4

4. Spark Architecture and the Resilient Distributed Dataset

Raju Kumar Mishra¹

(1) Bangalore, Karnataka, India

You learned Python in the preceding chapter. Now it is time to learn PySpark and use the power of a distributed system to solve big data problems. Typically, a large dataset is distributed across the nodes of a cluster, and processing then runs on that distributed data.
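The idea of distributing data and then processing the pieces can be sketched in plain Python, without Spark. This is only an analogy under stated assumptions: the helper names (`split_into_partitions`, `partial_sum_of_squares`, `distributed_sum_of_squares`) are hypothetical, and a local thread pool stands in for the worker nodes of a cluster. It is not how PySpark is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous, roughly equal chunks (hypothetical helper)."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partial_sum_of_squares(partition):
    """Work done independently on one partition."""
    return sum(x * x for x in partition)


def distributed_sum_of_squares(data, num_partitions=4):
    """Process each partition separately, then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    # Assumption: threads stand in for cluster workers in this sketch.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_of_squares, partitions))
    return sum(partials)
```

For example, `distributed_sum_of_squares(list(range(10)))` returns 285, the same value a single pass over the whole list would produce. The point of the sketch is that each partition is processed without reference to the others, and only the small partial results are combined at the end.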

This chapter covers the following recipes:
Recipe 4-1. Create an RDD
Recipe 4-2. Convert temperature data
Recipe 4-3. Perform basic data manipulation
Recipe 4-4. Run set operations
Recipe 4-5. Calculate summary statistics
Recipe 4-6. Start PySpark shell on Standalone cluster manager ...


PySpark Recipes: A Problem-Solution Approach with PySpark2 by Raju Kumar Mishra
