© Raju Kumar Mishra 2018
Raju Kumar Mishra, PySpark Recipes, https://doi.org/10.1007/978-1-4842-3141-8_4

4. Spark Architecture and the Resilient Distributed Dataset

Raju Kumar Mishra
Bangalore, Karnataka, India

You learned Python in the preceding chapter. Now it is time to learn PySpark and use the power of a distributed system to solve big data problems. We generally distribute large amounts of data across a cluster and then process that distributed data.
This chapter covers the following recipes:

  • Recipe 4-1. Create an RDD

  • Recipe 4-2. Convert temperature data

  • Recipe 4-3. Perform basic data manipulation

  • Recipe 4-4. Run set operations

  • Recipe 4-5. Calculate summary statistics

  • Recipe 4-6. Start PySpark shell on Standalone cluster manager ...
