Introduction to Spark Distributed Processing

Lesson Objectives

By the end of this lesson, you will be able to:

  • Write Python programs that execute parallel operations inside a Spark cluster
  • Create and transform resilient distributed datasets
  • Write standalone Python programs to interact with Spark
  • Build DataFrames and perform SQL queries

In this lesson, you will interact with Spark using Python.


Apache Spark is a cluster computing framework. It provides a collection of APIs for performing general-purpose computation on clustered systems.
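
To make these APIs concrete, here is a minimal sketch of the two styles this lesson works with: a resilient distributed dataset that is filtered, mapped, and reduced in parallel, and a DataFrame that is queried with SQL. It assumes PySpark is installed and runs in local mode rather than on a real cluster. The helper names (build_session, is_even, square, sum_of_even_squares, long_titles, main) and the sample rows are hypothetical and chosen only for illustration.

```python
# A minimal sketch, assuming PySpark is installed and a local JVM is available.
# All function names and sample data here are hypothetical illustrations.


def build_session(app_name="intro-sketch"):
    """Create (or reuse) a SparkSession that runs locally on all cores."""
    # Imported here so the pure helpers below can be tested without Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")
        .appName(app_name)
        .getOrCreate()
    )


def is_even(n):
    """Predicate passed to RDD.filter."""
    return n % 2 == 0


def square(n):
    """Function passed to RDD.map."""
    return n * n


def sum_of_even_squares(spark, numbers):
    """Distribute numbers as an RDD, then filter, map, and sum in parallel."""
    rdd = spark.sparkContext.parallelize(numbers)
    return rdd.filter(is_even).map(square).sum()


def long_titles(spark, rows, min_minutes):
    """Build a DataFrame from (title, minutes) rows and query it with SQL."""
    df = spark.createDataFrame(rows, ["title", "minutes"])
    df.createOrReplaceTempView("titles")
    # int() keeps the interpolated value a plain number.
    query = "SELECT title FROM titles WHERE minutes >= {} ORDER BY title".format(
        int(min_minutes)
    )
    return [row["title"] for row in spark.sql(query).collect()]


def main():
    """Run both examples against a local session, then shut it down."""
    spark = build_session()
    try:
        print(sum_of_even_squares(spark, range(1, 11)))
        rows = [("Title A", 95), ("Title B", 42), ("Title C", 120)]
        print(long_titles(spark, rows, 90))
    finally:
        spark.stop()
```

Calling main() from a standalone script runs both examples against the local session. The filter and map steps are lazy transformations; Spark only does work when an action such as sum or collect asks for a result.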

We can illustrate how Spark can be used in the real world with the example of a content provider that delivers movies, documentaries, and TV shows across ...

Get Big Data Processing with Apache Spark now with O’Reilly online learning.