Introduction to Spark Distributed Processing

Lesson Objectives

By the end of this lesson, you will be able to:

  • Write Python programs that execute parallel operations inside a Spark cluster
  • Create and transform resilient distributed datasets
  • Write standalone Python programs to interact with Spark
  • Build DataFrames and perform SQL queries

In this lesson, you will be interacting with Spark using Python.


Apache Spark is a cluster computing framework. It provides a collection of APIs for performing general-purpose computation across the machines of a cluster.
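To give a first feel for what these APIs look like from Python, here is a minimal sketch, not a definitive program. It assumes PySpark is installed and runs Spark in local mode rather than on a real cluster. The names square, sum_of_squares, and the application name "intro-sketch" are hypothetical and chosen only for illustration.

```python
def square(x):
    """Plain Python function that Spark applies to each element."""
    return x * x


def sum_of_squares(numbers):
    """Distribute `numbers`, square each element in parallel, and add them up."""
    # Imported here so that `square` stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")        # assumption: local mode, all available cores
        .appName("intro-sketch")   # hypothetical application name
        .getOrCreate()
    )
    try:
        # parallelize() turns a local collection into a distributed dataset (RDD).
        rdd = spark.sparkContext.parallelize(numbers)
        # map() is evaluated lazily; sum() triggers the actual computation.
        return rdd.map(square).sum()
    finally:
        spark.stop()
```

Calling sum_of_squares(range(1, 6)) on a machine with PySpark available starts a local Spark session, squares the numbers 1 through 5 in parallel, and returns their total. On a real cluster, the master setting would point at the cluster manager instead of "local[*]".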

We can illustrate how Spark can be used in the real world with the example of a content provider that delivers movies, documentaries, and TV shows across ...

