Chapter 1. Introduction to Spark and PySpark
Spark is a powerful analytics engine for large-scale data processing, designed for speed, ease of use, and extensibility in big data applications. It’s a proven and widely adopted technology used by many companies that handle big data every day. Though Spark’s “native” language is Scala (most of Spark itself is written in Scala), it also provides high-level APIs in Java, Python, and R.
In this book we’ll be using Python via PySpark, an API that exposes the Spark programming model to Python. Python is among the most accessible programming languages, and combined with Spark’s powerful and expressive API, PySpark’s simplicity makes it the best choice for this book. PySpark provides the following two important features:
- It allows us to write Spark applications using Python APIs.
- It provides the PySpark shell for interactively analyzing data in a distributed environment.
The purpose of this chapter is to introduce PySpark as the main component of the Spark ecosystem and show you that it can be effectively used for big data tasks such as ETL operations, indexing billions of documents, ingesting millions of genomes, machine learning, graph data analysis, DNA data analysis, and much more. I’ll start by reviewing the Spark and PySpark architectures, and provide examples to show the expressive power of PySpark. I’ll present an overview of Spark’s core functions (transformations and actions) and ...