Live Online Training

Getting Started with PySpark

Scaling data processing from a single machine to a distributed system

Hanna Torrence

As the amount of data produced every day across a wide range of fields grows, data processing quickly bumps up against the limits of a single machine. PySpark lets analysts, engineers, and data scientists who are comfortable working in Python move to a distributed system and take advantage of Python's mature data libraries alongside the power of a cluster. While on the surface PySpark dataframes appear very similar to Pandas or R dataframes, the fact that the data is distributed introduces some complicating subtleties to familiar commands. In this course you will learn how to think about distributed data, parse opaque Spark stacktraces, navigate the Spark UI, and build your own data pipelines in PySpark.
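
To make the comparison concrete, here is a minimal sketch (illustrative only, not course material) of the same filter-and-aggregate step in Pandas and in PySpark, assuming a local SparkSession and made-up column names; note that the PySpark version computes nothing until an action such as show() is called:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Pandas: runs immediately on in-memory data
    pdf = pd.DataFrame({"category": ["a", "a", "b"], "amount": [1.0, 2.0, 3.0]})
    pandas_result = pdf[pdf["amount"] > 1.0].groupby("category")["amount"].mean()

    # PySpark: the same steps only build a plan over distributed data
    sdf = spark.createDataFrame(pdf)
    spark_result = (
        sdf.filter(F.col("amount") > 1.0)
           .groupBy("category")
           .agg(F.mean("amount").alias("mean_amount"))
    )
    spark_result.show()  # nothing is computed until an action like show()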

This course is aimed at people who have experience coding in Python and have at least a basic familiarity with Pandas or R dataframes. You'll walk away with the ability to translate single machine data processing into PySpark code, aware of the gotchas that could trip you up along the way. Together we'll build an example pipeline demonstrating key ideas, using public data from the Chicago City Data Portal to build a simple classification model.

What you'll learn, and how you can apply it

  • How PySpark differs from Pandas, and how to translate data pipelines between the two
  • How to debug common issues in PySpark by interpreting stacktraces and navigating the Spark UI

This training course is for you because...

  • You work with data regularly and want to be able to scale up the quantity of data processed
  • You’ve tried writing PySpark code before, but became frustrated by opaque errors or unexpected results
  • You want to learn how companies handle pipelines of terabyte-scale data coming in every day

Prerequisites

  • Experience writing Python code - for example, you should feel comfortable working with Python functions, list comprehensions, and dictionaries.
  • Experience working with dataframes, such as in Pandas or R, or experience with data transformations in SQL - for example, you should be familiar with joins and group by clauses (the short sketch below illustrates roughly the level assumed).
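
If you can read a short snippet like the following without trouble, you have the assumed background (an illustrative sketch with made-up data, not course material):

    import pandas as pd

    # Python basics: functions, list comprehensions, and dictionaries
    def label(amount):
        return "high" if amount > 10 else "low"

    labels = [label(x) for x in [3, 12, 7]]
    counts = {name: labels.count(name) for name in set(labels)}

    # Dataframe basics: a join followed by a group-by
    orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [3, 12, 7]})
    users = pd.DataFrame({"user_id": [1, 2], "city": ["Chicago", "Denver"]})
    per_city = orders.merge(users, on="user_id").groupby("city")["amount"].sum()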

Course Set-up

Course materials will be distributed in a Docker container, so you will need to have Docker installed on your machine (you don’t need to be familiar with Docker beyond this - you’ll be provided with the necessary setup commands). The Docker container will be set up to run Jupyter Notebook with PySpark, so you’ll also need access to a browser to view the notebooks and the Spark UI.

Recommended Follow-up

  • Spark: The Definitive Guide has an excellent in-depth introduction to the concepts behind how Spark works.
  • High Performance Spark is not purely PySpark, but has excellent tips on performance, testing, and debugging.
  • Hadoop and Spark Fundamentals can teach you more about how to set up the infrastructure around Spark jobs and how Spark interacts with different data streams.
  • PySpark Cookbook has a wide variety of code examples to demonstrate different PySpark functions.
  • Debugging PySpark dives deep into tracking down useful error messages in PySpark.

About your instructor

  • Hanna works as a Data Scientist at ShopRunner, an ecommerce company in Chicago. After a summer fellowship with Data Science for Social Good introduced her to the joys of coaxing useful insights from messy datasets, she joined ShopRunner to help grow a thriving team of data scientists exploring and building tools from a rich network of data.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Intro to Spark (20 min)

  • Concepts: lazy evaluation, data partitioning, and jobs vs. stages vs. tasks (see the sketch below)
  • Participants will: load data and perform simple operations while exploring the Spark UI
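
A minimal sketch of the lazy-evaluation idea covered in this segment (assuming a local SparkSession; not the course dataset):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    df = spark.range(1_000_000)                           # a simple distributed dataset
    doubled = df.withColumn("doubled", F.col("id") * 2)   # transformation: only builds a plan
    filtered = doubled.filter(F.col("doubled") > 100)     # still nothing has executed

    print(filtered.count())           # action: triggers a job, broken into stages and tasks
    print(df.rdd.getNumPartitions())  # how the rows are split across partitions
    # The job, its stages, and its tasks appear in the Spark UI (http://localhost:4040 by default)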

Data Exploration (30 min)

  • Concepts: data sources, visualizations, and summary statistics (see the sketch below)
  • Participants will: explore the dataset
  • Break (10 min)
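
A rough sketch of the kind of exploration covered in this segment; the file path and column names are placeholders rather than the actual Chicago dataset:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Placeholder path and columns, not the course data
    df = spark.read.csv("data/chicago_example.csv", header=True, inferSchema=True)

    df.printSchema()                            # inspect the inferred column types
    df.describe("some_numeric_column").show()   # basic summary statistics
    df.groupBy("some_category").count().orderBy(F.desc("count")).show(10)

    # Small aggregates can be pulled back to Pandas for plotting
    top_counts = df.groupBy("some_category").count().limit(20).toPandas()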

Data Processing (40 min)

  • Concepts: column transformations, aggregations, and user-defined functions (UDFs); see the sketch below
  • Participants will: build a set of features from the dataset
  • Break (10 min)
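
A rough sketch of the kinds of transformations covered here, with made-up columns; built-in functions are generally preferred over UDFs when one exists:

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame(
        [("2021-03-01", 12.5), ("2021-03-02", 40.0)], ["event_date", "amount"]
    )

    # Column transformations with built-in functions
    df = df.withColumn("event_date", F.to_date("event_date"))
    df = df.withColumn("log_amount", F.log1p("amount"))

    # A user-defined function for logic the built-ins don't cover
    @F.udf(returnType=T.StringType())
    def bucket(amount):
        return "high" if amount is not None and amount > 20 else "low"

    df = df.withColumn("amount_bucket", bucket("amount"))

    # Aggregations
    df.groupBy("amount_bucket").agg(F.avg("amount").alias("avg_amount")).show()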

Debugging Errors (30 min)

  • Concepts: reading Spark stacktraces, navigating the Spark UI, implications of lazy evaluation (see the sketch below)
  • Participants will: work through debugging errors in provided code
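
A small, deliberately broken sketch of the kind of error explored in this segment: because of lazy evaluation, the bad value only causes a failure when an action runs, and the underlying Python error is buried inside a long Py4J / executor stacktrace (toy data, not course material):

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("10",), ("oops",)], ["raw_value"])

    @F.udf(returnType=T.IntegerType())
    def parse_int(value):
        return int(value)  # will fail on "oops", but not yet

    parsed = df.withColumn("value", parse_int("raw_value"))  # no error here: lazy

    # The failure only surfaces when an action forces execution; the Python
    # ValueError sits deep inside the resulting stacktrace
    parsed.show()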

Building a Model (20 min)

  • Concepts: SparkML, caching, and persistence (see the sketch below)
  • Participants will: build a classification model on top of the created features
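
A rough sketch of a SparkML classification pipeline with caching, using toy data and feature names rather than the course's Chicago features:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Toy feature table: two numeric features and a binary label
    data = spark.createDataFrame(
        [(1.0, 3.0, 0.0), (2.0, 1.0, 1.0), (0.5, 4.0, 0.0), (3.0, 0.5, 1.0)],
        ["f1", "f2", "label"],
    )
    data.cache()  # model fitting scans the data repeatedly, so keep it in memory

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    train = assembler.transform(data)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(train).select("label", "prediction").show()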

Final Q&A (10 min)