Live online training

Working with Data at Scale in PySpark

How to process large datasets in a cloud environment

Topic: Data
Sahil Jhangiani
Sev Leonard

Even though personal computers are getting more powerful every year, there’s a limit to how much data processing can be done within a local development environment. A consistent, easily redeployable, and scalable environment is a must, particularly in production with multiple engineers. This is where cloud services come in.

Experts Sev Leonard and Sahil Jhangiani take you through the initial transition from local processing to Jupyter notebooks backed by AWS’s Elastic MapReduce (EMR) and Elastic Compute Cloud (EC2). You’ll learn how to start up a cluster in EMR, write a basic data ingestion and analytics pipeline, and compare and tune the performance of Spark 2+ and pandas. You’ll leave ready to begin writing data ingestion and processing pipelines without falling into some common cost pitfalls.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The tricks and pitfalls around efficiently processing large datasets from both a cost and performance viewpoint
  • How to identify the right tools for the job in the sometimes confusing cloud-based computing landscape
  • How to scale up data analytics and data science workloads to production data pipelines
  • How to use big data optimization techniques to transition workloads from tools like pandas to Spark
  • How to keep costs low and avoid runaway costs in a cloud environment

And you’ll be able to:

  • Set up a development environment with resources hosted in the cloud
  • Build modular and efficient code that scales well with both volume and complexity and doesn’t break the bank with cloud compute costs
  • Debug and handle the often obscure errors and issues that can occur in this type of environment

This training course is for you because...

  • You want to become a more effective data engineer, data analyst, data scientist, or software engineer.
  • You work with large, complex datasets (or plan to in the future).

Prerequisites


  • A fundamental knowledge of data processing, SQL, Python, and Scala
  • Familiarity with Spark (useful but not required)
  • Complete the course setup instructions

Course Set-up

  • Follow the instructions posted here


About your instructors

  • Sahil Jhangiani is a senior software engineer at Nuna, where he works with the Centers for Medicare and Medicaid Services (CMS). He started his career in data while working with the Department of Energy, the Department of Housing and Urban Development, and CMS. He then became a big data engineer at Bethesda Softworks, where he worked on the data from series such as The Elder Scrolls, Doom, Wolfenstein, and Fallout. Sahil has worked on engineering and analytics efforts in small- to enterprise-scale batch and real-time environments and has worked with analysts of various skill levels, ranging from old-school financial analysts and accountants to highly technically literate data scientists.

  • Sev Leonard is a senior software engineer at Fletch, a devoted cat dad, and an outdoors enthusiast. His interest in data began when he was an analog engineer working on Intel’s Core microprocessors. Since then he has worked on data management solutions in healthcare, enabling groundbreaking advances in cancer research and building terabyte-scale data platforms for Medicaid and CHIP. He’s active in the Python community as a mentor and presenter at PyCon, PyCascades, and local meetup groups.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction (40 minutes)

  • Presentation: Cloud services overview; connecting services
  • Group discussion: Why use a cluster and Spark?
  • Jupyter Notebook exercises: Spin up an Elastic MapReduce (EMR) cluster with Jupyter notebooks, Spark, and the ability to connect to Simple Storage Service (S3), CloudWatch, and external data sources
  • Q&A

Break (5 minutes)

Building a big data pipeline: Part 1 (40 minutes)

  • Presentation: A comparison of the interface, performance, and backend methodologies of PySpark 2.4, pandas, and Spark 3; their relation to SQL
  • Jupyter Notebook exercises: Write a base ingestion to pull data from an external API endpoint; write another ingestion to pull data from an external S3 bucket; perform data transformations ranging from simple to more complex; walk through the debugging process for a few expected errors

Break (5 minutes)

Building a big data pipeline: Part 2 (40 minutes)

  • Presentation: Storage and partitioning methodologies
  • Jupyter Notebook exercises: Write basic testing methodologies; run the pipeline from start to finish; write out the resulting data for long-term storage
  • Q&A
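
For a flavor of the write-out step, here is a minimal sketch of a partitioned write in PySpark. It is not taken from the course materials: the function names, the bucket path, and the column names are hypothetical, and `df` is assumed to be a PySpark DataFrame created elsewhere in the notebook.

```python
def write_partitioned(df, output_path, partition_cols):
    """Write a PySpark DataFrame as Parquet, one directory per partition value.

    `df` is assumed to be a pyspark.sql.DataFrame; this helper only chains
    the DataFrameWriter calls mode(), partitionBy(), and parquet().
    """
    (
        df.write
        .mode("overwrite")             # replace any existing output at this path
        .partitionBy(*partition_cols)  # e.g. "year", "month"
        .parquet(output_path)
    )


def partition_dir(output_path, **partition_values):
    """Return the Hive-style directory for one combination of partition values."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([output_path.rstrip("/")] + parts)
```

A call such as `write_partitioned(events_df, "s3://example-bucket/events/", ["year", "month"])` (hypothetical names) would lay the data out in directories like the one returned by `partition_dir("s3://example-bucket/events/", year=2021, month=3)`. Queries that filter on the partition columns can then skip directories they do not need.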

Break (5 minutes)

Scaling and optimizing workloads: Part 1 (40 minutes)

  • Jupyter Notebook exercises: Rework the pipeline from the previous section with a focus on optimization and reusability; find bottlenecks through the Spark UI and query planner; rewrite sections of the codebase to utilize partitioning and multiprocessing

Break (5 minutes)

Scaling and optimizing workloads: Part 2 (40 minutes)

  • Jupyter Notebook exercises: Using a new data source, explore how to properly generalize code to be reusable; discover Spark’s unique strengths for this use case
  • Q&A

Spinning down resources (10 minutes)

  • Jupyter Notebook exercises: Save out data for long-term storage; spin down AWS resources
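
Spinning down can also be scripted. The sketch below is one possible approach, not the course's own procedure: it assumes a boto3 EMR client is passed in, and the helper name `terminate_active_clusters` is hypothetical. It reads only the first page of `list_clusters` results, which is enough for a small training account.

```python
# EMR cluster states that still accrue cost.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def terminate_active_clusters(emr_client):
    """Terminate every active EMR cluster visible to the client; return their IDs.

    `emr_client` is assumed to be a boto3 EMR client, e.g. boto3.client("emr").
    Only the first page of list_clusters results is handled here.
    """
    response = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)
    cluster_ids = [cluster["Id"] for cluster in response.get("Clusters", [])]
    if cluster_ids:
        emr_client.terminate_job_flows(JobFlowIds=cluster_ids)
    return cluster_ids
```

Note that this terminates every active cluster in the client's account and region, so in a shared account you would want to filter by cluster name or tag first.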

Wrap-up and Q&A (10 minutes)