Spark Programming in Python for Beginners with Apache Spark 3

Video description

If you are looking to expand your knowledge of data engineering, or want to level up your portfolio by adding Spark programming to your skillset, then you are in the right place. This course will help you understand Spark programming and apply that knowledge to build data engineering solutions. The course is example-driven and follows a working-session-like approach: we take a live-coding approach and explain all the concepts needed along the way.

In this course, we will start with a quick introduction to Apache Spark and then set up our environment by installing and using it. Next, we will learn about the Spark execution model and architecture, followed by the Spark programming model and developer experience. We will then cover the Spark structured API foundation before moving on to Spark data sources and sinks.

Then we will cover Spark DataFrame and Dataset transformations. We will also cover aggregations in Apache Spark and, finally, Spark DataFrame joins.

By the end of this course, you will be able to build data engineering solutions using the Spark structured API in Python.

What You Will Learn

  • Learn the Apache Spark foundation and Spark architecture
  • Learn data engineering and data processing in Spark
  • Work with data sources and sinks
  • Use the PyCharm IDE for Spark development and debugging
  • Learn unit testing, managing application logs, and cluster deployment


This course is designed for software engineers willing to develop data engineering pipelines and applications using Apache Spark; for data architects and data engineers who are responsible for designing and building their organization’s data-centric infrastructure; and for managers and architects who do not work directly on Spark implementations but work with the people who implement Apache Spark at the ground level.

This course does not require any prior knowledge of Apache Spark or Hadoop; only programming knowledge of the Python programming language is required.

About The Author

ScholarNest: ScholarNest is a small team of people passionate about helping others learn and grow in their careers by bridging the gap between their existing and required skills.

Together, they have more than 40 years of experience in IT as developers, architects, consultants, trainers, and mentors. They have worked with international software services organizations on various data-centric and Big Data projects.

They are firm believers in lifelong continuous learning and skill development. To popularize the importance of continuous learning, they started publishing free training videos on their YouTube channel, creating a journal of their learning under the Learning Journal banner.

Table of contents

  1. Chapter 1 : Apache Spark Introduction
    1. Big Data History and Primer
    2. Understanding the Data Lake Landscape
    3. What is Apache Spark - An Introduction and Overview
  2. Chapter 2 : Installing and Using Apache Spark
    1. Spark Development Environments
    2. Mac Users - Apache Spark in Local Mode Command Line REPL
    3. Windows Users - Apache Spark in Local Mode Command Line REPL
    4. Mac Users - Apache Spark in the IDE - PyCharm
    5. Windows Users - Apache Spark in the IDE - PyCharm
    6. Apache Spark in Cloud - Databricks Community and Notebooks
    7. Apache Spark in Anaconda - Jupyter Notebook
  3. Chapter 3 : Spark Execution Model and Architecture
    1. Execution Methods - How to Run Spark Programs?
    2. Spark Distributed Processing Model - How Your Program Runs?
    3. Spark Execution Modes and Cluster Managers
    4. Summarizing Spark Execution Models - When to Use What?
    5. Working with PySpark Shell - Demo
    6. Installing Multi-Node Spark Cluster - Demo
    7. Working with Notebooks in Cluster - Demo
    8. Working with Spark Submit - Demo
    9. Section Summary
  4. Chapter 4 : Spark Programming Model and Developer Experience
    1. Creating Spark Project Build Configuration
    2. Configuring Spark Project Application Logs
    3. Creating Spark Session
    4. Configuring Spark Session
    5. Data Frame Introduction
    6. Data Frame Partitions and Executors
    7. Spark Transformations and Actions
    8. Spark Jobs Stages and Task
    9. Understanding your Execution Plan
    10. Unit Testing Spark Application
    11. Rounding off Summary
  5. Chapter 5 : Spark Structured API Foundation
    1. Introduction to Spark APIs
    2. Introduction to Spark RDD API
    3. Working with Spark SQL
    4. Spark SQL Engine and Catalyst Optimizer
    5. Section Summary
  6. Chapter 6 : Spark Data Sources and Sinks
    1. Spark Data Sources and Sinks
    2. Spark DataFrameReader API
    3. Reading CSV, JSON and Parquet files
    4. Creating Spark DataFrame Schema
    5. Spark DataFrameWriter API
    6. Writing Your Data and Managing Layout
    7. Spark Databases and Tables
    8. Working with Spark SQL Tables
  7. Chapter 7 : Spark Dataframe and Dataset Transformations
    1. Introduction to Data Transformation
    2. Working with Dataframe Rows
    3. DataFrame Rows and Unit Testing
    4. Dataframe Rows and Unstructured data
    5. Working with Dataframe Columns
    6. Creating and Using UDF
    7. Misc Transformations
  8. Chapter 8 : Aggregations in Apache Spark
    1. Aggregating Dataframes
    2. Grouping Aggregations
    3. Windowing Aggregations
  9. Chapter 9 : Spark Dataframe Joins
    1. Dataframe Joins and Column Name Ambiguity
    2. Outer Joins in Dataframe
    3. Internals of Spark Join and shuffle
    4. Optimizing Your Joins
    5. Implementing Bucket Joins
  10. Chapter 10 : Keep Learning
    1. Final Word

Product information

  • Title: Spark Programming in Python for Beginners with Apache Spark 3
  • Author(s): ScholarNest
  • Release date: February 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781803246161