Apache Spark 3 for Data Engineering and Analytics with Python

Video description

Apache Spark 3 is an open-source distributed engine for querying and processing data. This course provides a detailed understanding of PySpark and its stack, and is carefully designed to guide you through the process of data analytics using Python Spark. The author uses an interactive approach to explain key concepts of PySpark, such as the Spark architecture, Spark execution, transformations and actions using the structured API, and much more. You will be able to leverage the power of Python, Java, and SQL and put it to use in the Spark ecosystem.

You will start by getting a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark. You will then move on to techniques for collecting, cleaning, and visualizing data by creating dashboards in Databricks. You will learn how to use SQL to interact with DataFrames. The author provides an in-depth review of RDDs and contrasts them with DataFrames.

Multiple problem challenges are provided at intervals throughout the course so that you get a firm grasp of the concepts taught.

What You Will Learn

  • Learn Spark architecture, transformations, and actions using the structured API
  • Learn to set up your own local PySpark environment
  • Learn to interpret DAG (Directed Acyclic Graph) for Spark execution
  • Learn to interpret the Spark web UI
  • Learn the RDD (Resilient Distributed Datasets) API
  • Learn to visualize (graphs and dashboards) data on Databricks


Audience

This course is designed for Python developers who wish to learn how to use the language for data engineering and analytics with PySpark; aspiring data engineering and analytics professionals; data scientists and analysts who wish to learn an analytical processing strategy that can be deployed over a big data cluster; and data managers who want to gain a deeper understanding of managing data over a cluster.

About The Author

David Mngadi is a data management professional who is driven by the power of data in our lives and has helped several companies become more data-driven, both to gain a competitive edge and to meet regulatory requirements. Over the last 15 years, he has had the pleasure of designing and implementing data warehousing solutions in the retail, telco, and banking industries, and more recently in big data lake implementations. He is passionate about technology and teaching programming online.

Table of contents

  1. Chapter 1 : Introduction to Spark and Installation
    1. Introduction
    2. The Spark Architecture
    3. The Spark Unified Stack
    4. Java Installation
    5. Hadoop Installation
    6. Python Installation
    7. PySpark Installation
    8. Install Microsoft Build Tools
    9. MacOS - Java Installation
    10. MacOS - Python Installation
    11. MacOS - PySpark Installation
    12. MacOS - Testing the Spark Installation
    13. Install Jupyter Notebooks
    14. The Spark Web UI
    15. Section Summary
  2. Chapter 2 : Spark Execution Concepts
    1. Section Introduction
    2. Spark Application and Session
    3. Spark Transformations and Actions Part 1
    4. Spark Transformations and Actions Part 2
    5. DAG Visualisation
  3. Chapter 3 : RDD Crash Course
    1. Introduction to RDDs
    2. Data Preparation
    3. Distinct and Filter Transformations
    4. Map and Flat Map Transformations
    5. SortByKey Transformations
    6. RDD Actions
    7. Challenge - Convert Fahrenheit to Centigrade
    8. Challenge - XYZ Research
    9. Challenge - XYZ Research Part 1
    10. Challenge - XYZ Research Part 2
  4. Chapter 4 : Structured API - Spark DataFrame
    1. Structured APIs Introduction
    2. Preparing the Project Folder
    3. PySpark DataFrame, Schema, and DataTypes
    4. DataFrame Reader and Writer
    5. Challenge Part 1 – Brief
    6. Challenge Part 1 - Data Preparation
    7. Working with Structured Operations
    8. Managing Performance Errors
    9. Reading a JSON File
    10. Columns and Expressions
    11. Filter and Where Conditions
    12. Distinct Drop Duplicates Order By
    13. Rows and Union
    14. Adding, Renaming, and Dropping Columns
    15. Working with Missing or Bad Data
    16. Working with User-Defined Functions
    17. Challenge Part 2 – Brief
    18. Challenge Part 2 - Remove Null Row and Bad Records
    19. Challenge Part 2 - Get the City and State
    20. Challenge Part 2 - Rearrange the Schema
    21. Challenge Part 2 - Write Partitioned DataFrame to Parquet
    22. Aggregations
    23. Aggregations - Setting Up Flight Summary Data
    24. Aggregations - Count and Count Distinct
    25. Aggregations - Min Max Sum SumDistinct AVG
    26. Aggregations with Grouping
    27. Challenge Part 3 – Brief
    28. Challenge Part 3 - Prepare 2019 Data
    29. Challenge Part 3 - Q1 Get the Best Sales Month
    30. Challenge Part 3 - Q2 Get the City that Sold the Most Products
    31. Challenge Part 3 - Q3 When to Advertise
    32. Challenge Part 3 - Q4 Products Bought Together
  5. Chapter 5 : Introduction to Spark SQL and Databricks
    1. Introduction to Databricks
    2. Spark SQL Introduction
    3. Register Account on Databricks
    4. Create a Databricks Cluster
    5. Creating our First 2 Databricks Notebooks
    6. Reading CSV Files into DataFrame
    7. Creating a Database and Table
    8. Inserting Records into a Table
    9. Exposing Bad Records
    10. Figuring out How to Remove Bad Records
    11. Extract the City and State
    12. Inserting Records to Final Sales Table
    13. What was the Best Month in Sales?
    14. Get the City that Sold the Most Products
    15. Get the Right Time to Advertise
    16. Get the Most Products Sold Together
    17. Create a Dashboard
    18. Summary

Product information

  • Title: Apache Spark 3 for Data Engineering and Analytics with Python
  • Author(s): David Mngadi
  • Release date: August 2021
  • Publisher(s): Packt Publishing
  • ISBN: 9781803244303