Apache Spark 3 for Data Engineering and Analytics with Python

Video description

Apache Spark 3 is an open-source distributed engine for querying and processing data. This course will provide you with a detailed understanding of PySpark and its stack, and is carefully designed to guide you through the process of data analytics using Spark with Python. The author takes an interactive approach to explaining key concepts of PySpark, such as the Spark architecture, Spark execution, transformations and actions using the structured API, and much more. You will be able to leverage the power of Python, Java, and SQL and put them to use in the Spark ecosystem.
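
To make the transformation-versus-action distinction concrete, here is a minimal PySpark sketch; the data, column names, and app name are illustrative and are not taken from the course:

```python
from pyspark.sql import SparkSession

# Entry point to the structured API: a local Spark session.
spark = SparkSession.builder.appName("intro-sketch").master("local[*]").getOrCreate()

# A small in-memory DataFrame; the schema is inferred from the Python tuples.
sales = spark.createDataFrame(
    [("Laptop", 1200.0), ("Monitor", 300.0), ("Mouse", 25.0)],
    ["product", "price"],
)

# Transformation: lazily describes a new DataFrame; nothing executes yet.
expensive = sales.filter(sales.price > 100)

# Action: triggers an actual Spark job and returns a result to the driver.
print(expensive.count())  # 2

spark.stop()
```

Because transformations are lazy, no job appears in the Spark web UI until an action such as count() runs.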

You will start by gaining a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark. You will then move on to techniques for collecting, cleaning, and visualizing data by creating dashboards in Databricks, and learn how to use SQL to interact with DataFrames. The author also provides an in-depth review of RDDs and contrasts them with DataFrames.
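
As a small taste of the SQL-on-DataFrames workflow, a DataFrame can be registered as a temporary view and then queried with plain SQL. The sketch below uses made-up sales data rather than the course's own dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [("New York", 250.0), ("Boston", 120.0), ("New York", 75.0)],
    ["city", "amount"],
)

# Register the DataFrame as a temporary view so it can be queried with SQL.
sales.createOrReplaceTempView("sales")

# The SQL query and the equivalent DataFrame API calls compile to the same plan.
spark.sql("SELECT city, SUM(amount) AS total FROM sales GROUP BY city").show()

spark.stop()
```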

Multiple problem-solving challenges are provided at intervals throughout the course so that you get a firm grasp of the concepts taught.

What You Will Learn

  • Learn Spark architecture, transformations, and actions using the structured API
  • Learn to set up your own local PySpark environment
  • Learn to interpret the DAG (Directed Acyclic Graph) for Spark execution
  • Learn to interpret the Spark web UI
  • Learn the RDD (Resilient Distributed Datasets) API (a short sketch follows this list)
  • Learn to visualize (graphs and dashboards) data on Databricks
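
For orientation, the lower-level RDD API mentioned above works directly on distributed Python objects rather than on named columns. The minimal sketch below echoes the Fahrenheit-to-Centigrade challenge from the RDD chapter, but the data and app name are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext  # RDD operations go through the SparkContext

# Parallelize a small list of (city, temperature in Fahrenheit) pairs.
temps = sc.parallelize([("London", 68.0), ("Cairo", 95.0), ("Oslo", 41.0)])

# Transformations: convert to Centigrade, then keep only the warm readings.
centigrade = temps.mapValues(lambda f: (f - 32) * 5.0 / 9.0)
warm = centigrade.filter(lambda pair: pair[1] > 15.0)

# Action: collect the results back to the driver.
print(warm.collect())  # [('London', 20.0), ('Cairo', 35.0)]

spark.stop()
```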

Audience

This course is designed for Python developers who wish to learn how to use the language for data engineering and analytics with PySpark. It will also benefit aspiring data engineering and analytics professionals, data scientists and analysts who wish to learn an analytical processing strategy that can be deployed over a big data cluster, and data managers who want to gain a deeper understanding of managing data over a cluster.

About The Author

David Mngadi: David Mngadi is a data management professional who is inspired by the power of data in our lives and has helped several companies become more data-driven to gain a competitive edge as well as meet regulatory requirements. Over the last 15 years, he has had the pleasure of designing and implementing data warehousing solutions in the retail, telco, and banking industries, and more recently in big data lake implementations. He is passionate about technology and teaching programming online.

Table of contents

  1. Chapter 1 : Introduction to Spark and Installation
    1. Introduction
    2. The Spark Architecture
    3. The Spark Unified Stack
    4. Java Installation
    5. Hadoop Installation
    6. Python Installation
    7. PySpark Installation
    8. Install Microsoft Build Tools
    9. MacOS - Java Installation
    10. MacOS - Python Installation
    11. MacOS - PySpark Installation
    12. MacOS - Testing the Spark Installation
    13. Install Jupyter Notebooks
    14. The Spark Web UI
    15. Section Summary
  2. Chapter 2 : Spark Execution Concepts
    1. Section Introduction
    2. Spark Application and Session
    3. Spark Transformations and Actions Part 1
    4. Spark Transformations and Actions Part 2
    5. DAG Visualisation
  3. Chapter 3 : RDD Crash Course
    1. Introduction to RDDs
    2. Data Preparation
    3. Distinct and Filter Transformations
    4. Map and Flat Map Transformations
    5. SortByKey Transformations
    6. RDD Actions
    7. Challenge - Convert Fahrenheit to Centigrade
    8. Challenge - XYZ Research
    9. Challenge - XYZ Research Part 1
    10. Challenge - XYZ Research Part 2
  4. Chapter 4 : Structured API - Spark DataFrame
    1. Structured APIs Introduction
    2. Preparing the Project Folder
    3. PySpark DataFrame, Schema, and DataTypes
    4. DataFrame Reader and Writer
    5. Challenge Part 1 – Brief
    6. Challenge Part 1 - Data Preparation
    7. Working with Structured Operations
    8. Managing Performance Errors
    9. Reading a JSON File
    10. Columns and Expressions
    11. Filter and Where Conditions
    12. Distinct Drop Duplicates Order By
    13. Rows and Union
    14. Adding, Renaming, and Dropping Columns
    15. Working with Missing or Bad Data
    16. Working with User-Defined Functions
    17. Challenge Part 2 – Brief
    18. Challenge Part 2 - Remove Null Row and Bad Records
    19. Challenge Part 2 - Get the City and State
    20. Challenge Part 2 - Rearrange the Schema
    21. Challenge Part 2 - Write Partitioned DataFrame to Parquet
    22. Aggregations
    23. Aggregations - Setting Up Flight Summary Data
    24. Aggregations - Count and Count Distinct
    25. Aggregations - Min Max Sum SumDistinct AVG
    26. Aggregations with Grouping
    27. Challenge Part 3 – Brief
    28. Challenge Part 3 - Prepare 2019 Data
    29. Challenge Part 3 - Q1 Get the Best Sales Month
    30. Challenge Part 3 - Q2 Get the City that Sold the Most Products
    31. Challenge Part 3 - Q3 When to Advertise
    32. Challenge Part 3 - Q4 Products Bought Together
  5. Chapter 5 : Introduction to Spark SQL and Databricks
    1. Introduction to Databricks
    2. Spark SQL Introduction
    3. Register Account on Databricks
    4. Create a Databricks Cluster
    5. Creating our First 2 Databricks Notebooks
    6. Reading CSV Files into DataFrame
    7. Creating a Database and Table
    8. Inserting Records into a Table
    9. Exposing Bad Records
    10. Figuring out How to Remove Bad Records
    11. Extract the City and State
    12. Inserting Records to Final Sales Table
    13. What was the Best Month in Sales?
    14. Get the City that Sold the Most Products
    15. Get the Right Time to Advertise
    16. Get the Most Products Sold Together
    17. Create a Dashboard
    18. Summary

Product information

  • Title: Apache Spark 3 for Data Engineering and Analytics with Python
  • Author(s): David Mngadi
  • Release date: August 2021
  • Publisher(s): Packt Publishing
  • ISBN: 9781803244303