Overview
In this 16-hour course, you'll learn how to harness the power of PySpark and AWS for big data projects. You'll cover everything from foundational concepts to advanced techniques, explore Spark's architecture, and implement data processing and machine learning workflows, all on the AWS cloud.
What I will be able to do after this course
- Understand the Spark and Hadoop architectures and ecosystems.
- Work with Spark RDDs and DataFrames for effective data processing (see the short sketch after this list).
- Integrate PySpark workflows with AWS services for cloud-based solutions.
- Build machine learning pipelines using PySpark for real-world applications.
- Develop ETL and data engineering solutions using PySpark on big data.
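To give a feel for the DataFrame work covered in the course, here is a minimal sketch of a typical PySpark aggregation. The dataset, column names, and app name are illustrative assumptions, not course material; in the course you would load real data (for example from S3) rather than building it inline.

```python
# A minimal, self-contained PySpark sketch: build a tiny DataFrame and aggregate it.
# All names and sample values below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("course-preview").getOrCreate()

# Stand-in for a real dataset loaded from S3 or HDFS.
orders = spark.createDataFrame(
    [("books", 12.99), ("books", 7.50), ("games", 59.99)],
    ["category", "price"],
)

# A typical transformation: group, aggregate, and sort with the DataFrame API.
(orders.groupBy("category")
       .agg(F.sum("price").alias("total_revenue"))
       .orderBy(F.desc("total_revenue"))
       .show())

spark.stop()
```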
Course Instructor(s)
Led by the experienced instructors at AI Sciences, this course aims to provide practical, hands-on programming expertise. The instructors combine academic knowledge with industry experience in data engineering to deliver a focused, project-based learning experience in big data technologies.
Who is it for?
This course is ideal for software engineers, data scientists, and aspiring big data engineers with prior experience in Python programming. Learners interested in gaining cloud computing expertise and understanding real-world applications of PySpark would greatly benefit from this course.