Video description
Python and Apache Spark are among the hottest technologies in the Big Data analytics industry, and PySpark brings the two together. In this course, you'll start from the basics and progress to advanced levels of data analysis. From cleaning data to building features and implementing machine learning (ML) models, you'll learn how to execute end-to-end workflows using PySpark.
Throughout the course, you'll use PySpark to perform data analysis. You'll explore Spark RDDs, DataFrames, and Spark SQL queries, along with the transformations and actions that can be performed on data through RDDs and DataFrames. You'll also study the Spark and Hadoop ecosystems and their underlying architecture, and you'll use the Databricks environment to run your Spark scripts.
Finally, you'll get a taste of Spark with the AWS cloud. You'll see how to leverage AWS storage, database, and compute services, and how Spark can communicate with different AWS services to retrieve the data it needs.
By the end of this course, you’ll be able to understand and implement the concepts of PySpark and AWS to solve real-world problems.
What You Will Learn
- Learn the importance of Big Data
- Explore the Spark and Hadoop architecture and ecosystem
- Learn about PySpark DataFrames and DataFrame actions
- Use PySpark DataFrames transformations
- Apply collaborative filtering to develop a recommendation system using ALS models
Audience
This course requires Python programming experience as a prerequisite.
About The Author
AI Sciences: AI Sciences is a team of experts, PhDs, and artificial intelligence practitioners with backgrounds in computer science, machine learning, and statistics. Some of them work at major companies such as Amazon, Google, Facebook, Microsoft, KPMG, BCG, and IBM.
AI Sciences produces a series of courses dedicated to beginners and newcomers, covering techniques and methods of machine learning, statistics, artificial intelligence, and data science. These courses aim to help learners understand the techniques more easily, with less theory and less extended reading. Today, the team also publishes more comprehensive courses on specific topics for wider audiences.
Their courses have successfully helped more than 100,000 students master AI and data science.
Table of contents
- Chapter 1 : Introduction
- Chapter 2 : Introduction to Hadoop, Spark Ecosystems and Architectures
- Chapter 3 : Spark RDDs
  - Spark RDDs
  - Creating Spark RDD
  - Running Spark Code Locally
  - RDD Map (Lambda)
  - RDD Map (Simple Function)
  - Quiz (Map)
  - Solution 1 (Map)
  - Solution 2 (Map)
  - RDD FlatMap
  - RDD Filter
  - Quiz (Filter)
  - Solution (Filter)
  - RDD Distinct
  - RDD GroupByKey
  - RDD ReduceByKey
  - Quiz (Word Count)
  - Solution (Word Count)
  - RDD (Count and CountByValue)
  - RDD (saveAsTextFile)
  - RDD (Partition)
  - Finding Average-1
  - Finding Average-2
  - Quiz (Average)
  - Solution (Average)
  - Finding Min and Max
  - Quiz (Min and Max)
  - Solution (Min and Max)
  - Project Overview
  - Total Students
  - Total Marks by Male and Female Students
  - Total Passed and Failed Students
  - Total Enrollments per Course
  - Total Marks per Course
  - Average Marks per Course
  - Finding Minimum and Maximum Marks
  - Average Age of Male and Female Students
- Chapter 4 : Spark DFs
  - Introduction to Spark DFs
  - Creating Spark DFs
  - Spark Infer Schema
  - Spark Provide Schema
  - Create DF from RDD
  - Rectifying the Error
  - Select DF Columns
  - Spark DF with Column
  - Spark DF with Column Renamed and Alias
  - Spark DF Filter Rows
  - Quiz (Select, WithColumn, Filter)
  - Solution (Select, WithColumn, Filter)
  - Spark DF (Count, Distinct, Duplicate)
  - Quiz (Distinct, Duplicate)
  - Solution (Distinct, Duplicate)
  - Spark DF (Sort, OrderBy)
  - Quiz (Sort, OrderBy)
  - Solution (Sort, OrderBy)
  - Spark DF (Group By)
  - Spark DF (Group By - Multiple Columns and Aggregations)
  - Spark DF (Group By - Visualization)
  - Spark DF (Group By - Filtering)
  - Quiz (Group By)
  - Solution (Group By)
  - Quiz (Word Count)
  - Solution (Word Count)
  - Spark DF (UDFs)
  - Quiz (UDFs)
  - Solution (UDFs)
  - Solution (Cache and Persist)
  - Spark DF (DF to RDD)
  - Spark DF (Spark SQL)
  - Spark DF (Write DF)
  - Project Overview
  - Project (Count and Select)
  - Project (Group By)
  - Project (Group By, Aggregations and Order By)
  - Project (Filtering)
  - Project (UDF and WithColumn)
  - Project (Write)
- Chapter 5 : Collaborative Filtering
- Chapter 6 : Spark Streaming
- Chapter 7 : ETL Pipeline
- Chapter 8 : Project - Change Data Capture / Replication Ongoing
  - Introduction to Project
  - Project Architecture
  - Creating RDS MySQL Instance
  - Creating S3 Bucket
  - Creating DMS Source Endpoint
  - Creating DMS Destination Endpoint
  - Creating DMS Instance
  - MySQL Workbench
  - Connecting with RDS and Dumping Data
  - Querying RDS
  - DMS Full Load
  - DMS Replication Ongoing
  - Stopping Instances
  - Glue Job (Full Load)
  - Glue Job (Change Capture)
  - Glue Job (CDC)
  - Creating Lambda Function and Adding Trigger
  - Checking Trigger
  - Getting S3 File Name in Lambda
  - Creating Glue Job
  - Adding Invoke for Glue Job
  - Testing Invoke
  - Writing Glue Shell Job
  - Full Load Pipeline
  - Change Data Capture Pipeline
Product information
- Title: PySpark and AWS: Master Big Data with PySpark and AWS
- Author(s): AI Sciences
- Release date: September 2021
- Publisher(s): Packt Publishing
- ISBN: 9781803236698