PySpark and AWS: Master Big Data with PySpark and AWS

Video description

The hottest buzzwords in the Big Data analytics industry are Python and Apache Spark, and PySpark brings the two together as the Python API for Apache Spark. In this course, you’ll start from the basics and work your way up to advanced data analysis. From cleaning data to building features and implementing machine learning (ML) models, you’ll learn how to execute end-to-end workflows using PySpark.

Throughout the course, you’ll be using PySpark to perform data analysis. You’ll explore Spark RDDs, DataFrames, and a bit of Spark SQL, along with the transformations and actions that can be performed on data through RDDs and DataFrames. You’ll also explore the Spark and Hadoop ecosystems and their underlying architecture, and you’ll get familiar with the Databricks environment and use it to run your Spark scripts.
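For a flavour of the kind of code the course works with, here is a minimal, hedged sketch showing an RDD transformation and action, the same query as a DataFrame, and a Spark SQL equivalent. The dataset and column names are made up for this illustration and are not taken from the course material.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("intro-sketch").getOrCreate()
sc = spark.sparkContext

# RDD: transformations (filter, map) are lazy; the action (collect) triggers execution.
marks = sc.parallelize([("alice", 72), ("bob", 48), ("carol", 91)])
passed = marks.filter(lambda kv: kv[1] >= 50).map(lambda kv: kv[0]).collect()
print(passed)  # ['alice', 'carol']

# DataFrame: the same query expressed with columns.
df = spark.createDataFrame(marks, ["name", "mark"])
df.filter(df.mark >= 50).select("name").show()

# Spark SQL: register the DataFrame as a temporary view and query it.
df.createOrReplaceTempView("marks")
spark.sql("SELECT name FROM marks WHERE mark >= 50").show()

spark.stop()
```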

Finally, you’ll get a taste of Spark on the AWS cloud. You’ll see how to leverage AWS storage, database, and compute services, and how Spark can communicate with different AWS services to get the data it needs.
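As a flavour of the AWS side, here is a hedged sketch of Spark reading a file from S3 into a DataFrame. The bucket name and file path are hypothetical, and the s3a:// scheme assumes the hadoop-aws connector and AWS credentials are already configured (for example via an instance role or environment variables).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read-sketch").getOrCreate()

# Read a CSV stored in S3 into a DataFrame, letting Spark infer the schema.
# "my-example-bucket" and "students.csv" are placeholders, not real course resources.
df = spark.read.csv("s3a://my-example-bucket/students.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)

spark.stop()
```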

By the end of this course, you’ll be able to understand and implement the concepts of PySpark and AWS to solve real-world problems.

What You Will Learn

  • Learn the importance of Big Data
  • Explore the Spark and Hadoop architecture and ecosystem
  • Learn about PySpark DataFrames and DataFrame actions
  • Use PySpark DataFrame transformations
  • Apply collaborative filtering to develop a recommendation system using ALS models (see the sketch below)
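As a flavour of that last point, here is a minimal sketch of collaborative filtering with Spark ML’s ALS estimator. The ratings are made up for illustration (the course uses its own dataset), and the hyperparameter values are placeholders rather than recommended settings.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.master("local[*]").appName("als-sketch").getOrCreate()

# Toy explicit ratings: (userId, movieId, rating). Far too small to be meaningful.
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 2.0), (2, 11, 3.0), (2, 12, 4.0)],
    ["userId", "movieId", "rating"],
)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# coldStartStrategy="drop" avoids NaN predictions for users/items unseen during training.
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=5, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(train)

predictions = model.transform(test)
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(predictions)
print(f"RMSE: {rmse:.3f}")  # with data this small the metric is only illustrative

# Top-3 item recommendations for every user.
model.recommendForAllUsers(3).show(truncate=False)

spark.stop()
```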

Audience

This course requires Python programming experience as a prerequisite.

About The Author

AI Sciences: AI Sciences is a team of experts, PhDs, and artificial intelligence practitioners with backgrounds in computer science, machine learning, and statistics. Some of them work at big companies such as Amazon, Google, Facebook, Microsoft, KPMG, BCG, and IBM.

AI Sciences produces a series of courses dedicated to beginners and newcomers, covering techniques and methods of machine learning, statistics, artificial intelligence, and data science. The aim is to help those who wish to understand these techniques more easily and to start with less theory and less extended reading. Today, they also publish more comprehensive courses on specific topics for wider audiences.

Their courses have successfully helped more than 100,000 students master AI and data science.

Table of contents

  1. Chapter 1 : Introduction
    1. Why Big Data
    2. Applications of PySpark
    3. Introduction to Instructor
    4. Introduction to Course
    5. Projects Overview
  2. Chapter 2 : Introduction to Hadoop, Spark Ecosystems and Architectures
    1. Why Spark
    2. Hadoop Ecosystem
    3. Spark Architecture and Ecosystem
    4. Databricks Sign Up
    5. Create Databricks Notebook
    6. Download Spark and Dependencies
    7. Java Setup
    8. Python Setup
    9. Spark Setup
    10. Hadoop Setup
    11. Running Spark
  3. Chapter 3 : Spark RDDs
    1. Spark RDDs
    2. Creating Spark RDD
    3. Running Spark Code Locally
    4. RDD Map (Lambda)
    5. RDD Map (Simple Function)
    6. Quiz (Map)
    7. Solution 1 (Map)
    8. Solution 2 (Map)
    9. RDD FlatMap
    10. RDD Filter
    11. Quiz (Filter)
    12. Solution (Filter)
    13. RDD Distinct
    14. RDD GroupByKey
    15. RDD ReduceByKey
    16. Quiz (Word Count)
    17. Solution (Word Count)
    18. RDD (Count and CountByValue)
    19. RDD (saveAsTextFile)
    20. RDD (Partition)
    21. Finding Average-1
    22. Finding Average-2
    23. Quiz (Average)
    24. Solution (Average)
    25. Finding Min and Max
    26. Quiz (Min and Max)
    27. Solution (Min and Max)
    28. Project Overview
    29. Total Students
    30. Total Marks by Male and Female Student
    31. Total Passed and Failed Students
    32. Total Enrollments per Course
    33. Total Marks per Course
    34. Average Marks per Course
    35. Finding Minimum and Maximum Marks
    36. Average Age of Male and Female Students
  4. Chapter 4 : Spark DFs
    1. Introduction to Spark DFs
    2. Creating Spark DFs
    3. Spark Infer Schema
    4. Spark Provide Schema
    5. Create DF from RDD
    6. Rectifying the Error
    7. Select DF Columns
    8. Spark DF with Column
    9. Spark DF with Column Renamed and Alias
    10. Spark DF Filter Rows
    11. Quiz (Select, WithColumn, Filter)
    12. Solution (Select, WithColumn, Filter)
    13. Spark DF (Count, Distinct, Duplicate)
    14. Quiz (Distinct, Duplicate)
    15. Solution (Distinct, Duplicate)
    16. Spark DF (Sort, OrderBy)
    17. Quiz (Sort, OrderBy)
    18. Solution (Sort, OrderBy)
    19. Spark DF (Group By)
    20. Spark DF (Group By - Multiple Columns and Aggregations)
    21. Spark DF (Group By - Visualization)
    22. Spark DF (Group By - Filtering)
    23. Quiz (Group By)
    24. Solution (Group By)
    25. Quiz (Word Count)
    26. Solution (Word Count)
    27. Spark DF (UDFs)
    28. Quiz (UDFs)
    29. Solution (UDFs)
    30. Solution (Cache and Persist)
    31. Spark DF (DF to RDD)
    32. Spark DF (Spark SQL)
    33. Spark DF (Write DF)
    34. Project Overview
    35. Project (Count and Select)
    36. Project (Group By)
    37. Project (Group By, Aggregations and Order By)
    38. Project (Filtering)
    39. Project (UDF and WithColumn)
    40. Project (Write)
  5. Chapter 5 : Collaborative Filtering
    1. Collaborative Filtering
    2. Utility Matrix
    3. Explicit and Implicit Ratings
    4. Expected Results
    5. Dataset
    6. Joining Dataframes
    7. Train and Test Data
    8. ALS Model
    9. Hyperparameter Tuning and Cross Validation
    10. Best Model and Evaluate Predictions
    11. Recommendations
  6. Chapter 6 : Spark Streaming
    1. Introduction to Spark Streaming
    2. Spark Streaming with RDD
    3. Spark Streaming Context
    4. Spark Streaming Reading Data
    5. Spark Streaming Cluster Restart
    6. Spark Streaming RDD Transformations
    7. Spark Streaming DF
    8. Spark Streaming Display
    9. Spark Streaming DF Aggregations
  7. Chapter 7 : ETL Pipeline
    1. Introduction to ETL
    2. ETL Pipeline Flow
    3. Data Set
    4. Extracting Data
    5. Transforming Data
    6. Loading Data (Creating RDS-I)
    7. Loading Data (Creating RDS-II)
    8. RDS Networking
    9. Downloading Postgres
    10. Installing Postgres
    11. Connect to RDS Through PGAdmin
    12. Loading Data
  8. Chapter 8 : Project - Change Data Capture / Replication Ongoing
    1. Introduction to Project
    2. Project Architecture
    3. Creating RDS MySQL Instance
    4. Creating S3 Bucket
    5. Creating DMS Source Endpoint
    6. Creating DMS Destination Endpoint
    7. Creating DMS Instance
    8. MySQL WorkBench
    9. Connecting with RDS and Dumping Data
    10. Querying RDS
    11. DMS Full Load
    12. DMS Replication Ongoing
    13. Stopping Instances
    14. Glue Job (Full Load)
    15. Glue Job (Change Capture)
    16. Glue Job (CDC)
    17. Creating Lambda Function and Adding Trigger
    18. Checking Trigger
    19. Getting S3 File Name in Lambda
    20. Creating Glue Job
    21. Adding Invoke for Glue Job
    22. Testing Invoke
    23. Writing Glue Shell Job
    24. Full Load Pipeline
    25. Change Data Capture Pipeline

Product information

  • Title: PySpark and AWS: Master Big Data with PySpark and AWS
  • Author(s): AI Sciences
  • Release date: September 2021
  • Publisher(s): Packt Publishing
  • ISBN: 9781803236698