Overview
In this 3-hour course, learn how to harness the power of Apache Spark with Python using PySpark. Through practical, hands-on examples, you'll learn big data processing and analytics techniques, enabling you to build scalable applications and extract insights from diverse datasets.
What you will be able to do after this course
- Understand the architecture of Apache Spark and its components.
- Learn to work with Resilient Distributed Datasets (RDDs) and perform transformations and actions.
- Develop skills in processing structured and semi-structured data using Spark SQL and DataFrames.
- Explore optimization techniques such as caching and partitioning to enhance Spark application performance.
- Gain experience deploying and scaling Spark applications on platforms like Hadoop YARN and Amazon EMR.
Course Instructor
Your instructor, James Lee, is a seasoned software engineer with over a decade of experience in big data technologies. He specializes in simplifying complex subjects by creating engaging and accessible video tutorials. With a practical approach to teaching, James equips students with the skills and confidence to tackle real-world data challenges.
Who is it for?
This course is perfect for software engineers looking to develop their expertise in big data technologies, as well as data scientists and engineers aiming to enhance their data processing proficiency. It's suitable for learners with basic programming experience who aspire to work with Apache Spark and build hands-on, data-driven solutions. Aspiring big data professionals will find the content insightful and career-enriching.