on-demand course

Mastering Big Data Analytics with PySpark

with Danny Meijer

June 2020

Beginner to intermediate

8h 7m

English

Packt Publishing

Closed Captioning available in English

Course outline

Course Overview (6m 51s)
Python versus Spark (10m 35s)
Preparing for the Course (6m 23s)
Connecting Jupyter to Spark (14m 58s)
Getting to Know Spark (7m 27s)
The Power of Spark (6m 39s)
The Power of Spark MLlib (6m 35s)
Spark DataFrames (10m 45s)
Spark Data Operations (11m 33s)
Loading Data from CSV Files (11m 12s)
Fixing Issues in Our Data – Part One (10m 51s)
Fixing Issues in Our Data – Part Two (10m 45s)
Grouping, Joining, and Aggregating – Part One (16m 8s)
Grouping, Joining, and Aggregating – Part Two (9m 3s)
Machine Learning with Spark (9m 51s)
Building a Recommendation System with Spark MLlib – Part One (11m 12s)
Building a Recommendation System with Spark MLlib – Part Two (11m 22s)
Building a Recommendation System with Spark MLlib – Part Three (16m 19s)
Finalizing our Recommendation System (15m 37s)
What We Have Learned So Far (10m 13s)
Machine Learning with Spark (21m 15s)
Machine Learning Pipelines (11m 25s)
Running a Logistic Regression Pipeline (11m 42s)
Parameters, Features, and Persistence (15m 28s)
Frequent Pattern Mining and Statistics (22m 0s)
Natural Language Processing with Spark (12m 17s)
Identifying Our Data (11m 38s)
Data Preparation and Exploration (11m 38s)
Creating Our Raw Training Data (10m 14s)
Data Preparation and Regular Expressions (15m 28s)
Data Cleaning and Transformation (19m 2s)
Training a Sentiment Analysis Model – Part One (15m 34s)
Training a Sentiment Analysis Model – Part Two (9m 33s)
Fetching Data from Twitter (6m 24s)
Spark Structured Streaming (11m 23s)
Managing and Converting Streams (12m 49s)
Assembling Our Streaming ML Solution (17m 5s)
A Structured Approach to ML Streaming (2m 19s)
Running Spark in Production (10m 44s)
Running Spark at Scale (10m 2s)
Tips, Tricks, and Take-Aways (14m 59s)

Overview

In this 8-hour course, you'll learn to use PySpark for efficient big data analytics and machine learning. By working with PySpark's libraries, you'll learn to process large-scale datasets and build scalable data pipelines.

What you will be able to do after this course

Understand and apply PySpark components such as Spark SQL and MLlib.
Develop scalable machine learning pipelines using PySpark.
Perform exploratory data analysis on large datasets confidently.
Apply effective strategies for data visualization in Jupyter with PySpark.
Build scalable solutions for big data challenges.

Course Instructor(s)

Danny Meijer is a seasoned expert in big data technologies and a passionate instructor. With years of hands-on experience in PySpark and a knack for making complex concepts accessible, Danny has empowered countless learners. His engaging style and emphasis on practical use cases offer a learning experience that is both enjoyable and effective.

Who is it for?

This course is ideal for data scientists who want to scale their analyses, data engineers aiming to refine big data workflows, and Python developers eager to dive into big data analytics. Learners should have a basic understanding of Python programming and some foundational knowledge of machine learning to benefit the most.

Publisher Resources

ISBN: 9781838640583

Chapter 1: Python and Spark: A Match Made in Heaven

Chapter 2: Working with PySpark

Chapter 3: Preparing Data Using Spark SQL

Chapter 4: Machine Learning with Spark MLlib

Chapter 5: Classification and Regression

Chapter 6: Analyzing Big Data

Chapter 7: Processing Natural Language in Spark

Chapter 8: Machine Learning in Real-Time

Chapter 9: The Power of PySpark