on-demand course

Apache Spark with Scala – Hands-On with Big Data!

with Frank Kane

October 2022

Beginner to intermediate

8h 56m

English

Packt Publishing

Closed Captioning available in English

Includes

Badge

Course outline

Introduction and Installing the Course Materials, IntelliJ, and Scala
15m 45s
Introduction to Apache Spark
14m 27s
(Activity) Scala Basics
25m 49s
(Exercise) Flow Control in Scala
9m 31s
(Exercise) Functions in Scala
9m 10s
(Exercise) Data Structures in Scala
22m 31s
The Resilient Distributed Dataset
11m 32s
Ratings Histogram Example
11m 28s
Spark Internals
1m 59s
Key / Value RDDs, and the Average Friends by Age Example
10m 42s
(Activity) Running the Average Friends by Age Example
4m 53s
Filtering RDDs, and the Minimum Temperature by Location Example
5m 57s
(Activity) Running the Minimum Temperature Example, and Modifying It for Maximum
11m 38s
(Activity) Counting Word Occurrences Using flatMap()
5m 49s
(Activity) Improving the Word Count Script with Regular Expressions
3m 47s
(Activity) Sorting the Word Count Results
6m 38s
(Exercise) Find the Total Amount Spent by Customer
4m 33s
(Exercise) Check Your Results and Sort Them by Total Amount Spent
5m 12s
Check Your Results and Implementation Against Mine
3m 2s
Introduction to SparkSQL
9m 47s
(Activity) Using SparkSQL
7m 8s
(Activity) Using DataSets
8m 36s
(Exercise) Implement the Friends by Age Example Using DataSets
2m 42s
Exercise Solution: Friends by Age, with DataSets
7m 25s
(Activity) Word Count Example Using DataSets
10m 40s
(Activity) Revisiting the Minimum Temperature Example, with DataSets
9m 3s
(Exercise) Implement the Total Spent by Customer Problem with DataSets
2m 12s
Exercise Solution: Total Spent by Customer with DataSets
6m 31s
(Activity) Find the Most Popular Movie
5m 26s
(Activity) Use Broadcast Variables to Display Movie Names
11m 22s
(Activity) Find the Most Popular Superhero in a Social Graph
12m 20s
(Exercise) Find the Most Obscure Superheroes
5m 17s
Exercise Solution: Find the Most Obscure Superheroes
6m 46s
Superhero Degrees of Separation: Introducing Breadth-First Search
7m 18s
Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark
8m 2s
(Activity) Superhero Degrees of Separation: Review the Code and Run It!
12m 58s
Item-Based Collaborative Filtering in Spark, cache(), and persist()
8m 2s
(Activity) Running the Similar Movies Script Using Spark's Cluster Manager
14m 51s
(Exercise) Improve the Quality of Similar Movies
3m 57s
(Activity) Using spark-submit to Run Spark Driver Scripts
11m 46s
(Activity) Packaging Driver Scripts with SBT
15m 9s
(Exercise) Package a Script with SBT and Run It Locally with spark-submit
2m 6s
Exercise Solution: Using SBT and spark-submit
9m 7s
Introducing Amazon Elastic MapReduce
7m 14s
Creating Similar Movies from One Million Ratings on EMR
11m 35s
Partitioning
4m 20s
Best Practices for Running on a Cluster
6m 28s
Troubleshooting and Managing Dependencies
11m 1s
Introducing MLLib
9m 57s
(Activity) Using MLLib to Produce Movie Recommendations
12m 44s
Linear Regression with MLLib
7m 1s
(Activity) Running a Linear Regression with Spark
7m 49s
(Exercise) Predict Real Estate Values with Decision Trees in Spark
4m 58s
Exercise Solution: Predicting Real Estate with Decision Trees in Spark
5m 50s
The DStream API for Spark Streaming
11m 29s
(Activity) Real-Time Monitoring of the Most Popular Hashtags on Twitter
8m 53s
Structured Streaming
4m 6s
(Activity) Using Structured Streaming for Real-Time Log Analysis
5m 35s
(Exercise) Windowed Operations with Structured Streaming
6m 6s
Exercise Solution: Top URLs in a 30-Second Window
5m 46s
GraphX, Pregel, and Breadth-First Search with Pregel
6m 53s
Using the Pregel API with Spark GraphX
4m 32s
(Activity) Superhero Degrees of Separation Using GraphX
7m 10s
Learning More, and Career Tips
4m 19s

Overview

In this nearly nine-hour course, you will learn to analyze large datasets using Apache Spark with Scala, starting from the basics and working up to hands-on projects. Through practical exercises and examples, you will work with RDDs, DataFrames, and Amazon EMR to process big data efficiently.

What I will be able to do after this course

Master the fundamentals of Apache Spark and Scala for big data processing.
Develop distributed data processing scripts using Spark's RDD and DataFrame APIs.
Gain hands-on experience deploying Spark applications on a Hadoop cluster using Amazon EMR.
Learn to implement real-time data analysis with Spark Streaming and machine learning workflows with MLlib.
Understand the application of Spark for handling and analyzing large-scale graph data.

Course Instructor(s)

Frank Kane, an ex-Amazon and IMDb senior engineer, brings his expertise in scalable systems and big data to this course. With years of teaching experience, he simplifies complex topics into manageable lessons, blending theoretical insights with practical implementation.

Who is it for?

This course is for software developers with some programming experience who want to broaden their expertise in big data processing. It suits engineers seeking hands-on practice with Spark and Scala, particularly those who need to build scalable data workflows for professional applications.

Practical Apache Spark: Using the Scala API

Publisher Resources

ISBN: 9781787129849

Chapter 1: Getting Started

Chapter 2: Scala Crash Course (Optional)

Chapter 3: Using Resilient Distributed Datasets (RDDs)

Chapter 4: SparkSQL, DataFrames, and DataSets

Chapter 5: Advanced Examples of Spark Programs

Chapter 6: Running Spark on a Cluster

Chapter 7: Machine Learning with Spark ML

Chapter 8: Introduction to Spark Streaming

Chapter 9: Introduction to GraphX

Chapter 10: You Made It! Where to Go from Here