Streaming Big Data with Spark Streaming, Scala, and Spark 3!

Video description

In this course, you will learn the basics of the Scala programming language and how Apache Spark operates on a cluster. You will set up discretized streams with Spark Streaming and transform them as data is received, analyze streaming data over sliding windows of time, and maintain stateful information across streams of data. You will connect Spark Streaming with highly scalable sources of data, including Kafka, Flume, and Kinesis; dump streams of data in real time to NoSQL databases such as Cassandra; and run SQL queries on streamed data in real time. You will also train machine learning models in real time with streaming data and use them to make predictions that keep getting better over time. Finally, you will package and deploy self-contained Spark Streaming code and run it on a real Hadoop cluster using Amazon Elastic MapReduce.

This course is very hands-on, filled with achievable activities and exercises to reinforce your learning. By the end of this course, you will be confidently creating Spark Streaming scripts in Scala and be prepared to tackle massive streams of data in a whole new way. You will be surprised at how easy Spark Streaming makes it!

What You Will Learn

  • Process large amounts of real-time data using the Spark Streaming module
  • Create efficient Spark applications using the Scala programming language
  • Integrate Spark Streaming with various data sources
  • Integrate Spark Streaming with Spark SQL to query your data in real time
  • Train machine learning models with streaming data, and use them for real-time predictions
  • Maintain stateful data across a continuous stream of input data

Audience

If you are a student who wants to learn how to use Apache Spark or a big data professional who wants to process large amounts of data in real time, this course is for you. Some basic programming and scripting experience is required to get the most out of the course.

About The Author

Frank Kane spent nine years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers. He holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology and teaches others about big data analysis.

Table of contents

  1. Chapter 1: Getting Started
    1. Introduction, and Getting Set Up
    2. [Activity] Stream Live Tweets with Spark Streaming!
  2. Chapter 2: A Crash Course in Scala
    1. [Activity] Scala Basics
    2. [Exercise] Flow Control in Scala
    3. [Exercise] Functions in Scala
    4. [Exercise] Data Structures in Scala
  3. Chapter 3: Spark Streaming Concepts
    1. Introduction to Spark
    2. The Resilient Distributed Dataset (RDD)
    3. [Activity] RDDs in Action: Simple Word Count Application
    4. Introduction to Spark Streaming
    5. [Activity] Revisiting the PrintTweets application
    6. Windowing: Aggregating data over longer time spans
    7. Fault Tolerance in Spark Streaming
  4. Chapter 4: Spark Streaming Examples with Twitter
    1. [Exercise] Saving Tweets to Disk
    2. [Exercise] Tracking the Average Tweet Length
    3. [Exercise] Tracking the Most Popular Hashtags
  5. Chapter 5: Spark Streaming Examples with Clickstream / Apache Access Log Data
    1. [Exercise] Tracking the Top URLs Requested
    2. [Exercise] Alarming on Log Errors
    3. [Exercise] Integrating Spark Streaming with Spark SQL
    4. Introduction to Structured Streaming
    5. [Activity] Analyzing Apache Log files with Structured Streaming
  6. Chapter 6: Integrating with Other Systems
    1. Integrating with Apache Kafka
    2. Integrating with Apache Flume
    3. Integrating with Amazon Kinesis
    4. [Activity] Writing Custom Data Receivers
    5. Integrating with Cassandra
  7. Chapter 7: Advanced Spark Streaming Examples
    1. [Exercise] Stateful Information in Spark Streams
    2. [Activity] Streaming K-Means Clustering
    3. [Activity] Streaming Linear Regression
  8. Chapter 8: Spark Streaming in Production
    1. [Activity] Packaging and Running Spark Code in Production
    2. [Activity] Packaging Your Code with SBT
    3. Running on a Real Hadoop Cluster with EMR
    4. Troubleshooting and Tuning Spark Jobs
  9. Chapter 9: You Made It!
    1. Learning More

Product information

  • Title: Streaming Big Data with Spark Streaming, Scala, and Spark 3!
  • Author(s): Frank Kane
  • Release date: August 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781787123915