Discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs
About This Video
- Speed up your Spark jobs by reducing shuffles
- Leverage the Key/Value API in your big data processing to make your jobs work faster with lower network traffic
- Test Spark jobs using unit, integration, and end-to-end techniques to make your data pipeline robust and bulletproof
Apache Spark has been around for quite some time, but do you really know how to get the most out of it? This course explores many aspects of Spark, including some you may never have heard of.
In this course, you'll learn practical, proven techniques for improving particular aspects of programming and administration in Apache Spark. You will work through 7 sections that address different aspects of Spark via 5 specific techniques, with clear, hands-on instructions for carrying out different Apache Spark tasks. The techniques are demonstrated using practical examples and best practices.
By the end of this course, you will have learned tips, best practices, and techniques for working with Apache Spark. You will be able to perform tasks and get the data you need out of your databases faster and with less effort.
All the code and supporting files for this course are available on GitHub at https://github.com/PacktPublishing/Apache-Spark-Tips-Tricks-Techniques
Downloading the example code for this course: You can download the example code files for all Packt video courses you have purchased from your account at http://www.PacktPub.com. If you purchased this course elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.
Table of Contents
Chapter 1 : Transformations and Actions
- The Course Overview 00:02:18
- Using Spark Transformations to Defer Computations to a Later Time 00:04:32
- Avoiding Transformations 00:03:52
- Using reduce and reduceByKey to Calculate Results 00:05:07
- Performing Actions That Trigger Computations 00:05:17
- Reusing the Same RDD for Different Actions 00:03:42
Chapter 2 : Immutable Design
Chapter 3 : Avoid Shuffle and Reduce Operational Expenses
Chapter 4 : Saving Data in the Correct Format
Chapter 5 : Working with Spark Key/Value API
Chapter 6 : Testing Apache Spark Jobs
Chapter 7 : Leveraging Spark GraphX API
- Title: Apache Spark: Tips, Tricks, & Techniques
- Release date: November 2018
- Publisher(s): Packt Publishing
- ISBN: 9781789801125