Apache Spark: Tips, Tricks, & Techniques

Video Description

Discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs

About This Video

  • Speed up your Spark jobs by reducing shuffles
  • Leverage the Key/Value API in your big data processing to make your jobs work faster with lower network traffic
  • Test Spark jobs using unit, integration, and end-to-end techniques to make your data pipeline robust and bulletproof

In Detail

Apache Spark has been around for quite some time, but do you really know how to get the most out of it? This course explores many aspects of Spark, including some you may never have heard of.

In this course, you'll learn to implement practical, proven techniques to improve particular aspects of programming and administration in Apache Spark. The course has 7 sections, each addressing a different aspect of Spark through 5 specific techniques, with clear, hands-on instructions for carrying out different Apache Spark tasks. The techniques are demonstrated using practical examples and best practices.

By the end of this course, you will have learned tips, best practices, and techniques for Apache Spark. You will be able to perform tasks and get the best data out of your databases faster and more easily.

All the code and supporting files for this course are available on GitHub at https://github.com/PacktPublishing/Apache-Spark-Tips-Tricks-Techniques

Downloading the example code for this course: You can download the example code files for all Packt video courses you have purchased from your account at http://www.PacktPub.com. If you purchased this course elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Chapter 1: Transformations and Actions
    1. The Course Overview 00:02:18
    2. Using Spark Transformations to Defer Computations to a Later Time 00:04:32
    3. Avoiding Transformations 00:03:52
    4. Using reduce and reduceByKey to Calculate Results 00:05:07
    5. Performing Actions That Trigger Computations 00:05:17
    6. Reusing the Same RDD for Different Actions 00:03:42
  2. Chapter 2: Immutable Design
    1. Delve into Spark RDDs Parent/Child Chain 00:06:12
    2. Using RDD in an Immutable Way 00:03:19
    3. Using DataFrame Operations to Transform It 00:03:29
    4. Immutability in the Highly Concurrent Environment 00:04:38
    5. Using Dataset API in an Immutable Way 00:02:46
  3. Chapter 3: Avoid Shuffle and Reduce Operational Expenses
    1. Detecting a Shuffle in a Processing 00:04:43
    2. Testing Operations That Cause Shuffle in Apache Spark 00:04:13
    3. Changing Design of Jobs with Wide Dependencies 00:03:13
    4. Using keyBy() Operations to Reduce Shuffle 00:03:36
    5. Using Custom Partitioner to Reduce Shuffle 00:03:28
  4. Chapter 4: Saving Data in the Correct Format
    1. Saving Data in Plain Text 00:04:57
    2. Leveraging JSON as a Data Format 00:04:11
    3. Tabular Formats – CSV 00:03:40
    4. Using Avro with Spark 00:04:15
    5. Columnar Formats – Parquet 00:03:45
  5. Chapter 5: Working with Spark Key/Value API
    1. Available Transformations on Key/Value Pairs 00:04:23
    2. Using aggregateByKey Instead of groupBy() 00:04:53
    3. Actions on Key/Value Pairs 00:03:14
    4. Available Partitioners on Key/Value Data 00:04:26
    5. Implementing Custom Partitioner 00:04:55
  6. Chapter 6: Testing Apache Spark Jobs
    1. Separating Logic from Spark Engine – Unit Testing 00:04:15
    2. Integration Testing Using SparkSession 00:03:27
    3. Mocking Data Sources Using Partial Functions 00:04:10
    4. Using ScalaCheck for Property-Based Testing 00:03:53
    5. Testing in Different Versions of Spark 00:03:26
  7. Chapter 7: Leveraging Spark GraphX API
    1. Creating Graph from Datasource 00:03:34
    2. Using Vertex API 00:05:01
    3. Using Edge API 00:03:00
    4. Calculate Degree of Vertex 00:03:57
    5. Calculate Page Rank 00:04:49