Apache Spark with Python - Big Data with PySpark and Spark

Video Description

Learn Apache Spark and Python through 12+ hands-on examples of analyzing big data with PySpark and Spark.

About This Video

  • Apache Spark is a general-purpose engine for building large-scale data applications, and one of the most disruptive big data technologies of the last decade.
  • Spark provides in-memory cluster computing, which greatly speeds up iterative algorithms and interactive data mining tasks.

In Detail

This course covers the fundamentals of Apache Spark with Python and teaches you how to develop Spark applications using PySpark, the Python API for Spark. By the end of the course, you will have in-depth knowledge of Apache Spark, along with general big data analysis and manipulation skills that can help your company adopt Spark for building big data processing pipelines and data analytics applications.

The course covers 10+ hands-on big data examples that show how to frame data analysis problems as Spark problems. Together we will:

  • aggregate NASA Apache web logs from different sources;
  • explore price trends in California real estate data;
  • write Spark applications to find the median salary of developers in different countries using Stack Overflow survey data;
  • develop a system to analyze how maker spaces are distributed across regions of the United Kingdom.

And much more.
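
To give a flavor of what framing an analysis as a Spark problem can look like, here is a minimal sketch of the median-salary-by-country example. It is not taken from the course material: the sample records and the function name `median_salary_by_country` are hypothetical, and plain Python stands in for Spark so the sketch runs without a cluster. In PySpark, the two steps would roughly correspond to the pair RDD operations `groupByKey` and `mapValues`.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (country, salary) records standing in for survey rows.
responses = [
    ("Canada", 70000),
    ("Canada", 90000),
    ("Brazil", 30000),
    ("Canada", 80000),
    ("Brazil", 50000),
]


def median_salary_by_country(records):
    """Group salaries by country, then take the median of each group."""
    # Step 1 (comparable to groupByKey): collect every salary under its country key.
    grouped = defaultdict(list)
    for country, salary in records:
        grouped[country].append(salary)
    # Step 2 (comparable to mapValues): reduce each group to one median value.
    return {country: median(salaries) for country, salaries in grouped.items()}


result = median_salary_by_country(responses)
print(result)
```

The point of the sketch is the decomposition: once the data is expressed as key-value pairs, the question becomes a group-then-summarize pipeline, which is the shape Spark distributes across a cluster.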

Table of Contents

  1. Chapter 1 : Get Started with Apache Spark
    1. Course Overview 00:04:09
    2. Introduction to Spark 00:02:28
    3. Install Java and Git 00:08:53
    4. Set up Spark 00:09:23
    5. Run our first Spark job 00:03:49
  2. Chapter 2 : RDD
    1. RDD Basics 00:02:50
    2. Create RDDs 00:02:33
    3. Map and Filter Transformation 00:09:29
    4. Solution to Airports by Latitude Problem 00:01:58
    5. FlatMap Transformation 00:03:46
    6. Set Operations 00:08:26
    7. Solution for the Same Hosts Problem 00:01:55
    8. Actions 00:09:03
    9. Solution to Sum of Numbers Problem 00:02:07
    10. Important Aspects about RDD 00:01:40
    11. Summary of RDD Operations 00:02:26
    12. Caching and Persistence 00:05:16
  3. Chapter 3 : Spark Architecture and Components
    1. Spark Architecture 00:03:01
    2. Spark Components 00:05:26
  4. Chapter 4 : Pair RDD
    1. Introduction to Pair RDD 00:01:38
    2. Create Pair RDDs 00:04:15
    3. Filter and MapValue Transformations on Pair RDD 00:05:17
    4. Reduce By Key Aggregation 00:05:39
    5. Solution for the Average House Problem 00:03:25
    6. Group By Key Transformation 00:05:15
    7. Sort By Key Transformation 00:02:51
    8. Solution for the Sorted Word Count Problem 00:03:24
    9. Data Partitioning 00:04:19
    10. Join Operations 00:05:13
  5. Chapter 5 : Advanced Spark Topics
    1. Accumulators 00:03:44
    2. Solution to StackOverflow Survey Follow-up Problem 00:01:06
    3. Broadcast Variables 00:06:46
  6. Chapter 6 : Spark SQL
    1. Introduction to Spark SQL 00:03:56
    2. Spark SQL in Action 00:13:12
    3. Spark SQL practice: House Price Problem 00:01:54
    4. Spark SQL Joins 00:07:04
    5. Dataframe or RDD 00:02:57
    6. Dataframe and RDD Conversion 00:02:55
    7. Performance Tuning of Spark SQL 00:02:52
  7. Chapter 7 : Running Spark in a Cluster
    1. Introduction to Running Spark in a Cluster 00:04:06
    2. Spark-submit 00:02:41
    3. Run Spark Application on Amazon EMR (Elastic MapReduce) cluster 00:15:10

Product Information

  • Title: Apache Spark with Python - Big Data with PySpark and Spark
  • Author(s): Pedro Magalhães Bernardo, Tao W, James Lee
  • Release date: April 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781789133394