Taming Big Data with Apache Spark and Python - Hands On!

Video Description

More than 15 hands-on examples to help you analyze large data sets with Apache Spark

About This Video

  • Understand how Spark can be distributed across computing clusters

  • Develop and run Spark jobs efficiently using Python

  • A hands-on tutorial with over 15 real-world examples teaching you Big Data processing with Spark

In Detail

    Apache Spark has become a leading technology in the Big Data domain, rising from an emerging project to an established platform in just a few years. Spark lets you quickly extract actionable insights from large amounts of data in real time. This course is a hands-on companion for learning Apache Spark. It starts with setting up Spark on a single system or on a cluster, then moves from analyzing large data sets with Spark RDDs to developing and running effective Spark jobs in Python. With over 15 interactive examples drawn from real-world problems, the course helps you understand the Spark ecosystem and implement production-grade, real-time Spark projects.

    Table of Contents

    1. Chapter 1 : Getting Started with Spark
      1. Introduction 00:02:16
      2. How to Use This Course 00:01:41
      3. Getting Set Up – Installing Python, a JDK, Spark, and its Dependencies 00:14:53
      4. Installing the MovieLens Movie Rating Dataset 00:03:35
      5. Run Your First Spark Program – Ratings Histogram Example 00:04:53
    2. Chapter 2 : Spark Basics and Simple Examples
      1. Introduction to Spark 00:10:12
      2. The Resilient Distributed Dataset (RDD) 00:12:17
      3. Ratings Histogram Walkthrough 00:13:34
      4. Key/Value RDDs and the Average Friends by Age Example 00:16:13
      5. Running the Average Friends by Age Example 00:05:39
      6. Filtering RDDs and the Minimum Temperature by Location Example 00:08:10
      7. Running the Minimum Temperature Example and Modifying It for Maximums 00:05:09
      8. Running the Maximum Temperature by Location Example 00:03:22
      9. Counting Word Occurrences Using flatMap() 00:07:28
      10. Improving the Word Count Script with Regular Expressions 00:04:45
      11. Sorting the Word Count Results 00:07:45
      12. Find the Total Amount Spent by Customer 00:04:01
      13. Check Your Results and Sort Them by Total Amount Spent 00:05:08
      14. Check Your Sorted Implementation and Results Against Mine 00:03:19
    3. Chapter 3 : Advanced Examples of Spark Programs
      1. Find the Most Popular Movie 00:05:53
      2. Use Broadcast Variables to Display Movie Names Instead of ID Numbers 00:08:24
      3. Find the Most Popular Superhero in a Social Graph 00:04:29
      4. Run the Script – Discover Who the Most Popular Superhero is! 00:06:00
      5. Superhero Degrees of Separation – Introducing Breadth-First Search 00:07:54
      6. Superhero Degrees of Separation – Accumulators and Implementing BFS in Spark 00:06:45
      7. Superhero Degrees of Separation – Review the Code and Run it 00:09:14
      8. Item-Based Collaborative Filtering in Spark, cache(), and persist() 00:10:13
      9. Running the Similar Movies Script Using Spark's Cluster Manager 00:10:55
      10. Improve the Quality of Similar Movies 00:02:58
    4. Chapter 4 : Running Spark on a Cluster
      1. Introducing Elastic MapReduce 00:05:08
      2. Setting Up Your AWS / Elastic MapReduce Account and PuTTY 00:09:56
      3. Partitioning 00:04:22
      4. Create Similar Movies from One Million Ratings – Part 1 00:05:12
      5. Create Similar Movies from One Million Ratings – Part 2 00:11:28
      6. Create Similar Movies from One Million Ratings – Part 3 00:03:29
      7. Troubleshooting Spark on a Cluster 00:03:43
      8. More Troubleshooting and Managing Dependencies 00:05:48
    5. Chapter 5 : SparkSQL, DataFrames, and DataSets
      1. Introducing SparkSQL 00:06:08
      2. Executing SQL Commands and SQL-Style Functions on a DataFrame 00:08:17
      3. Using DataFrames Instead of RDDs 00:05:53
    6. Chapter 6 : Other Spark Technologies and Libraries
      1. Introducing MLlib 00:08:10
      2. Using MLlib to Produce Movie Recommendations 00:02:57
      3. Analyzing the ALS Recommendations Results 00:04:53
      4. Using DataFrames with MLlib 00:07:32
      5. Spark Streaming and GraphX 00:07:36
    7. Chapter 7 : You Made It! Where to Go from Here
      1. Learning More about Spark and Data Science 00:04:09