Spark in Action Video Edition

Video Description

"Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide."
Jonathan Sharley, Pandora Media

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code.

Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and countless other upgrades.

Inside:

  • Updated for Spark 2.0
  • Real-life case studies
  • Spark DevOps with Docker
  • Examples in Scala, and online in Java and Python

Made for experienced programmers with some background in big data or machine learning.

Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community.

"Must-have! Speed up your learning of Spark as a distributed computing framework."
Robert Ormandi, Yahoo!

"An easy-to-follow, step-by-step guide."
Gaurav Bhardwaj, 3Pillar Global

"An ambitiously comprehensive overview of Spark and its diverse ecosystem."
Jonathan Miller, Optensity

NARRATED BY KYLE JACKSON AND MARK THOMAS

Table of Contents

  1. PART 1: FIRST STEPS
    1. Chapter 1. Introduction to Apache Spark 00:08:53
    2. Chapter 1. What Spark brings to the table 00:06:52
    3. Chapter 1. Spark components 00:06:57
    4. Chapter 1. Spark program flow 00:08:57
    5. Chapter 1. Setting up the spark-in-action VM 00:06:45
    6. Chapter 2. Spark fundamentals 00:07:12
    7. Chapter 2. Using the VM’s Hadoop installation 00:05:34
    8. Chapter 2. Using Spark shell and writing your first Spark program 00:11:07
    9. Chapter 2. Basic RDD actions and transformations 00:05:31
    10. Chapter 2. Using the distinct and flatMap transformations 00:09:01
    11. Chapter 2. Obtaining RDD’s elements with the sample, take, and takeSample operations 00:05:10
    12. Chapter 2. Double RDD functions 00:09:12
    13. Chapter 3. Writing Spark applications 00:11:22
    14. Chapter 3. Developing the application 00:09:49
    15. Chapter 3. Running the application from Eclipse 00:11:05
    16. Chapter 3. Broadcast variables 00:06:40
    17. Chapter 3. Submitting the application 00:09:06
    18. Chapter 3. Using spark-submit 00:05:29
    19. Chapter 4. The Spark API in depth 00:05:49
    20. Chapter 4. Basic pair RDD functions 00:08:18
    21. Chapter 4. Using the flatMapValues transformation to add values to keys 00:08:13
    22. Chapter 4. Understanding data partitioning and reducing data shuffling 00:07:52
    23. Chapter 4. Understanding and avoiding unnecessary shuffling 00:10:21
    24. Chapter 4. Repartitioning RDDs 00:08:15
    25. Chapter 4. Joining, sorting, and grouping data 00:11:24
    26. Chapter 4. Joining data 00:07:06
    27. Chapter 4. Sorting data 00:10:05
    28. Chapter 4. Grouping data 00:07:41
    29. Chapter 4. Understanding RDD dependencies 00:09:47
    30. Chapter 4. Using accumulators and broadcast variables to communicate with Spark executors 00:07:22
    31. Chapter 4. Sending data to executors using broadcast variables 00:06:38
  2. PART 2: MEET THE SPARK FAMILY
    1. Chapter 5. Sparkling queries with Spark SQL 00:09:58
    2. Chapter 5. Creating DataFrames from RDDs 00:07:47
    3. Chapter 5. Creating a DataFrame from an RDD of tuples 00:08:10
    4. Chapter 5. DataFrame API basics 00:08:01
    5. Chapter 5. Using SQL functions to perform calculations on data 00:11:14
    6. Chapter 5. Working with missing values 00:05:43
    7. Chapter 5. Grouping and joining data 00:10:18
    8. Chapter 5. Beyond DataFrames: introducing DataSets 00:05:20
    9. Chapter 5. Table catalog and Hive metastore 00:06:49
    10. Chapter 5. Executing SQL queries 00:08:33
    11. Chapter 5. Saving and loading DataFrame data 00:05:21
    12. Chapter 5. Saving data 00:10:20
    13. Chapter 5. Catalyst optimizer 00:11:29
    14. Chapter 6. Ingesting data with Spark Streaming 00:09:33
    15. Chapter 6. Creating a discretized stream 00:08:55
    16. Chapter 6. Saving the results to a file 00:07:19
    17. Chapter 6. Saving the computation state over time 00:08:02
    18. Chapter 6. Specifying the checkpointing directory 00:06:18
    19. Chapter 6. Using window operations for time-limited calculations 00:06:59
    20. Chapter 6. Using external data sources 00:05:52
    21. Chapter 6. Changing the streaming application to use Kafka 00:09:19
    22. Chapter 6. Performance of Spark Streaming jobs 00:10:29
    23. Chapter 6. Structured Streaming 00:11:45
    24. Chapter 7. Getting smart with MLlib 00:12:12
    25. Chapter 7. Classification of machine-learning algorithms 00:10:27
    26. Chapter 7. Linear algebra in Spark 00:10:24
    27. Chapter 7. Distributed matrices 00:03:44
    28. Chapter 7. Linear regression 00:07:01
    29. Chapter 7. Expanding the model to multiple linear regression 00:05:19
    30. Chapter 7. Analyzing and preparing the data 00:11:20
    31. Chapter 7. Fitting and using a linear regression model 00:07:23
    32. Chapter 7. Tweaking the algorithm 00:10:52
    33. Chapter 7. Plotting residual plots 00:09:44
    34. Chapter 7. Optimizing linear regression 00:10:32
    35. Chapter 8. ML: classification and clustering 00:09:42
    36. Chapter 8. Logistic regression 00:06:59
    37. Chapter 8. Preparing data to use logistic regression in Spark 00:08:58
    38. Chapter 8. Training the model 00:12:31
    39. Chapter 8. Performing k-fold cross-validation 00:07:51
    40. Chapter 8. Decision trees and random forests 00:06:51
    41. Chapter 8. Decision trees 00:08:06
    42. Chapter 8. Random forests 00:03:49
    43. Chapter 8. Using k-means clustering 00:03:59
    44. Chapter 8. K-means clustering 00:11:03
    45. Chapter 8. Summary 00:03:01
    46. Chapter 9. Connecting the dots with GraphX 00:09:55
    47. Chapter 9. Transforming graphs 00:10:03
    48. Chapter 9. Graph algorithms 00:13:17
    49. Chapter 9. Implementing the A* search algorithm 00:04:59
    50. Chapter 9. Implementing the A* algorithm 00:11:37
    51. Chapter 9. Summary 00:02:47
  3. PART 3: SPARK OPS
    1. Chapter 10. Running Spark 00:11:42
    2. Chapter 10. Job and resource scheduling 00:10:01
    3. Chapter 10. Data-locality considerations 00:06:58
    4. Chapter 10. Configuring Spark 00:07:50
    5. Chapter 10. Spark web UI 00:06:25
    6. Chapter 10. Running Spark on the local machine 00:06:29
    7. Chapter 11. Running on a Spark standalone cluster 00:05:35
    8. Chapter 11. Starting the standalone cluster 00:06:37
    9. Chapter 11. Viewing Spark processes 00:05:45
    10. Chapter 11. Standalone cluster web UI 00:08:32
    11. Chapter 11. Specifying extra classpath entries and files 00:07:10
    12. Chapter 11. Spark History Server and event logging 00:08:17
    13. Chapter 11. Creating an EC2 standalone cluster 00:07:23
    14. Chapter 11. Using the EC2 cluster 00:07:31
    15. Chapter 12. Running on YARN and Mesos 00:09:10
    16. Chapter 12. Resource scheduling in YARN 00:06:59
    17. Chapter 12. Configuring Spark on YARN 00:05:07
    18. Chapter 12. Configuring resources for Spark jobs 00:08:16
    19. Chapter 12. Finding logs on YARN 00:09:44
    20. Chapter 12. Running Spark on Mesos 00:09:44
    21. Chapter 12. Installing and configuring Mesos 00:04:52
    22. Chapter 12. Mesos resource scheduling 00:09:13
    23. Chapter 12. Running Spark with Docker 00:10:52
  4. PART 4: BRINGING IT TOGETHER
    1. Chapter 13. Case study: real-time dashboard 00:06:30
    2. Chapter 13. Running the application 00:08:59
    3. Chapter 13. Starting the application manually 00:04:23
    4. Chapter 13. Understanding the source code 00:09:25
    5. Chapter 13. The StreamingLogAnalyzer project 00:08:52
    6. Chapter 14. Deep learning on Spark with H2O 00:07:01
    7. Chapter 14. Using H2O with Spark 00:11:47
    8. Chapter 14. Performing regression with H2O’s deep learning 00:11:11
    9. Chapter 14. Building and evaluating a deep-learning model using the Sparkling Water API 00:04:41
    10. Chapter 14. Performing classification with H2O’s deep learning 00:10:06