Apache Spark 2.x Cookbook

Book Description

Over 70 recipes to help you use Apache Spark as your single big data computing platform and master its libraries

About This Book

  • Contains recipes on how to use Apache Spark as a unified compute engine
  • Covers how to connect various source systems to Apache Spark
  • Covers various parts of machine learning, including supervised and unsupervised learning and recommendation engines

Who This Book Is For

This book is for data engineers, data scientists, and those who want to implement Spark for real-time data processing. Anyone who is using Spark (or is planning to) will benefit from this book. The book assumes you have a basic knowledge of Scala as a programming language.

What You Will Learn

  • Install and configure Apache Spark with various cluster managers & on AWS
  • Set up a development environment for Apache Spark, including a Databricks Cloud notebook
  • Find out how to operate on data in Spark with schemas
  • Get to grips with real-time streaming analytics using Spark Streaming & Structured Streaming
  • Master supervised learning and unsupervised learning using MLlib
  • Build a recommendation engine using MLlib
  • Perform graph processing using the GraphX and GraphFrames libraries
  • Develop a set of common applications or project types, and solutions that solve complex big data problems

In Detail

While Apache Spark 1.x gained a lot of traction and adoption in its early years, Spark 2.x delivers notable improvements in API design, schema awareness, performance, and Structured Streaming, and it simplifies the building blocks needed to create better, faster, smarter, and more accessible big data applications. This book covers all these features in the form of structured recipes for analyzing large and complex sets of data.

Starting with installing and configuring Apache Spark with various cluster managers, you will learn to set up development environments. You will then be introduced to working with RDDs, DataFrames, and Datasets to operate on schema-aware data, and to real-time streaming with various sources such as Twitter Stream and Apache Kafka. You will also work through recipes on machine learning, including supervised learning, unsupervised learning, and recommendation engines in Spark.

Last but not least, the final few chapters delve deeper into the concepts of graph processing using GraphX, securing your implementations, cluster optimization, and troubleshooting.

Style and approach

This book is packed with intuitive recipes supported by line-by-line explanations to help you understand Spark 2.x’s real-time processing capabilities and deploy scalable big data solutions. This is a valuable resource for data scientists and those working on large-scale data projects.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to obtain the code files.

Table of Contents

  1. www.PacktPub.com
  2. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Sections
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    5. Conventions
    6. Reader feedback
    7. Customer support
      1. Downloading the color images of this book
      2. Errata
      3. Piracy
      4. Questions
  3. Getting Started with Apache Spark
    1. Introduction
    2. Leveraging Databricks Cloud
      1. How to do it...
      2. How it works...
        1. Cluster
        2. Notebook
        3. Table
        4. Library
    3. Deploying Spark using Amazon EMR
      1. What it represents is much bigger than what it looks
      2. EMR's architecture
      3. How to do it...
      4. How it works...
        1. EC2 instance types
          1. T2 - Free Tier Burstable (EBS only)
          2. M4 - General purpose (EBS only)
          3. C4 - Compute optimized
          4. X1 - Memory optimized
          5. R4 - Memory optimized
          6. P2 - General purpose GPU
          7. I3 - Storage optimized
          8. D2 - Storage optimized
    4. Installing Spark from binaries
      1. Getting ready
      2. How to do it...
    5. Building the Spark source code with Maven
      1. Getting ready
      2. How to do it...
    6. Launching Spark on Amazon EC2
      1. Getting ready
      2. How to do it...
      3. See also
    7. Deploying Spark on a cluster in standalone mode
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    8. Deploying Spark on a cluster with Mesos
      1. How to do it...
    9. Deploying Spark on a cluster with YARN
      1. Getting ready
      2. How to do it...
      3. How it works...
    10. Understanding SparkContext and SparkSession
      1. SparkContext
      2. SparkSession
    11. Understanding resilient distributed dataset - RDD
      1. How to do it...
  4. Developing Applications with Spark
    1. Introduction
    2. Exploring the Spark shell
      1. How to do it...
      2. There's more...
    3. Developing a Spark application in Eclipse with Maven
      1. Getting ready
      2. How to do it...
    4. Developing a Spark application in Eclipse with SBT
      1. How to do it...
    5. Developing a Spark application in IntelliJ IDEA with Maven
      1. How to do it...
    6. Developing a Spark application in IntelliJ IDEA with SBT
      1. How to do it...
    7. Developing applications using the Zeppelin notebook
      1. How to do it...
    8. Setting up Kerberos to do authentication
      1. How to do it...
      2. There's more...
    9. Enabling Kerberos authentication for Spark
      1. How to do it...
      2. There's more...
        1. Securing data at rest
        2. Securing data in transit
  5. Spark SQL
    1. Understanding the evolution of schema awareness
      1. Getting ready
        1. DataFrames
        2. Datasets
        3. Schema-aware file formats
    2. Understanding the Catalyst optimizer
      1. Analysis
      2. Logical plan optimization
      3. Physical planning
      4. Code generation
    3. Inferring schema using case classes
      1. How to do it...
      2. There's more...
    4. Programmatically specifying the schema
      1. How to do it...
      2. How it works...
    5. Understanding the Parquet format
      1. How to do it...
      2. How it works...
        1. Partitioning
        2. Predicate pushdown
        3. Parquet Hive interoperability
    6. Loading and saving data using the JSON format
      1. How to do it...
      2. How it works...
    7. Loading and saving data from relational databases
      1. Getting ready
      2. How to do it...
    8. Loading and saving data from an arbitrary source
      1. How to do it...
      2. There's more...
    9. Understanding joins
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. Shuffle hash join
        2. Broadcast hash join
        3. The cartesian join
      4. There's more...
    10. Analyzing nested structures
      1. Getting ready
      2. How to do it...
  6. Working with External Data Sources
    1. Introduction
    2. Loading data from the local filesystem
      1. How to do it...
    3. Loading data from HDFS
      1. How to do it...
    4. Loading data from Amazon S3
      1. How to do it...
    5. Loading data from Apache Cassandra
      1. How to do it...
      2. How it works...
        1. CAP Theorem
        2. Cassandra partitions
        3. Consistency levels
  7. Spark Streaming
    1. Introduction
      1. Classic Spark Streaming
      2. Structured Streaming
    2. WordCount using Structured Streaming
      1. How to do it...
    3. Taking a closer look at Structured Streaming
      1. How to do it...
      2. There's more...
    4. Streaming Twitter data
      1. How to do it...
    5. Streaming using Kafka
      1. Getting ready
      2. How to do it...
    6. Understanding streaming challenges
      1. Late arriving/out-of-order data
      2. Maintaining the state in between batches
      3. Message delivery reliability
      4. Streaming is not an island
  8. Getting Started with Machine Learning
    1. Introduction
    2. Creating vectors
      1. Getting ready
      2. How to do it...
      3. How it works...
    3. Calculating correlation
      1. Getting ready
      2. How to do it...
    4. Understanding feature engineering
      1. Feature selection
        1. Quality of features
        2. Number of features
      2. Feature scaling
      3. Feature extraction
        1. TF-IDF
          1. Term frequency
          2. Inverse document frequency
      4. How to do it...
    5. Understanding Spark ML
      1. Getting ready
      2. How to do it...
    6. Understanding hyperparameter tuning
      1. How to do it...
  9. Supervised Learning with MLlib — Regression
    1. Introduction
    2. Using linear regression
      1. Getting ready
      2. How to do it...
      3. There's more...
    3. Understanding the cost function
      1. There's more...
    4. Doing linear regression with lasso
      1. Bias versus variance
      2. How to do it...
    5. Doing ridge regression
  10. Supervised Learning with MLlib — Classification
    1. Introduction
    2. Doing classification using logistic regression
      1. Getting ready
      2. How to do it...
      3. There's more...
        1. What is ROC?
    3. Doing binary classification using SVM
      1. Getting ready
      2. How to do it...
    4. Doing classification using decision trees
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    5. Doing classification using random forest
      1. Getting ready
      2. How to do it...
    6. Doing classification using gradient boosted trees
      1. Getting ready
      2. How to do it...
    7. Doing classification with Naïve Bayes
      1. Getting ready
      2. How to do it...
  11. Unsupervised Learning
    1. Introduction
    2. Clustering using k-means
      1. Getting ready
      2. How to do it...
    3. Dimensionality reduction with principal component analysis
      1. Getting ready
      2. How to do it...
    4. Dimensionality reduction with singular value decomposition
      1. Getting ready
      2. How to do it...
  12. Recommendations Using Collaborative Filtering
    1. Introduction
    2. Collaborative filtering using explicit feedback
      1. Getting ready
      2. How to do it...
        1. Adding my recommendations and then testing predictions
      3. There's more...
    3. Collaborative filtering using implicit feedback
      1. How to do it...
  13. Graph Processing Using GraphX and GraphFrames
    1. Introduction
    2. Fundamental operations on graphs
      1. Getting ready
      2. How to do it...
    3. Using PageRank
      1. Getting ready
      2. How to do it...
    4. Finding connected components
      1. Getting ready
      2. How to do it...
    5. Performing neighborhood aggregation
      1. Getting ready
      2. How to do it...
    6. Understanding GraphFrames
      1. How to do it...
  14. Optimizations and Performance Tuning
    1. Optimizing memory
      1. How to do it...
      2. How it works...
        1. Garbage collection
          1. Mark and sweep
          2. G1
        2. Spark memory allocation
    2. Leveraging speculation
      1. How to do it...
    3. Optimizing joins
      1. How to do it...
    4. Using compression to improve performance
      1. How to do it...
    5. Using serialization to improve performance
      1. How to do it...
      2. There's more...
    6. Optimizing the level of parallelism
      1. How to do it...
    7. Understanding project Tungsten
      1. How to do it...
      2. How it works...
        1. Tungsten phase 1
          1. Bypassing GC
          2. Cache conscious computation
          3. Code generation for expression evaluation
        2. Tungsten phase 2
          1. Whole-stage code generation
          2. In-memory columnar format