Mastering Apache Spark 2.x - Second Edition

Book Description

Advanced analytics on your Big Data with the latest Apache Spark 2.x

About This Book

  • An advanced guide that combines instructions and practical examples to extend the most up-to-date Spark functionality.
  • Extend your data processing capabilities to process huge chunks of data in minimal time using advanced concepts in Spark.
  • Master the art of real-time processing with the help of Apache Spark 2.x.

Who This Book Is For

If you are a developer with some experience with Spark and want to strengthen your knowledge of how to get around in the world of Spark, then this book is ideal for you. Basic knowledge of Linux, Hadoop and Spark is assumed. Reasonable knowledge of Scala is expected.

What You Will Learn

  • Examine advanced machine learning and deep learning with MLlib, SparkML, SystemML, H2O and Deeplearning4j
  • Study highly optimised unified batch and real-time data processing using SparkSQL and Structured Streaming
  • Evaluate large-scale graph processing and analysis using GraphX and GraphFrames
  • Apply Apache Spark in elastic deployments using Jupyter and Zeppelin Notebooks, Docker, Kubernetes and the IBM Cloud
  • Understand internal details of the cost-based optimizers used in Catalyst, SystemML and GraphFrames
  • Learn how specific parameter settings affect the overall performance of an Apache Spark cluster
  • Leverage Scala, R and Python for your data science projects

In Detail

Apache Spark is an in-memory cluster-based parallel processing system that provides a wide range of functionalities such as graph processing, machine learning, stream processing, and SQL. This book aims to take your knowledge of Spark to the next level by teaching you how to expand Spark’s functionality and implement your data flows and machine/deep learning programs on top of the platform.

The book commences with an overview of the Spark ecosystem. It will introduce you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x.

You will understand how memory management and binary processing, cache-aware computation, and code generation are used to speed things up dramatically. The book goes on to show how to incorporate H2O, SystemML, and Deeplearning4j for machine learning, and Jupyter Notebooks and Kubernetes/Docker for cloud-based Spark. During the course of the book, you will learn about the latest enhancements to Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.

You will also learn about the updates on the APIs and how DataFrames and Datasets affect SQL, machine learning, graph processing, and streaming. You will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.

Style and approach

This book is an extensive guide to Apache Spark modules and tools. Through worked examples, it shows how Spark's functionality can be extended for real-time processing and storage.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. A First Taste and What’s New in Apache Spark V2
    1. Spark machine learning
    2. Spark Streaming
    3. Spark SQL
    4. Spark graph processing
    5. Extended ecosystem
    6. What's new in Apache Spark V2?
    7. Cluster design
    8. Cluster management
      1. Local
      2. Standalone
      3. Apache YARN
      4. Apache Mesos
    9. Cloud-based deployments
    10. Performance
      1. The cluster structure
      2. Hadoop Distributed File System
      3. Data locality
      4. Memory
      5. Coding
    11. Cloud
    12. Summary
  3. Apache Spark SQL
    1. The SparkSession--your gateway to structured data processing
    2. Importing and saving data
      1. Processing the text files
      2. Processing JSON files
      3. Processing the Parquet files
    3. Understanding the DataSource API
      1. Implicit schema discovery
      2. Predicate push-down on smart data sources
    4. DataFrames
    5. Using SQL
      1. Defining schemas manually
      2. Using SQL subqueries
      3. Applying SQL table joins
    6. Using Datasets
      1. The Dataset API in action
    7. User-defined functions
    8. RDDs versus DataFrames versus Datasets
    9. Summary
  4. The Catalyst Optimizer
    1. Understanding the workings of the Catalyst Optimizer
    2. Managing temporary views with the catalog API
    3. The SQL abstract syntax tree
    4. How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
      1. Internal class and object representations of LEPs
      2. How to optimize the Resolved Logical Execution Plan
        1. Physical Execution Plan generation and selection
    5. Code generation
      1. Practical examples
      2. Using the explain method to obtain the PEP
      3. How smart data sources work internally
    6. Summary
  5. Project Tungsten
    1. Memory management beyond the Java Virtual Machine Garbage Collector
      1. Understanding the UnsafeRow object
        1. The null bit set region
        2. The fixed length values region
        3. The variable length values region
      2. Understanding the BytesToBytesMap
      3. A practical example on memory usage and performance
    2. Cache-friendly layout of data in memory
      1. Cache eviction strategies and pre-fetching
    3. Code generation
      1. Understanding columnar storage
      2. Understanding whole stage code generation
        1. A practical example on whole stage code generation performance
        2. Operator fusing versus the volcano iterator model
    4. Summary
  6. Apache Spark Streaming
    1. Overview
    2. Errors and recovery
      1. Checkpointing
    3. Streaming sources
      1. TCP stream
      2. File streams
      3. Flume
      4. Kafka
    4. Summary
  7. Structured Streaming
    1. The concept of continuous applications
      1. True unification - same code, same engine
    2. Windowing
      1. How streaming engines use windowing
      2. How Apache Spark improves windowing
    3. Increased performance with good old friends
    4. How transparent fault tolerance and exactly-once delivery guarantee is achieved
      1. Replayable sources can replay streams from a given offset
      2. Idempotent sinks prevent data duplication
      3. State versioning guarantees consistent results after reruns
    5. Example - connection to a MQTT message broker
      1. Controlling continuous applications
      2. More on stream life cycle management
    6. Summary
  8. Apache Spark MLlib
    1. Architecture
      1. The development environment
    2. Classification with Naive Bayes
      1. Theory on Classification
      2. Naive Bayes in practice
    3. Clustering with K-Means
      1. Theory on Clustering
      2. K-Means in practice
    4. Artificial neural networks
      1. ANN in practice
    5. Summary
  9. Apache SparkML
    1. What does the new API look like?
    2. The concept of pipelines
      1. Transformers
        1. String indexer
        2. OneHotEncoder
        3. VectorAssembler
      2. Pipelines
      3. Estimators
        1. RandomForestClassifier
    3. Model evaluation
    4. CrossValidation and hyperparameter tuning
      1. CrossValidation
      2. Hyperparameter tuning
    5. Winning a Kaggle competition with Apache SparkML
      1. Data preparation
      2. Feature engineering
      3. Testing the feature engineering pipeline
      4. Training the machine learning model
      5. Model evaluation
      6. CrossValidation and hyperparameter tuning
      7. Using the evaluator to assess the quality of the cross-validated and tuned model
    6. Summary
  10. Apache SystemML
    1. Why do we need just another library?
      1. Why on Apache Spark?
      2. The history of Apache SystemML
    2. A cost-based optimizer for machine learning algorithms
      1. An example - alternating least squares
      2. Apache SystemML architecture
        1. Language parsing
        2. High-level operators are generated
        3. How low-level operators are optimized on
    3. Performance measurements
    4. Apache SystemML in action
    5. Summary
  11. Deep Learning on Apache Spark with Deeplearning4j and H2O
    1. H2O
      1. Overview
        1. The build environment
        2. Architecture
        3. Sourcing the data
        4. Data quality
        5. Performance tuning
        6. Deep Learning
          1. Example code – income
        7. The example code – MNIST
        8. H2O Flow
    2. Deeplearning4j
      1. ND4J - high performance linear algebra for the JVM
      2. Deeplearning4j
      3. Example: an IoT real-time anomaly detector
        1. Mastering chaos: the Lorenz attractor model
      4. Deploying the test data generator
        1. Deploy the Node-RED IoT Starter Boilerplate to the IBM Cloud
        2. Deploying the test data generator flow
        3. Testing the test data generator
      5. Install the Deeplearning4j example within Eclipse
      6. Running the examples in Eclipse
      7. Run the examples in Apache Spark
    3. Summary
  12. Apache Spark GraphX
    1. Overview
    2. Graph analytics/processing with GraphX
      1. The raw data
      2. Creating a graph
      3. Example 1 – counting
      4. Example 2 – filtering
      5. Example 3 – PageRank
      6. Example 4 – triangle counting
      7. Example 5 – connected components
    3. Summary
  13. Apache Spark GraphFrames
    1. Architecture
      1. Graph-relational translation
      2. Materialized views
      3. Join elimination
      4. Join reordering
    2. Examples
      1. Example 1 – counting
      2. Example 2 – filtering
      3. Example 3 – PageRank
      4. Example 4 – triangle counting
      5. Example 5 – connected components
    3. Summary
  14. Apache Spark with Jupyter Notebooks on IBM Data Science Experience
    1. Why notebooks are the new standard
    2. Learning by example
      1. The IEEE PHM 2012 data challenge bearing dataset
      2. ETL with Scala
      3. Interactive, exploratory analysis using Python and Pixiedust
      4. Real data science work with SparkR
    3. Summary
  15. Apache Spark on Kubernetes
    1. Bare metal, virtual machines, and containers
      1. Containerization
        1. Namespaces
        2. Control groups
        3. Linux containers
    2. Understanding the core concepts of Docker
    3. Understanding Kubernetes
    4. Using Kubernetes for provisioning containerized Spark applications
    5. Example--Apache Spark on Kubernetes
      1. Prerequisites
      2. Deploying the Apache Spark master
      3. Deploying the Apache Spark workers
      4. Deploying the Zeppelin notebooks
    6. Summary