Learning Apache Spark 2

Book Description

Learn about the fastest-growing open source project in the world, and find out how it revolutionizes big data analytics

About This Book

  • An exclusive guide that covers how to get up and running with fast data processing using Apache Spark
  • Explore and exploit various possibilities with Apache Spark using real-world use cases
  • Want to perform efficient data processing in real time? This book will be your one-stop solution.

Who This Book Is For

This guide appeals to big data engineers, analysts, architects, software engineers, and even technical managers who need to perform efficient data processing on Hadoop in real time. Basic familiarity with Java or Scala will be helpful.

Readers are assumed to come from mixed backgrounds, but will typically have an engineering or data science background, no prior Spark experience, and a desire to understand how Spark can help them on their analytics journey.

What You Will Learn

  • Get an overview of big data analytics and its importance for organizations and data professionals
  • Delve into Spark to see how it differs from existing processing platforms
  • Understand the intricacies of various file formats, and how to process them with Apache Spark
  • Learn how to deploy Spark with YARN, Mesos, or a standalone cluster manager
  • Learn the concepts of Spark SQL, SchemaRDD, caching, and working with Hive and Parquet file formats
  • Understand the architecture of Spark MLlib, along with some of the off-the-shelf algorithms that come with Spark
  • Introduce yourself to the deployment and usage of SparkR
  • Walk through the importance of graph computation and the graph processing systems available in the market
  • Work through a real-world Spark example by building a recommendation engine with ALS
  • Use a telco dataset to predict customer churn using random forests

In Detail

The Spark juggernaut keeps rolling and gaining more momentum each day. Spark provides key capabilities in the form of Spark SQL, Spark Streaming, Spark ML, and GraphX, all accessible via Java, Scala, Python, and R. Deploying these capabilities correctly is crucial, whether on a standalone cluster or as part of an existing Hadoop installation configured with YARN or Mesos.

The next part of the journey after installation is using the key components: APIs, clustering, machine learning APIs, data pipelines, and parallel programming. It is important to understand why each framework component is key, how widely it is used, how stable it is, and its pertinent use cases.

Once we understand the individual components, we will work through a couple of real-life advanced analytics examples, such as building a recommendation system and predicting customer churn.

The objective of these real-life examples is to give the reader confidence to use Spark for real-world problems.

Style and approach

With the help of practical examples and real-world use cases, this guide will take you from scratch to building efficient data applications using Apache Spark.

You will learn all about this excellent data processing engine in a step-by-step manner, taking one aspect of it at a time.

This highly practical guide covers how to work with data pipelines, DataFrames, clustering, Spark SQL, parallel programming, and other key topics with the help of real-world use cases.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent to you.

Table of Contents

  1. Learning Apache Spark 2
    1. Learning Apache Spark 2
    2. Credits
    3. About the Author
    4. About the Reviewers
    5. www.packtpub.com
      1. Why subscribe?
    6. Customer Feedback
    7. Preface
      1. The Past
        1. Why are people so excited about Spark?
      2. What this book covers
      3. What you need for this book
      4. Who this book is for
      5. Conventions
      6. Reader feedback
      7. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Architecture and Installation
      1. Apache Spark architecture overview
        1. Spark-core
        2. Spark SQL
        3. Spark streaming
        4. MLlib
        5. GraphX
        6. Spark deployment
      2. Installing Apache Spark
      3. Writing your first Spark program
        1. Scala shell examples
        2. Python shell examples
      4. Spark architecture
        1. High level overview
          1. Driver program
          2. Cluster Manager
          3. Worker
          4. Executors
          5. Tasks
          6. SparkContext
          7. Spark Session
      5. Apache Spark cluster manager types
        1. Building standalone applications with Apache Spark
        2. Submitting applications
        3. Deployment strategies
      6. Running Spark examples
        1. Building your own programs
      7. Brain teasers
      8. References
      9. Summary
    9. 2. Transformations and Actions with Spark RDDs
      1. What is an RDD?
        1. Constructing RDDs
          1. Parallelizing existing collections
          2. Referencing external data source
      2. Operations on RDD
        1. Transformations
        2. Actions
      3. Passing functions to Spark (Scala)
        1. Anonymous functions
        2. Static singleton functions
      4. Passing functions to Spark (Java)
      5. Passing functions to Spark (Python)
      6. Transformations
        1. Map(func)
        2. Filter(func)
        3. flatMap(func)
        4. Sample (withReplacement, fraction, seed)
      7. Set operations in Spark
        1. Distinct()
        2. Intersection()
        3. Union()
        4. Subtract()
        5. Cartesian()
      8. Actions
        1. Reduce(func)
        2. Collect()
        3. Count()
        4. Take(n)
        5. First()
        6. SaveAsXXFile()
        7. foreach(func)
      9. PairRDDs
        1. Creating PairRDDs
        2. PairRDD transformations
          1. reduceByKey(func)
          2. GroupByKey(func)
          3. reduceByKey vs. groupByKey - Performance Implications
          4. CombineByKey(func)
        3. Transformations on two PairRDDs
          1. Actions available on PairRDDs
      10. Shared variables
        1. Broadcast variables
        2. Accumulators
      11. References
      12. Summary
    10. 3. ETL with Spark
      1. What is ETL?
        1. Extraction
        2. Loading
        3. Transformation
      2. How is Spark being used?
      3. Commonly Supported File Formats
        1. Text Files
        2. CSV and TSV Files
          1. Writing CSV files
          2. Tab Separated Files
        3. JSON files
        4. Sequence files
        5. Object files
      4. Commonly supported file systems
        1. Working with HDFS
        2. Working with Amazon S3
      5. Structured Data sources and Databases
        1. Working with NoSQL Databases
          1. Working with Cassandra
            1. Obtaining a Cassandra table as an RDD
            2. Saving data to Cassandra
          2. Working with HBase
            1. Bulk Delete example
            2. Map Partition Example
          3. Working with MongoDB
            1. Connection to MongoDB
            2. Writing to MongoDB
            3. Loading data from MongoDB
          4. Working with Apache Solr
            1. Importing the JAR File via Spark-shell
            2. Connecting to Solr via DataFrame API
            3. Connecting to Solr via RDD
      6. References
      7. Summary
    11. 4. Spark SQL
      1. What is Spark SQL?
      2. What is DataFrame API?
      3. What is DataSet API?
      4. What's new in Spark 2.0?
        1. Under the hood - catalyst optimizer
          1. Solution 1
          2. Solution 2
      5. The Sparksession
        1. Creating a SparkSession
      6. Creating a DataFrame
        1. Manipulating a DataFrame
          1. Scala DataFrame manipulation - examples
          2. Python DataFrame manipulation - examples
          3. R DataFrame manipulation - examples
          4. Java DataFrame manipulation - examples
        2. Reverting to an RDD from a DataFrame
        3. Converting an RDD to a DataFrame
        4. Other data sources
      7. Parquet files
      8. Working with Hive
        1. Hive configuration
      9. SparkSQL CLI
        1. Working with other databases
      10. References
      11. Summary
    12. 5. Spark Streaming
      1. What is Spark Streaming?
        1. DStream
        2. StreamingContext
      2. Steps involved in a streaming app
      3. Architecture of Spark Streaming
        1. Input sources
          1. Core/basic sources
          2. Advanced sources
          3. Custom sources
        2. Transformations
          1. Sliding window operations
        3. Output operations
      4. Caching and persistence
      5. Checkpointing
        1. Setting up checkpointing
          1. Setting up checkpointing with Scala
          2. Setting up checkpointing with Java
          3. Setting up checkpointing with Python
            1. Automatic driver restart
      6. DStream best practices
      7. Fault tolerance
        1. Worker failure impact on receivers
        2. Worker failure impact on RDDs/DStreams
        3. Worker failure impact on output operations
      8. What is Structured Streaming?
        1. Under the hood
        2. Structured Spark Streaming API: Entry point
          1. Output modes
            1. Append mode
            2. Complete mode
            3. Update mode
          2. Output sinks
        3. Failure recovery and checkpointing
      9. References
      10. Summary
    13. 6. Machine Learning with Spark
      1. What is machine learning?
      2. Why machine learning?
      3. Types of machine learning
      4. Introduction to Spark MLLib
      5. Why do we need the Pipeline API?
      6. How does it work?
        1. Scala syntax - building a pipeline
        2. Building a pipeline
        3. Predictions on test documents
        4. Python program - predictions on test documents
      7. Feature engineering
        1. Feature extraction algorithms
        2. Feature transformation algorithms
        3. Feature selection algorithms
      8. Classification and regression
        1. Classification
        2. Regression
      9. Clustering
      10. Collaborative filtering
      11. ML-tuning - model selection and hyperparameter tuning
      12. References
      13. Summary
    14. 7. GraphX
      1. Graphs in everyday life
      2. What is a graph?
      3. Why are Graphs elegant?
      4. What is GraphX?
      5. Creating your first Graph (RDD API)
        1. Code samples
      6. Basic graph operators (RDD API)
        1. List of graph operators (RDD API)
      7. Caching and uncaching of graphs
      8. Graph algorithms in GraphX
        1. PageRank
          1. Code example -- PageRank algorithm
        2. Connected components
          1. Code example -- connected components
        3. Triangle counting
      9. GraphFrames
        1. Why GraphFrames?
        2. Basic constructs of a GraphFrame
        3. Motif finding
        4. GraphFrames algorithms
        5. Loading and saving of GraphFrames
      10. Comparison between GraphFrames and GraphX
        1. GraphX <=> GraphFrames
          1. Converting from GraphFrame to GraphX
          2. Converting from GraphX to GraphFrames
      11. References
      12. Summary
    15. 8. Operating in Clustered Mode
      1. Clusters, nodes and daemons
        1. Key bits about Spark Architecture
      2. Running Spark in standalone mode
        1. Installing Spark standalone on a cluster
        2. Starting a Spark cluster manually
          1. Cluster overview
          2. Workers overview
          3. Running applications and drivers overview
          4. Completed applications and drivers overview
      3. Using the Cluster Launch Scripts to Start a Standalone Cluster
        1. Environment Properties
        2. Connecting Spark-Shell, PySpark, and R-Shell to the cluster
        3. Resource scheduling
      4. Running Spark in YARN
        1. Spark with a Hadoop Distribution (Cloudera)
          1. Interactive Shell
          2. Batch Application
        2. Important YARN Configuration Parameters
      5. Running Spark in Mesos
        1. Before you start
        2. Running in Mesos
        3. Modes of operation in Mesos
          1. Client Mode
          2. Batch Applications
          3. Interactive Applications
        4. Cluster Mode
          1. Steps to use the cluster mode
        5. Mesos run modes
        6. Key Spark on Mesos configuration properties
      6. References
      7. Summary
    16. 9. Building a Recommendation System
      1. What is a recommendation system?
        1. Types of recommendations
          1. Manual recommendations
          2. Simple aggregated recommendations based on Popularity
          3. User-specific recommendations
      2. User specific recommendations
      3. Key issues with recommendation systems
        1. Gathering known input data
        2. Predicting unknown from known ratings
          1. Content-based recommendations
            1. Predicting unknown ratings
            2. Pros and cons of content based recommendations
          2. Collaborative filtering
            1. Jaccard similarity
            2. Cosine similarity
            3. Centered cosine (Pearson Correlation)
          3. Latent factor methods
            1. Evaluating prediction method
      4. Recommendation system in Spark
        1. Sample dataset
        2. How does Spark offer recommendation?
          1. Importing relevant libraries
          2. Defining the schema for ratings
          3. Defining the schema for movies
          4. Loading ratings and movies data
          5. Data partitioning
          6. Training an ALS model
          7. Predicting the test dataset
          8. Evaluating model performance
            1. Using implicit preferences
          9. Sanity checking
          10. Model Deployment
      5. References
      6. Summary
    17. 10. Customer Churn Prediction
      1. Overview of customer churn
      2. Why is predicting customer churn important?
      3. How do we predict customer churn with Spark?
        1. Data set description
        2. Code example
        3. Defining schema
        4. Loading data
        5. Data exploration
          1. PySpark import code
          2. Exploring international minutes
          3. Exploring night minutes
          4. Exploring day minutes
          5. Exploring eve minutes
        6. Comparing minutes data for churners and non-churners
        7. Comparing charge data for churners and non-churners
      4. Exploring customer service calls
        1. Scala code - constructing a scatter plot
          1. Exploring the churn variable
        2. Data transformation
        3. Building a machine learning pipeline
      5. References
      6. Summary
    18. There's More with Spark
      1. Performance tuning
        1. Data serialization
        2. Memory tuning
          1. Execution and storage
          2. Tasks running in parallel
          3. Operators within the same task
          4. Memory management configuration options
          5. Memory tuning key tips
      2. I/O tuning
        1. Data locality
      3. Sizing up your executors
        1. Calculating memory overhead
          1. Setting aside memory/CPU for YARN application master
          2. I/O throughput
          3. Sample calculations
      4. The skew problem
      5. Security configuration in Spark
        1. Kerberos authentication
        2. Shared secrets
          1. Shared secret on YARN
          2. Shared secret on other cluster managers
      6. Setting up Jupyter Notebook with Spark
        1. What is a Jupyter Notebook?
        2. Setting up a Jupyter Notebook
          1. Securing the notebook server
          2. Preparing a hashed password
            1. Using Jupyter (only with version 5.0 and later)
            2. Manually creating hashed password
        3. Setting up PySpark on Jupyter
      7. Shared variables
        1. Broadcast variables
          1. Accumulators
      8. References
      9. Summary