Big Data Analytics

Book Description

A handy reference guide that helps data analysts and data scientists obtain value from big data analytics using Spark on Hadoop clusters.

About This Book

  • This book is based on the latest 2.0 version of Apache Spark and the 2.7 version of Hadoop, integrated with the most commonly used tools.
  • Learn all the Spark stack components, including the latest topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines, and SparkR.
  • Covers integrations with frameworks such as HDFS and YARN, and with tools such as Jupyter, Zeppelin, NiFi, Mahout, the HBase Spark Connector, GraphFrames, H2O, and Hivemall.

Who This Book Is For

Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have a basic programming background in Scala, Python, SQL, or R, along with basic Linux experience. Working experience in big data environments is not mandatory.

What You Will Learn

  • Discover and implement the tools and techniques of big data analytics using Spark on Hadoop clusters, together with a wide variety of tools used with Spark and Hadoop
  • Understand all the Hadoop and Spark ecosystem components
  • Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, Conventional and Structured Streaming, MLlib, ML Pipelines, and GraphX
  • See batch and real-time data analytics using Spark Core, Spark SQL, and Conventional and Structured Streaming
  • Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX, and SparkR

In Detail

This book aims to provide the fundamentals of Apache Spark and Hadoop. All the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Conventional Streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth, with implementation examples on Spark + Hadoop clusters.

The industry is moving away from MapReduce toward Spark, so the advantages of Spark over MapReduce are explained in depth to help you reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building Big Data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help you build streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics are covered with the GraphX and GraphFrames components of Spark.

Readers will also have the opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and with the dataflow tool Apache NiFi, to analyze and visualize data.

Style and approach

This step-by-step, pragmatic guide is written for readers at any level of experience. You will dive deep into Apache Spark on Hadoop clusters through numerous real-life examples. The practical tutorial explains data science in simple terms to help programmers and data analysts get started with data science.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Big Data Analytics
    1. Table of Contents
    2. Big Data Analytics
    3. Credits
    4. About the Author
    5. Acknowledgement
    6. About the Reviewers
    7. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. Big Data Analytics at a 10,000-Foot View
      1. Big Data analytics and the role of Hadoop and Spark
        1. A typical Big Data analytics project life cycle
          1. Identifying the problem and outcomes
          2. Identifying the necessary data
          3. Data collection
          4. Preprocessing data and ETL
          5. Performing analytics
          6. Visualizing data
        2. The role of Hadoop and Spark
      2. Big Data science and the role of Hadoop and Spark
        1. A fundamental shift from data analytics to data science
          1. Data scientists versus software engineers
          2. Data scientists versus data analysts
          3. Data scientists versus business analysts
        2. A typical data science project life cycle
          1. Hypothesis and modeling
          2. Measuring the effectiveness
          3. Making improvements
          4. Communicating the results
        3. The role of Hadoop and Spark
      3. Tools and techniques
      4. Real-life use cases
      5. Summary
    10. 2. Getting Started with Apache Hadoop and Apache Spark
      1. Introducing Apache Hadoop
        1. Hadoop Distributed File System
        2. Features of HDFS
        3. MapReduce
        4. MapReduce features
        5. MapReduce v1 versus MapReduce v2
          1. MapReduce v1 challenges
        6. YARN
        7. Storage options on Hadoop
          1. File formats
            1. Sequence file
            2. Protocol buffers and thrift
            3. Avro
            4. Parquet
            5. RCFile and ORCFile
          2. Compression formats
            1. Standard compression formats
      2. Introducing Apache Spark
        1. Spark history
        2. What is Apache Spark?
        3. What Apache Spark is not
        4. MapReduce issues
        5. Spark's stack
      3. Why Hadoop plus Spark?
        1. Hadoop features
        2. Spark features
          1. Frequently asked questions about Spark
      4. Installing Hadoop plus Spark clusters
      5. Summary
    11. 3. Deep Dive into Apache Spark
      1. Starting Spark daemons
        1. Working with CDH
        2. Working with HDP, MapR, and Spark pre-built packages
      2. Learning Spark core concepts
        1. Ways to work with Spark
          1. Spark Shell
            1. Exploring the Spark Scala shell
          2. Spark applications
            1. Connecting to the Kerberos Security Enabled Spark Cluster
        2. Resilient Distributed Dataset
          1. Method 1 – parallelizing a collection
          2. Method 2 – reading from a file
            1. Reading files from HDFS
            2. Reading files from HDFS with HA enabled
        3. Spark context
        4. Transformations and actions
        5. Parallelism in RDDs
        6. Lazy evaluation
        7. Lineage Graph
        8. Serialization
        9. Leveraging Hadoop file formats in Spark
        10. Data locality
        11. Shared variables
        12. Pair RDDs
      3. Lifecycle of Spark program
        1. Pipelining
        2. Spark execution summary
      4. Spark applications
        1. Spark Shell versus Spark applications
        2. Creating a Spark context
        3. SparkConf
        4. SparkSubmit
        5. Spark Conf precedence order
        6. Important application configurations
      5. Persistence and caching
        1. Storage levels
        2. What level to choose?
      6. Spark resource managers – Standalone, YARN, and Mesos
        1. Local versus cluster mode
        2. Cluster resource managers
          1. Standalone
          2. YARN
            1. Dynamic resource allocation
            2. Client mode versus cluster mode
          3. Mesos
          4. Which resource manager to use?
      7. Summary
    12. 4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
      1. History of Spark SQL
      2. Architecture of Spark SQL
      3. Introducing SQL, Datasources, DataFrame, and Dataset APIs
      4. Evolution of DataFrames and Datasets
        1. What's wrong with RDDs?
        2. RDD Transformations versus Dataset and DataFrames Transformations
      5. Why Datasets and DataFrames?
        1. Optimization
        2. Speed
        3. Automatic Schema Discovery
        4. Multiple sources, multiple languages
        5. Interoperability between RDDs and others
        6. Select and read necessary data only
      6. When to use RDDs, Datasets, and DataFrames?
      7. Analytics with DataFrames
        1. Creating SparkSession
        2. Creating DataFrames
          1. Creating DataFrames from structured data files
          2. Creating DataFrames from RDDs
          3. Creating DataFrames from tables in Hive
          4. Creating DataFrames from external databases
        3. Converting DataFrames to RDDs
        4. Common Dataset/DataFrame operations
          1. Input and Output Operations
          2. Basic Dataset/DataFrame functions
          3. DSL functions
          4. Built-in functions, aggregate functions, and window functions
          5. Actions
          6. RDD operations
        5. Caching data
        6. Performance optimizations
      8. Analytics with the Dataset API
        1. Creating Datasets
        2. Converting a DataFrame to a Dataset
          1. Converting a Dataset to a DataFrame
        3. Accessing metadata using Catalog
      9. Data Sources API
        1. Read and write functions
        2. Built-in sources
          1. Working with text files
          2. Working with JSON
          3. Working with Parquet
          4. Working with ORC
          5. Working with JDBC
          6. Working with CSV
        3. External sources
          1. Working with AVRO
          2. Working with XML
          3. Working with Pandas
          4. DataFrame based Spark-on-HBase connector
      10. Spark SQL as a distributed SQL engine
        1. Spark SQL's Thrift server for JDBC/ODBC access
        2. Querying data using beeline client
        3. Querying data from Hive using spark-sql CLI
        4. Integration with BI tools
      11. Hive on Spark
      12. Summary
    13. 5. Real-Time Analytics with Spark Streaming and Structured Streaming
      1. Introducing real-time processing
        1. Pros and cons of Spark Streaming
        2. History of Spark Streaming
      2. Architecture of Spark Streaming
        1. Spark Streaming application flow
        2. Stateless and stateful stream processing
      3. Spark Streaming transformations and actions
        1. Union
        2. Join
        3. Transform operation
        4. updateStateByKey
        5. mapWithState
        6. Window operations
        7. Output operations
      4. Input sources and output stores
        1. Basic sources
        2. Advanced sources
        3. Custom sources
        4. Receiver reliability
        5. Output stores
      5. Spark Streaming with Kafka and HBase
        1. Receiver-based approach
          1. Role of Zookeeper
        2. Direct approach (no receivers)
        3. Integration with HBase
      6. Advanced concepts of Spark Streaming
        1. Using DataFrames
        2. MLlib operations
        3. Caching/persistence
        4. Fault-tolerance in Spark Streaming
          1. Failure of executor
          2. Failure of driver
            1. Recovering with checkpointing
            2. Recovering with WAL
        5. Performance tuning of Spark Streaming applications
      7. Monitoring applications
      8. Introducing Structured Streaming
        1. Structured Streaming application flow
          1. When to use Structured Streaming?
        2. Streaming Datasets and Streaming DataFrames
          1. Input sources and output sinks
        3. Operations on Streaming Datasets and Streaming DataFrames
      9. Summary
    14. 6. Notebooks and Dataflows with Spark and Hadoop
      1. Introducing web-based notebooks
      2. Introducing Jupyter
        1. Installing Jupyter
        2. Analytics with Jupyter
      3. Introducing Apache Zeppelin
        1. Jupyter versus Zeppelin
        2. Installing Apache Zeppelin
          1. Ambari service
          2. The manual method
        3. Analytics with Zeppelin
      4. The Livy REST job server and Hue Notebooks
        1. Installing and configuring the Livy server and Hue
        2. Using the Livy server
          1. An interactive session
          2. A batch session
          3. Sharing SparkContexts and RDDs
        3. Using Livy with Hue Notebook
        4. Using Livy with Zeppelin
      5. Introducing Apache NiFi for dataflows
        1. Installing Apache NiFi
        2. Dataflows and analytics with NiFi
      6. Summary
    15. 7. Machine Learning with Spark and Hadoop
      1. Introducing machine learning
      2. Machine learning on Spark and Hadoop
      3. Machine learning algorithms
        1. Supervised learning
        2. Unsupervised learning
        3. Recommender systems
        4. Feature extraction and transformation
        5. Optimization
        6. Spark MLlib data types
      4. An example of machine learning algorithms
        1. Logistic regression for spam detection
      5. Building machine learning pipelines
        1. An example of a pipeline workflow
        2. Building an ML pipeline
        3. Saving and loading models
      6. Machine learning with H2O and Spark
        1. Why Sparkling Water?
        2. An application flow on YARN
        3. Getting started with Sparkling Water
      7. Introducing Hivemall
      8. Introducing Hivemall for Spark
      9. Summary
    16. 8. Building Recommendation Systems with Spark and Mahout
      1. Building recommendation systems
        1. Content-based filtering
        2. Collaborative filtering
          1. User-based collaborative filtering
          2. Item-based collaborative filtering
      2. Limitations of a recommendation system
      3. A recommendation system with MLlib
        1. Preparing the environment
        2. Creating RDDs
        3. Exploring the data with DataFrames
        4. Creating training and testing datasets
        5. Creating a model
        6. Making predictions
        7. Evaluating the model with testing data
        8. Checking the accuracy of the model
        9. Explicit versus implicit feedback
      4. The Mahout and Spark integration
        1. Installing Mahout
        2. Exploring the Mahout shell
        3. Building a universal recommendation system with Mahout and search tool
      5. Summary
    17. 9. Graph Analytics with GraphX
      1. Introducing graph processing
        1. What is a graph?
        2. Graph databases versus graph processing systems
        3. Introducing GraphX
        4. Graph algorithms
      2. Getting started with GraphX
        1. Basic operations of GraphX
          1. Creating a graph
          2. Counting
          3. Filtering
          4. inDegrees, outDegrees, and degrees
          5. Triplets
        2. Transforming graphs
          1. Transforming attributes
          2. Modifying graphs
          3. Joining graphs
          4. VertexRDD and EdgeRDD operations
            1. Mapping VertexRDD and EdgeRDD
            2. Filtering VertexRDDs
            3. Joining VertexRDDs
            4. Joining EdgeRDDs
            5. Reversing edge directions
        3. GraphX algorithms
          1. Triangle counting
          2. Connected components
      3. Analyzing flight data using GraphX
        1. Pregel API
      4. Introducing GraphFrames
        1. Motif finding
        2. Loading and saving GraphFrames
      5. Summary
    18. 10. Interactive Analytics with SparkR
      1. Introducing R and SparkR
        1. What is R?
        2. Introducing SparkR
        3. Architecture of SparkR
      2. Getting started with SparkR
        1. Installing and configuring R
        2. Using SparkR shell
          1. Local mode
          2. Standalone mode
          3. Yarn mode
          4. Creating a local DataFrame
          5. Creating a DataFrame from a DataSources API
          6. Creating a DataFrame from Hive
        3. Using SparkR scripts
      3. Using DataFrames with SparkR
      4. Using SparkR with RStudio
      5. Machine learning with SparkR
        1. Using the Naive Bayes model
        2. Using the k-means model
      6. Using SparkR with Zeppelin
      7. Summary
    19. Index