Expert Hadoop® Administration

Book Description

The Comprehensive, Up-to-Date Apache Hadoop Administration Handbook and Reference

“Sam Alapati has worked with production Hadoop clusters for six years. His unique depth of experience has enabled him to write the go-to resource for all administrators looking to spec, size, expand, and secure production Hadoop clusters of any size.” –Paul Dix, Series Editor

In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples.

Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You’ll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run.

  • Understand Hadoop’s architecture from an administrator’s standpoint

  • Create simple and fully distributed clusters

  • Run MapReduce and Spark applications in a Hadoop cluster

  • Manage and protect Hadoop data and high availability

  • Work with HDFS commands, file permissions, and storage management

  • Move data, and use YARN to allocate resources and schedule jobs

  • Manage job workflows with Oozie and Hue

  • Secure, monitor, log, and optimize Hadoop

  • Benchmark and troubleshoot Hadoop

Table of Contents

    1. About This E-Book
    2. Title Page
    3. Copyright Page
    4. Dedication Page
    5. Contents
    6. Foreword
    7. Preface
      1. Who This Book Is For
      2. How This Book Is Structured and What It Covers
    8. Acknowledgments
    9. About the Author
    10. I: Introduction to Hadoop—Architecture and Hadoop Clusters
      1. 1. Introduction to Hadoop and Its Environment
        1. Hadoop—An Introduction
          1. Unique Features of Hadoop
          2. Big Data and Hadoop
          3. A Typical Scenario for Using Hadoop
          4. Traditional Database Systems
          5. Data Lake
          6. Big Data, Data Science and Hadoop
        2. Cluster Computing and Hadoop Clusters
          1. Cluster Computing
          2. Hadoop Clusters
        3. Hadoop Components and the Hadoop Ecosphere
        4. What Do Hadoop Administrators Do?
          1. Hadoop Administration—A New Paradigm
          2. What You Need to Know to Administer Hadoop
          3. The Hadoop Administrator’s Toolset
        5. Key Differences between Hadoop 1 and Hadoop 2
          1. Architectural Differences
          2. High-Availability Features
          3. Multiple Processing Engines
          4. Separation of Processing and Scheduling
          5. Resource Allocation in Hadoop 1 and Hadoop 2
        6. Distributed Data Processing: MapReduce and Spark, Hive and Pig
          1. MapReduce
          2. Apache Spark
          3. Apache Hive
          4. Apache Pig
        7. Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
        8. Key Areas of Hadoop Administration
          1. Managing the Cluster Storage
          2. Allocating the Cluster Resources
          3. Scheduling Jobs
          4. Securing Hadoop Data
        9. Summary
      2. 2. An Introduction to the Architecture of Hadoop
        1. Distributed Computing and Hadoop
        2. Hadoop Architecture
          1. A Hadoop Cluster
          2. Master and Worker Nodes
          3. Hadoop Services
        3. Data Storage—The Hadoop Distributed File System
          1. HDFS Unique Features
          2. HDFS Architecture
          3. The HDFS File System
          4. NameNode Operations
        4. Data Processing with YARN, the Hadoop Operating System
          1. Architecture of YARN
          2. How the ApplicationMaster Works with the ResourceManager to Allocate Resources
        5. Summary
      3. 3. Creating and Configuring a Simple Hadoop Cluster
        1. Hadoop Distributions and Installation Types
          1. Hadoop Distributions
          2. Hadoop Installation Types
        2. Setting Up a Pseudo-Distributed Hadoop Cluster
          1. Meeting the Operating System Requirements
          2. Modifying Kernel Parameters
          3. Setting Up SSH
          4. Java Requirements
          5. Installing the Hadoop Software
          6. Creating the Necessary Hadoop Users
          7. Creating the Necessary Directories
        3. Performing the Initial Hadoop Configuration
          1. Environment Configuration Files
          2. Read-Only Default Configuration Files
          3. Site-Specific Configuration Files
          4. Other Hadoop-Related Configuration Files
          5. Precedence among the Configuration Files
          6. Variable Expansion and Configuration Parameters
          7. Configuring the Hadoop Daemons Environment
          8. Configuring Core Hadoop Properties (with the core-site.xml File)
          9. Configuring MapReduce (with the mapred-site.xml File)
          10. Configuring YARN (with the yarn-site.xml File)
        4. Operating the New Hadoop Cluster
          1. Formatting the Distributed File System
          2. Setting the Environment Variables
          3. Starting the HDFS and YARN Services
          4. Verifying the Service Startup
          5. Shutting Down the Services
        5. Summary
      4. 4. Planning for and Creating a Fully Distributed Cluster
        1. Planning Your Hadoop Cluster
          1. General Cluster Planning Considerations
          2. Server Form Factors
          3. Criteria for Choosing the Nodes
        2. Going from a Single Rack to Multiple Racks
          1. Sizing a Hadoop Cluster
          2. General Principles Governing the Choice of CPU, Memory and Storage
          3. Special Treatment for the Master Nodes
          4. Recommendations for Sizing the Servers
          5. Growing a Cluster
          6. Guidelines for Large Clusters
        3. Creating a Multinode Cluster
          1. How the Test Cluster Is Set Up
        4. Modifying the Hadoop Configuration
          1. Changing the HDFS Configuration (hdfs-site.xml file)
          2. Changing the YARN Configuration
          3. Changing the MapReduce Configuration
        5. Starting Up the Cluster
          1. Starting Up and Shutting Down the Cluster with Scripts
          2. Performing a Quick Check of the New Cluster’s File System
        6. Configuring Hadoop Services, Web Interfaces and Ports
          1. Service Configuration and Web Interfaces
          2. Setting Port Numbers for Hadoop Services
          3. Hadoop Clients
        7. Summary
    11. II: Hadoop Application Frameworks
      1. 5. Running Applications in a Cluster—The MapReduce Framework (and Hive and Pig)
        1. The MapReduce Framework
          1. The MapReduce Model
          2. How MapReduce Works
          3. MapReduce Job Processing
          4. A Simple MapReduce Program
          5. Understanding Hadoop’s Job Processing—Running a WordCount Program
          6. MapReduce Input and Output Directories
          7. How Hadoop Shows You the Job Details
          8. Hadoop Streaming
        2. Apache Hive
          1. Hive Data Organization
          2. Working with Hive Tables
          3. Loading Data into Hive
          4. Querying with Hive
        3. Apache Pig
          1. Pig Execution Modes
          2. A Simple Pig Example
        4. Summary
      2. 6. Running Applications in a Cluster—The Spark Framework
        1. What Is Spark?
        2. Why Spark?
          1. Speed
          2. Ease of Use and Accessibility
          3. General-Purpose Framework
          4. Spark and Hadoop
        3. The Spark Stack
        4. Installing Spark
          1. Spark Examples
          2. Key Spark Files and Directories
          3. Compiling the Spark Binaries
          4. Reducing Spark’s Verbosity
        5. Spark Run Modes
          1. Local Mode
          2. Cluster Mode
        6. Understanding the Cluster Managers
          1. The Standalone Cluster Manager
          2. Spark on Apache Mesos
          3. Spark on YARN
          4. How YARN and Spark Work Together
          5. Setting Up Spark on a Hadoop Cluster
        7. Spark and Data Access
          1. Loading Data from the Linux File System
          2. Loading Data from HDFS
          3. Loading Data from a Relational Database
        8. Summary
      3. 7. Running Spark Applications
        1. The Spark Programming Model
          1. Spark Programming and RDDs
          2. Programming Spark
        2. Spark Applications
          1. Basics of RDDs
          2. Creating an RDD
          3. RDD Operations
          4. RDD Persistence
        3. Architecture of a Spark Application
          1. Spark Terminology
          2. Components of a Spark Application
        4. Running Spark Applications Interactively
          1. Spark Shell and Spark Applications
          2. A Bit about the Spark Shell
          3. Using the Spark Shell
          4. Overview of Spark Cluster Execution
        5. Creating and Submitting Spark Applications
          1. Building the Spark Application
          2. Running an Application in the Standalone Spark Cluster
          3. Using spark-submit to Execute Applications
          4. Running Spark Applications on Mesos
          5. Running Spark Applications in a YARN-Managed Hadoop Cluster
          6. Using the JDBC/ODBC Server
        6. Configuring Spark Applications
          1. Spark Configuration Properties
          2. Specifying Configuration when Running spark-submit
        7. Monitoring Spark Applications
        8. Handling Streaming Data with Spark Streaming
          1. How Spark Streaming Works
          2. A Spark Streaming Example—WordCount Again!
        9. Using Spark SQL for Handling Structured Data
          1. DataFrames
          2. HiveContext and SQLContext
          3. Working with Spark SQL
          4. Creating DataFrames
        10. Summary
    12. III: Managing and Protecting Hadoop Data and High Availability
      1. 8. The Role of the NameNode and How HDFS Works
        1. HDFS—The Interaction between the NameNode and the DataNodes
          1. Interaction between the Clients and HDFS
          2. NameNode and DataNode Communications
        2. Rack Awareness and Topology
          1. How to Configure Rack Awareness in Your Cluster
          2. Finding Your Cluster’s Rack Information
        3. HDFS Data Replication
          1. HDFS Data Organization and Data Blocks
          2. Data Replication
          3. Block and Replica States
        4. How Clients Read and Write HDFS Data
          1. How Clients Read HDFS Data
          2. How Clients Write Data to HDFS
        5. Understanding HDFS Recovery Processes
          1. Generation Stamp
          2. Lease Recovery
          3. Block Recovery
          4. Pipeline Recovery
        6. Centralized Cache Management in HDFS
          1. Hadoop and OS Page Caching
          2. The Key Principles Behind Centralized Cache Management
          3. How Centralized Cache Management Works
          4. Configuring Caching
          5. Cache Directives
          6. Cache Pools
          7. Using the Cache
        7. Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage)
          1. Performance Characteristics of Storage Types
          2. The Need for Heterogeneous HDFS Storage
          3. Changes in the Storage Architecture
          4. Storage Preferences for Files
          5. Setting Up Archival Storage
          6. Managing Storage Policies
          7. Moving Data Around
          8. Implementing Archival Storage
        8. Summary
      2. 9. HDFS Commands, HDFS Permissions and HDFS Storage
        1. Managing HDFS through the HDFS Shell Commands
          1. Using the hdfs dfs Utility to Manage HDFS
          2. Listing HDFS Files and Directories
          3. Creating an HDFS Directory
          4. Removing HDFS Files and Directories
          5. Changing File and Directory Ownership and Groups
        2. Using the dfsadmin Utility to Perform HDFS Operations
          1. The dfsadmin –report Command
        3. Managing HDFS Permissions and Users
          1. HDFS File Permissions
          2. HDFS Users and Super Users
        4. Managing HDFS Storage
          1. Checking HDFS Disk Usage
          2. Allocating HDFS Space Quotas
        5. Rebalancing HDFS Data
          1. Reasons for HDFS Data Imbalance
          2. Running the Balancer Tool to Balance HDFS Data
          3. Using hdfs dfsadmin to Make Things Easier
          4. When to Run the Balancer
        6. Reclaiming HDFS Space
          1. Removing Files and Directories
          2. Decreasing the Replication Factor
        7. Summary
      3. 10. Data Protection, File Formats and Accessing HDFS
        1. Safeguarding Data
          1. Using HDFS Trash to Prevent Accidental Data Deletion
          2. Using HDFS Snapshots to Protect Important Data
          3. Ensuring Data Integrity with File System Checks
        2. Data Compression
          1. Common Compression Formats
          2. Evaluating the Various Compression Schemes
          3. Compression at Various Stages for MapReduce
          4. Compression for Spark
          5. Data Serialization
        3. Hadoop File Formats
          1. Criteria for Determining the Right File Format
          2. File Formats Supported by Hadoop
          3. The Ideal File Format
          4. The Hadoop Small Files Problem and Merging Files
          5. Using a Federated NameNode to Overcome the Small Files Problem
          6. Using Hadoop Archives to Manage Many Small Files
          7. Handling the Performance Impact of Small Files
        4. Using Hadoop WebHDFS and HttpFS
          1. WebHDFS—The Hadoop REST API
          2. Using the WebHDFS API
          3. Understanding the WebHDFS Commands
          4. Using HttpFS Gateway to Access HDFS from Behind a Firewall
        5. Summary
      4. 11. NameNode Operations, High Availability and Federation
        1. Understanding NameNode Operations
          1. HDFS Metadata
          2. The NameNode Startup Process
          3. How the NameNode and the DataNodes Work Together
        2. The Checkpointing Process
          1. Secondary, Checkpoint, Backup and Standby Nodes
          2. Configuring the Checkpointing Frequency
          3. Managing Checkpoint Performance
          4. The Mechanics of Checkpointing
        3. NameNode Safe Mode Operations
          1. Automatic Safe Mode Operations
          2. Placing the NameNode in Safe Mode
          3. How the NameNode Transitions Through Safe Mode
          4. Backing Up and Recovering the NameNode Metadata
        4. Configuring HDFS High Availability
          1. NameNode HA Architecture (QJM)
          2. Setting Up an HDFS HA Quorum Cluster
          3. Deploying the High-Availability NameNodes
          4. Managing an HA NameNode Setup
          5. HA Manual and Automatic Failover
        5. HDFS Federation
          1. Architecture of a Federated NameNode
        6. Summary
    13. IV: Moving Data, Allocating Resources, Scheduling Jobs and Security
      1. 12. Moving Data Into and Out of Hadoop
        1. Introduction to Hadoop Data Transfer Tools
        2. Loading Data into HDFS from the Command Line
          1. Using the -cat Command to Dump a File’s Contents
          2. Testing HDFS Files
          3. Copying and Moving Files from and to HDFS
          4. Using the -get Command to Move Files
          5. Moving Files from and to HDFS
          6. Using the -tail and head Commands
        3. Copying HDFS Data between Clusters with DistCp
          1. How to Use the DistCp Command to Move Data
          2. DistCp Options
        4. Ingesting Data from Relational Databases with Sqoop
          1. Sqoop Architecture
          2. Deploying Sqoop
          3. Using Sqoop to Move Data
          4. Importing Data with Sqoop
          5. Importing Data into Hive
          6. Exporting Data with Sqoop
        5. Ingesting Data from External Sources with Flume
          1. Flume Architecture in a Nutshell
          2. Configuring the Flume Agent
          3. A Simple Flume Example
          4. Using Flume to Move Data to HDFS
          5. A More Complex Flume Example
        6. Ingesting Data with Kafka
          1. Benefits Offered by Kafka
          2. How Kafka Works
          3. Setting Up an Apache Kafka Cluster
          4. Integrating Kafka with Hadoop and Storm
        7. Summary
      2. 13. Resource Allocation in a Hadoop Cluster
        1. Resource Allocation in Hadoop
          1. Managing Cluster Workloads
          2. Hadoop’s Resource Schedulers
        2. The FIFO Scheduler
        3. The Capacity Scheduler
          1. Queues and Subqueues
          2. How the Cluster Allocates Resources
          3. Preempting Applications
          4. Enabling the Capacity Scheduler
          5. A Typical Capacity Scheduler
        4. The Fair Scheduler
          1. Queues
          2. Configuring the Fair Scheduler
          3. How Jobs Are Placed into Queues
          4. Application Preemption in the Fair Scheduler
          5. Security and Resource Pools
          6. A Sample fair-scheduler.xml File
          7. Submitting Jobs to the Scheduler
          8. Moving Applications between Queues
          9. Monitoring the Fair Scheduler
        5. Comparing the Capacity Scheduler and the Fair Scheduler
          1. Similarities between the Two Schedulers
          2. Differences between the Two Schedulers
        6. Summary
      3. 14. Working with Oozie to Manage Job Workflows
        1. Using Apache Oozie to Schedule Jobs
        2. Oozie Architecture
          1. The Oozie Server
          2. The Oozie Client
          3. The Oozie Database
        3. Deploying Oozie in Your Cluster
          1. Installing and Configuring Oozie
          2. Configuring Hadoop for Oozie
        4. Understanding Oozie Workflows
          1. Workflows, Control Flow, and Nodes
          2. Defining the Workflows with the workflow.xml File
        5. How Oozie Runs an Action
          1. Configuring the Action Nodes
        6. Creating an Oozie Workflow
          1. Configuring the Control Nodes
          2. Configuring the Job
        7. Running an Oozie Workflow Job
          1. Specifying the Job Properties
          2. Deploying Oozie Jobs
          3. Creating Dynamic Workflows
        8. Oozie Coordinators
          1. Time-Based Coordinators
          2. Data-Based Coordinators
          3. Time-and-Data-Based Coordinators
          4. Submitting the Oozie Coordinator from the Command Line
        9. Managing and Administering Oozie
          1. Common Oozie Commands and How to Run Them
          2. Troubleshooting Oozie
          3. Oozie cron Scheduling and Oozie Service Level Agreements
        10. Summary
      4. 15. Securing Hadoop
        1. Hadoop Security—An Overview
          1. Authentication, Authorization and Accounting
        2. Hadoop Authentication with Kerberos
          1. Kerberos and How It Works
          2. The Kerberos Authentication Process
          3. Kerberos Trusts
          4. A Special Principal
          5. Adding Kerberos Authorization to your Cluster
          6. Setting Up Kerberos for Hadoop
          7. Securing a Hadoop Cluster with Kerberos
          8. How Kerberos Authenticates Users and Services
          9. Managing a Kerberized Hadoop Cluster
        3. Hadoop Authorization
          1. HDFS Permissions
          2. Service Level Authorization
          3. Role-Based Authorization with Apache Sentry
        4. Auditing Hadoop
          1. Auditing HDFS Operations
          2. Auditing YARN Operations
        5. Securing Hadoop Data
          1. HDFS Transparent Encryption
          2. Encrypting Data in Transition
        6. Other Hadoop-Related Security Initiatives
          1. Securing a Hadoop Infrastructure with Apache Knox Gateway
          2. Apache Ranger for Security Administration
        7. Summary
    14. V: Monitoring, Optimization and Troubleshooting
      1. 16. Managing Jobs, Using Hue and Performing Routine Tasks
        1. Using the YARN Commands to Manage Hadoop Jobs
          1. Viewing YARN Applications
          2. Checking the Status of an Application
          3. Killing a Running Application
          4. Checking the Status of the Nodes
          5. Checking YARN Queues
          6. Getting the Application Logs
          7. Yarn Administrative Commands
        2. Decommissioning and Recommissioning Nodes
          1. Including and Excluding Hosts
          2. Decommissioning DataNodes and NodeManagers
          3. Recommissioning Nodes
          4. Things to Remember about Decommissioning and Recommissioning
          5. Adding a New DataNode and/or a NodeManager
        3. ResourceManager High Availability
          1. ResourceManager High-Availability Architecture
          2. Setting Up ResourceManager High Availability
          3. ResourceManager Failover
          4. Using the ResourceManager High-Availability Commands
        4. Performing Common Management Tasks
          1. Moving the NameNode to a Different Host
          2. Managing High-Availability NameNodes
          3. Using a Shutdown/Startup Script to Manage your Cluster
          4. Balancing HDFS
          5. Balancing the Storage on the DataNodes
        5. Managing the MySQL Database
          1. Configuring a MySQL Database
          2. Configuring MySQL High Availability
        6. Backing Up Important Cluster Data
          1. Backing Up HDFS Metadata
          2. Backing Up the Metastore Databases
        7. Using Hue to Administer Your Cluster
          1. Allowing Your Users to Use Hue
          2. Installing Hue
          3. Configuring Your Cluster to Work with Hue
          4. Managing Hue
          5. Working with Hue
        8. Implementing Specialized HDFS Features
          1. Deploying HDFS and YARN in a Multihomed Network
          2. Short-Circuit Local Reads
          3. Mountable HDFS
          4. Using an NFS Gateway for Mounting HDFS to a Local File System
        9. Summary
      2. 17. Monitoring, Metrics and Hadoop Logging
        1. Monitoring Linux Servers
          1. Basics of Linux System Monitoring
          2. Monitoring Tools for Linux Systems
        2. Hadoop Metrics
          1. Hadoop Metric Types
          2. Using the Hadoop Metrics
          3. Capturing Metrics to a File System
        3. Using Ganglia for Monitoring
          1. Ganglia Architecture
          2. Setting Up the Ganglia and Hadoop Integration
          3. Setting Up the Hadoop Metrics
        4. Understanding Hadoop Logging
          1. Hadoop Log Messages
          2. Daemon and Application Logs and How to View Them
          3. How Application Logging Works
          4. How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run
          5. How the NodeManager Uses the Local Directories
          6. Storing Job Logs in HDFS through Log Aggregation
          7. Working with the Hadoop Daemon Logs
        5. Using Hadoop’s Web UIs for Monitoring
          1. Monitoring Jobs with the ResourceManager Web UI
          2. The JobHistoryServer Web UI
          3. Monitoring with the NameNode Web UI
        6. Monitoring Other Hadoop Components
          1. Monitoring Hive
          2. Monitoring Spark
        7. Summary
      3. 18. Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking
        1. How to Allocate YARN Memory and CPU
          1. Allocating Memory
          2. Configuring the Number of CPU Cores
          3. Relationship between Memory and CPU Vcores
        2. Configuring Efficient Performance
          1. Speculative Execution
          2. Reducing the I/O Load on the System
        3. Tuning Map and Reduce Tasks—What the Administrator Can Do
          1. Tuning the Map Tasks
          2. Input and Output
          3. Tuning the Reduce Tasks
          4. Tuning the MapReduce Shuffle Process
        4. Optimizing Pig and Hive Jobs
          1. Optimizing Hive Jobs
          2. Optimizing Pig Jobs
        5. Benchmarking Your Cluster
          1. Using TestDFSIO for Testing I/O Performance
          2. Benchmarking with TeraSort
          3. Using Hadoop’s Rumen and GridMix for Benchmarking
        6. Hadoop Counters
          1. File System Counters
          2. Job Counters
          3. MapReduce Framework Counters
          4. Custom Java Counters
          5. Limiting the Number of Counters
        7. Optimizing MapReduce
          1. Map-Only versus Map and Reduce Jobs
          2. How Combiners Improve MapReduce Performance
          3. Using a Partitioner to Improve Performance
          4. Compressing Data During the MapReduce Process
          5. Too Many Mappers or Reducers?
        8. Summary
      4. 19. Configuring and Tuning Apache Spark on YARN
        1. Configuring Resource Allocation for Spark on YARN
          1. Allocating CPU
          2. Allocating Memory
          3. How Resources are Allocated to Spark
          4. Limits on the Resource Allocation to Spark Applications
          5. Allocating Resources to the Driver
          6. Configuring Resources for the Executors
          7. How Spark Uses its Memory
          8. Things to Remember
          9. Cluster or Client Mode?
          10. Configuring Spark-Related Network Parameters
        2. Dynamic Resource Allocation when Running Spark on YARN
          1. Dynamic and Static Resource Allocation
          2. How Spark Manages Dynamic Resource Allocation
          3. Enabling Dynamic Resource Allocation
        3. Storage Formats and Compressing Data
          1. Storage Formats
          2. File Sizes
          3. Compression
        4. Monitoring Spark Applications
          1. Using the Spark Web UI to Understand Performance
          2. Spark System and the Metrics REST API
          3. The Spark History Server on YARN
          4. Tracking Jobs from the Command Line
        5. Tuning Garbage Collection
          1. The Mechanics of Garbage Collection
          2. How to Collect GC Statistics
        6. Tuning Spark Streaming Applications
          1. Reducing Batch Processing Time
          2. Setting the Right Batch Interval
          3. Tuning Memory and Garbage Collection
        7. Summary
      5. 20. Optimizing Spark Applications
        1. Revisiting the Spark Execution Model
          1. The Spark Execution Model
        2. Shuffle Operations and How to Minimize Them
          1. A WordCount Example to Our Rescue Again
          2. Impact of a Shuffle Operation
          3. Configuring the Shuffle Parameters
        3. Partitioning and Parallelism (Number of Tasks)
          1. Level of Parallelism
          2. Problems with Too Few Tasks
          3. Setting the Default Number of Partitions
          4. How to Increase the Number of Partitions
          5. Using the Repartition and Coalesce Operators to Change the Number of Partitions in an RDD
          6. Two Types of Partitioners
          7. Data Partitioning and How It Can Avoid a Shuffle
        4. Optimizing Data Serialization and Compression
          1. Data Serialization
          2. Configuring Compression
        5. Understanding Spark’s SQL Query Optimizer
          1. Understanding the Optimizer Steps
          2. Spark’s Speculative Execution Feature
          3. The Importance of Data Locality
        6. Caching Data
          1. Fault-Tolerance Due to Caching
          2. How to Specify Caching
        7. Summary
      6. 21. Troubleshooting Hadoop—A Sampler
        1. Space-Related Issues
          1. Dealing with a 100 Percent Full Linux File System
          2. HDFS Space Issues
          3. Local and Log Directories Out of Free Space
          4. Disk Volume Failure Toleration
        2. Handling YARN Jobs That Are Stuck
        3. JVM Memory-Allocation and Garbage-Collection Strategies
          1. Understanding JVM Garbage Collection
          2. Optimizing Garbage Collection
          3. Analyzing Memory Usage
          4. Out of Memory Errors
          5. ApplicationMaster Memory Issues
        4. Handling Different Types of Failures
          1. Handling Daemon Failures
          2. Starting Failures for Hadoop Daemons
          3. Task and Job Failures
        5. Troubleshooting Spark Jobs
          1. Spark’s Fault Tolerance Mechanism
          2. Killing Spark Jobs
          3. Maximum Attempts for a Job
          4. Maximum Failures per Job
        6. Debugging Spark Applications
          1. Viewing Logs with Log Aggregation
          2. Viewing Logs When Log Aggregation Is Not Enabled
          3. Reviewing the Launch Environment
        7. Summary
    15. A. Installing VirtualBox and Linux and Cloning the Virtual Machines
      1. Installing Oracle VirtualBox
      2. Installing Oracle Enterprise Linux
      3. Cloning the Linux Server
    16. Index
    17. Code Snippets