Hadoop 2.x Administration Cookbook

Book Description

Over 100 practical recipes to help you become an expert Hadoop administrator

About This Book

  • Become an expert Hadoop administrator and perform tasks to optimize your Hadoop cluster
  • Import data into and export data from Hive, and use Oozie to manage workflows
  • Practical recipes will help you plan and secure your Hadoop cluster, and make it highly available

Who This Book Is For

If you are a system administrator with a basic understanding of Hadoop and you want to get into Hadoop administration, this book is for you. It’s also ideal if you are a Hadoop administrator who wants a quick reference guide to Hadoop administration tasks and solutions to commonly occurring problems.

What You Will Learn

  • Set up the Hadoop architecture to run a Hadoop cluster smoothly
  • Maintain a Hadoop cluster on HDFS, YARN, and MapReduce
  • Understand high availability with ZooKeeper and Journal Node
  • Configure Flume for data ingestion and Oozie to run various workflows
  • Tune the Hadoop cluster for optimal performance
  • Schedule jobs on a Hadoop cluster using the Fair and Capacity schedulers
  • Secure your cluster and troubleshoot it for various common pain points

In Detail

Hadoop enables the distributed storage and processing of large datasets across clusters of computers. Learning how to administer Hadoop is crucial to exploiting its unique features. With this book, you will be able to overcome common problems encountered in Hadoop administration.

The book begins by laying the foundation, showing you the steps needed to set up a Hadoop cluster and its various nodes. You will get a better understanding of how to maintain a Hadoop cluster, especially on the HDFS layer and when using YARN and MapReduce. Further on, you will explore durability and high availability of a Hadoop cluster.

You’ll get a better understanding of the schedulers in Hadoop and how to configure and use them for your tasks. You will also get hands-on experience with the backup and recovery options and the performance tuning aspects of Hadoop. Finally, you will get a better understanding of troubleshooting, diagnostics, and best practices in Hadoop administration.

By the end of this book, you will have a proper understanding of working with Hadoop clusters and will also be able to secure and encrypt them and configure auditing for them.

Style and approach

This book contains short recipes that will help you run a Hadoop cluster efficiently. The recipes are solutions to real-life problems that administrators encounter while working with a Hadoop cluster.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Table of Contents

  1. Hadoop 2.x Administration Cookbook
    1. Table of Contents
    2. Hadoop 2.x Administration Cookbook
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    7. Customer Feedback
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Sections
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Conventions
      6. Reader feedback
      7. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. Hadoop Architecture and Deployment
      1. Introduction
        1. Overview of Hadoop Architecture
      2. Building and compiling Hadoop
        1. Getting ready
        2. How to do it...
        3. How it works...
      3. Installation methods
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. Setting up host resolution
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Installing a single-node cluster - HDFS components
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
          1. Setting up ResourceManager and NodeManager
      6. Installing a single-node cluster - YARN components
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      7. Installing a multi-node cluster
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Configuring the Hadoop Gateway node
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      9. Decommissioning nodes
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      10. Adding nodes to the cluster
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    10. 2. Maintaining Hadoop Cluster HDFS
      1. Introduction
        1. Overview of HDFS
      2. Configuring HDFS block size
        1. Getting ready
        2. How to do it...
        3. How it works...
      3. Setting up Namenode metadata location
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. Loading data in HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Configuring HDFS replication
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. HDFS balancer
        1. Getting ready
        2. How to do it...
        3. How it works...
      7. Quota configuration
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. HDFS health and FSCK
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      9. Configuring rack awareness
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      10. Recycle or trash bin configuration
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      11. Distcp usage
        1. Getting ready
        2. How to do it...
        3. How it works...
      12. Control block report storm
        1. Getting ready
        2. How to do it...
        3. How it works...
      13. Configuring Datanode heartbeat
        1. Getting ready
        2. How to do it...
        3. How it works...
    11. 3. Maintaining Hadoop Cluster – YARN and MapReduce
      1. Introduction
      2. Running a simple MapReduce program
        1. Getting ready
        2. How to do it...
      3. Hadoop streaming
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. Configuring YARN history server
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      5. Job history web interface and metrics
        1. Getting ready
        2. How to do it...
        3. How it works...
      6. Configuring ResourceManager components
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      7. YARN containers and resource allocations
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
        5. See also
      8. ResourceManager Web UI and JMX metrics
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. Preserving ResourceManager states
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
    12. 4. High Availability
      1. Introduction
      2. Namenode HA using shared storage
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      3. ZooKeeper configuration
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. Namenode HA using Journal node
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Resourcemanager HA using ZooKeeper
        1. Getting ready
        2. How to do it...
        3. How it works...
      6. Rolling upgrade with HA
        1. Getting ready
        2. How to do it...
        3. How it works...
      7. Configure shared cache manager
        1. Getting ready
        2. How to do it...
        3. There's more...
        4. See also
      8. Configure HDFS cache
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      9. HDFS snapshots
        1. Getting ready
        2. How to do it...
        3. How it works...
      10. Configuring storage based policies
        1. Getting ready
        2. How to do it...
        3. How it works...
      11. Configuring HA for Edge nodes
        1. Getting ready
        2. How to do it...
        3. How it works...
    13. 5. Schedulers
      1. Introduction
      2. Configuring users and groups
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      3. Fair Scheduler configuration
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. Fair Scheduler pools
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Configuring job queues
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Job queue ACLs
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Configuring Capacity Scheduler
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. Queuing mappings in Capacity Scheduler
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. YARN and Mapred commands
        1. Getting ready
        2. How to do it...
        3. How it works...
      10. YARN label-based scheduling
        1. Getting ready
        2. How to do it...
        3. How it works...
      11. YARN SLS
        1. Getting ready
        2. How to do it...
        3. How it works...
    14. 6. Backup and Recovery
      1. Introduction
      2. Initiating Namenode saveNamespace
        1. Getting ready
        2. How to do it...
        3. How it works...
      3. Using HDFS Image Viewer
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. Fetching parameters which are in-effect
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Configuring HDFS and YARN logs
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Backing up and recovering Namenode
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Configuring Secondary Namenode
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Promoting Secondary Namenode to Primary
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      9. Namenode recovery
        1. Getting ready
        2. How to do it...
        3. How it works...
      10. Namenode roll edits – online mode
        1. Getting ready
        2. How to do it...
        3. How it works...
      11. Namenode roll edits – offline mode
        1. Getting ready
        2. How to do it...
        3. How it works...
      12. Datanode recovery – disk full
        1. Getting ready
        2. How to do it...
        3. How it works...
      13. Configuring NFS gateway to serve HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
      14. Recovering deleted files
        1. Getting ready
        2. How to do it...
        3. How it works...
    15. 7. Data Ingestion and Workflow
      1. Introduction
      2. Hive server modes and setup
        1. Getting ready
        2. How to do it...
        3. How it works...
      3. Using MySQL for Hive metastore
        1. How to do it...
        2. How it works...
      4. Operating Hive with ZooKeeper
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Loading data into Hive
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Partitioning and Bucketing in Hive
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Hive metastore database
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. Designing Hive with credential store
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. Configuring Flume
        1. Getting ready
        2. How to do it...
        3. How it works...
      10. Configure Oozie and workflows
        1. Getting ready
        2. How to do it...
        3. How it works...
    16. 8. Performance Tuning
      1. Tuning the operating system
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      2. Tuning the disk
        1. Getting ready
        2. How to do it...
        3. How it works...
      3. Tuning the network
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. Tuning HDFS
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Tuning Namenode
        1. Getting ready
        2. How to do it...
        3. There's more...
        4. See also
      6. Tuning Datanode
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. Configuring YARN for performance
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Configuring MapReduce for performance
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. Hive performance tuning
        1. Getting ready
        2. How to do it...
        3. There's more...
        4. How it works...
      10. Benchmarking Hadoop cluster
        1. Getting ready
        2. How to do it...
          1. Benchmark 1--Testing HDFS with TestDFSIO
          2. Benchmark 2--Stress testing Namenode
          3. Benchmark 3--MapReduce testing by generating small files
          4. Benchmark 4--TeraGen, TeraSort, and TeraValidate benchmarks
        3. There's more...
        4. How it works...
    17. 9. HBase Administration
      1. Introduction
      2. Setting up single node HBase cluster
        1. Getting ready
        2. How to do it...
        3. How it works...
      3. Setting up multi-node HBase cluster
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. Inserting data into HBase
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Integration with Hive
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. HBase administration commands
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      7. HBase backup and restore
        1. Getting ready
        2. How to do it...
        3. How it works...
      8. Tuning HBase
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. HBase upgrade
        1. Getting ready
        2. How to do it...
        3. How it works...
      10. Migrating data from MySQL to HBase using Sqoop
        1. Getting ready
        2. How to do it...
    18. 10. Cluster Planning
      1. Introduction
      2. Disk space calculations
        1. Getting ready
        2. How to do it...
        3. How it works...
      3. Nodes needed in the cluster
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      4. Memory requirements
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      5. Sizing the cluster as per SLA
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Network design
        1. Getting ready
        2. How to do it...
        3. How it works...
      7. Estimating the cost of the Hadoop cluster
        1. How to do it...
        2. How it works...
      8. Hardware and software options
        1. How it works...
    19. 11. Troubleshooting, Diagnostics, and Best Practices
      1. Introduction
      2. Namenode troubleshooting
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      3. Datanode troubleshooting
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      4. Resourcemanager troubleshooting
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      5. Diagnose communication issues
        1. Getting ready
        2. How to do it...
        3. How it works...
      6. Parse logs for errors
        1. Getting ready
        2. How to do it...
        3. How it works...
      7. Hive troubleshooting
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. HBase troubleshooting
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. Hadoop best practices
        1. How it works...
    20. 12. Security
      1. Introduction
      2. Encrypting disk using LUKS
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      3. Configuring Hadoop users
        1. Getting ready
        2. How to do it...
        3. How it works...
      4. HDFS encryption at Rest
        1. Getting ready
        2. How to do it...
        3. How it works...
      5. Configuring SSL in Hadoop
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. In-transit encryption
        1. Getting ready
        2. How to do it...
        3. There's more...
        4. See also
      7. Enabling service level authorization
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      8. Securing ZooKeeper
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. Configuring auditing
        1. Getting ready
        2. How to do it...
        3. How it works...
      10. Configuring Kerberos server
        1. Getting ready
        2. How to do it...
        3. How it works...
      11. Configuring and enabling Kerberos for Hadoop
        1. Getting ready
        2. How to do it...
        3. How it works...
    21. Index