Cloudera Administration Handbook

Book description

A complete, hands-on guide to building and maintaining large Apache Hadoop clusters using Cloudera Manager and CDH5

In Detail

Apache Hadoop is an open source distributed computing technology that assists users in processing large volumes of data with relative ease, helping them to generate tremendous insights into their data. Cloudera, with its open source distribution of Hadoop, has made data analytics on big data possible and accessible to anyone interested.

This book fully prepares you to be a Hadoop administrator, with special emphasis on Cloudera's CDH. It provides step-by-step instructions on setting up and managing a robust Hadoop cluster running CDH5. This book will also equip you with an understanding of tools such as Cloudera Manager, which is currently being used by many companies to manage Hadoop clusters with hundreds of nodes. You will learn how to set up security using Kerberos. You will also use Cloudera Manager to set up alerts and events that will help you monitor and troubleshoot cluster issues.

What You Will Learn

  • Understand the Apache Hadoop architecture and the future of distributed processing frameworks
  • Use HDFS and MapReduce for all file-related operations
  • Install and configure CDH to bring up an Apache Hadoop cluster
  • Configure HDFS High Availability and HDFS Federation to prevent single points of failure
  • Install and configure Cloudera Manager to perform administrator operations
  • Implement security by installing and configuring Kerberos for all services in the cluster
  • Add, remove, and rebalance nodes in a cluster using cluster management tools
  • Understand and configure the different backup options to back up your HDFS

Table of contents

  1. Cloudera Administration Handbook
    1. Table of Contents
    2. Cloudera Administration Handbook
    3. Credits
    4. Notice
    5. About the Author
    6. About the Reviewers
    7. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    9. 1. Getting Started with Apache Hadoop
      1. History of Apache Hadoop and its trends
      2. Components of Apache Hadoop
      3. Understanding the Apache Hadoop daemons
        1. Namenode
        2. Secondary namenode
        3. Jobtracker
        4. Tasktracker
        5. ResourceManager
        6. NodeManager
        7. Job submission in YARN
      4. Introducing Cloudera
      5. Introducing CDH
      6. Responsibilities of a Hadoop administrator
      7. Summary
    10. 2. HDFS and MapReduce
      1. Essentials of HDFS
        1. Configuring HDFS
      2. The read/write operational flow in HDFS
        1. Writing files in HDFS
        2. Reading files in HDFS
      3. Understanding the namenode UI
      4. Understanding the secondary namenode UI
      5. Exploring HDFS commands
        1. Commonly used HDFS commands
        2. Commands to administer HDFS
      6. Getting acquainted with MapReduce
        1. Understanding the map phase
        2. Understanding the reduce phase
        3. Learning all about the MapReduce job flow
          1. Configuring MapReduce
        4. Understanding the jobtracker UI
        5. Getting MapReduce job information
      7. Summary
    11. 3. Cloudera's Distribution Including Apache Hadoop
      1. Getting started with CDH
      2. Understanding the CDH components
        1. Apache Hadoop
        2. Apache Flume NG
        3. Apache Sqoop
        4. Apache Pig
        5. Apache Hive
        6. Apache ZooKeeper
        7. Apache HBase
        8. Apache Whirr
        9. Snappy – previously known as Zippy
        10. Apache Mahout
        11. Apache Avro
        12. Apache Oozie
        13. Cloudera Search
        14. Cloudera Impala
        15. Cloudera Hue
          1. Beeswax – Hive UI
          2. Cloudera Impala UI
          3. Pig UI
          4. File Browser
          5. Metastore Manager
          6. Sqoop Jobs
          7. Job Browser
          8. Job Designs
          9. Dashboard
          10. Collection Manager
          11. Hue Shell
          12. HBase Browser
      3. Installing CDH
        1. Stopping Hadoop services
        2. Understanding a YARN cluster
      4. Installing the CDH components
        1. Installing Apache Flume
        2. Installing Apache Sqoop
        3. Installing Apache Sqoop 2
        4. Installing Apache Pig
        5. Installing Apache Hive
        6. Installing Apache Oozie
        7. Installing Apache ZooKeeper
      5. Summary
    12. 4. Exploring HDFS Federation and Its High Availability
      1. Implementing HDFS Federation
        1. Configuring HDFS Federation
          1. Configuring ViewFS for a federated HDFS
      2. Implementing HDFS High Availability
        1. The Quorum-based storage
          1. Configuring HDFS high availability by the Quorum-based storage
        2. Shared storage using NFS
          1. Configuring HDFS high availability by shared storage using NFS
            1. NameNode Journal Status for Quorum-based storage approach
            2. NameNode Journal Status for the Shared Storage-based approach
        3. Configuring automatic failover for HDFS high availability
      3. Jobtracker high availability
        1. Configuring jobtracker high availability
        2. Configuring automatic failover for jobtracker high availability
      4. Summary
    13. 5. Using Cloudera Manager
      1. Introducing Cloudera Manager
      2. Understanding the Cloudera Manager architecture
      3. Installing Cloudera Manager
      4. Navigating the Cloudera Manager Web console
        1. Navigating the Home screen
        2. Navigating the Clusters menu
        3. Exploring the Hosts menu
        4. Understanding the Diagnostics menu
        5. Understanding the Audits screen
        6. Understanding the Charts menu
        7. Understanding the Backup menu
        8. Understanding the Administration menu
      5. Configuring High Availability using Cloudera Manager
      6. Summary
    14. 6. Implementing Security Using Kerberos
      1. Understanding authentication and authorization
      2. Introducing Kerberos
      3. Understanding the Kerberos Architecture
        1. Authenticating a user
        2. Accessing a secure file server
        3. Understanding important Kerberos terms
      4. Installing Kerberos
        1. Configuring the KDC Server
        2. Testing the KDC installation
        3. Configuring the Kerberos clients
      5. Configuring Kerberos for Apache Hadoop
        1. Configuring Kerberos principal for Cloudera Manager Server
        2. Configuring the Cloudera Manager Server for Kerberos
      6. Authorization in Apache Hadoop
        1. Configuring access control lists in Hadoop
      7. Summary
    15. 7. Managing an Apache Hadoop Cluster
      1. Configuring Hadoop services using Cloudera Manager
        1. Adding a service to the cluster
        2. Removing a service from the cluster
      2. Role management in Cloudera Manager
        1. Adding a role instance to a host
          1. Adding a DataNode role to a host
          2. Adding a TaskTracker role to a host
      3. Managing hosts using Cloudera Manager
        1. Adding a new host
        2. Removing an existing host
      4. Managing multiple clusters with Cloudera Manager
      5. Rebalancing a Hadoop cluster from Cloudera Manager
        1. Adding the Balancer service to the cluster
        2. Rebalancing the cluster
      6. Summary
    16. 8. Cluster Monitoring Using Events and Alerts
      1. Monitoring Hadoop services from Cloudera Manager
      2. Understanding events and alerts
        1. Configuring events and alerts
        2. Configuring the alert delivery by an e-mail
      3. Summary
    17. 9. Configuring Backups
      1. Understanding backups
        1. Types of backups
        2. Types of storage media for backups
        3. Using cloud services for backups
      2. Understanding HDFS backups
      3. Using the distributed copy (DistCp)
      4. Configuring backups using Cloudera Manager
        1. Configuring HDFS replication
        2. Configuring Hive replication
        3. Configuring snapshots
          1. Enabling snapshot paths in HDFS
          2. Configuring a snapshot policy
      5. Summary
    18. Index

Product information

  • Title: Cloudera Administration Handbook
  • Author(s): Rohit Menon
  • Release date: July 2014
  • Publisher(s): Packt Publishing
  • ISBN: 9781783558964