Book description
The Comprehensive, Up-to-Date Apache Hadoop Administration Handbook and Reference
“Sam Alapati has worked with production Hadoop clusters for six years. His unique depth of experience has enabled him to write the go-to resource for all administrators looking to spec, size, expand, and secure production Hadoop clusters of any size.” –Paul Dix, Series Editor
In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples.
Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You’ll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run.
Understand Hadoop’s architecture from an administrator’s standpoint
Create simple and fully distributed clusters
Run MapReduce and Spark applications in a Hadoop cluster
Manage and protect Hadoop data and high availability
Work with HDFS commands, file permissions, and storage management
Move data, and use YARN to allocate resources and schedule jobs
Manage job workflows with Oozie and Hue
Secure, monitor, log, and optimize Hadoop
Benchmark and troubleshoot Hadoop
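Much of the configuration work the book walks through centers on Hadoop's site-specific XML files. As a minimal illustrative sketch (not taken from the book itself), a pseudo-distributed single-node setup typically points the default filesystem at a local NameNode in core-site.xml and drops the block replication factor to 1 in hdfs-site.xml:

```xml
<!-- core-site.xml: point clients at a NameNode running on localhost.
     The port (9000) is a common convention, not a requirement. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single-node cluster has only one DataNode,
     so replicating each block once is the only sensible setting. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Fully distributed clusters, covered later in the book, replace `localhost` with the NameNode's hostname and raise `dfs.replication` (3 is the Hadoop default).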
Table of contents
- About This E-Book
- Title Page
- Copyright Page
- Dedication Page
- Contents
- Foreword
- Preface
- Acknowledgments
- About the Author
- I: Introduction to Hadoop—Architecture and Hadoop Clusters
- 1. Introduction to Hadoop and Its Environment
- Hadoop—An Introduction
- Cluster Computing and Hadoop Clusters
- Hadoop Components and the Hadoop Ecosphere
- What Do Hadoop Administrators Do?
- Key Differences between Hadoop 1 and Hadoop 2
- Distributed Data Processing: MapReduce and Spark, Hive and Pig
- Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
- Key Areas of Hadoop Administration
- Summary
- 2. An Introduction to the Architecture of Hadoop
- 3. Creating and Configuring a Simple Hadoop Cluster
- Hadoop Distributions and Installation Types
- Setting Up a Pseudo-Distributed Hadoop Cluster
- Performing the Initial Hadoop Configuration
- Environment Configuration Files
- Read-Only Default Configuration Files
- Site-Specific Configuration Files
- Other Hadoop-Related Configuration Files
- Precedence among the Configuration Files
- Variable Expansion and Configuration Parameters
- Configuring the Hadoop Daemons Environment
- Configuring Core Hadoop Properties (with the core-site.xml File)
- Configuring MapReduce (with the mapred-site.xml File)
- Configuring YARN (with the yarn-site.xml File)
- Operating the New Hadoop Cluster
- Summary
- 4. Planning for and Creating a Fully Distributed Cluster
- II: Hadoop Application Frameworks
- 5. Running Applications in a Cluster—The MapReduce Framework (and Hive and Pig)
- 6. Running Applications in a Cluster—The Spark Framework
- 7. Running Spark Applications
- The Spark Programming Model
- Spark Applications
- Architecture of a Spark Application
- Running Spark Applications Interactively
- Creating and Submitting Spark Applications
- Configuring Spark Applications
- Monitoring Spark Applications
- Handling Streaming Data with Spark Streaming
- Using Spark SQL for Handling Structured Data
- Summary
- III: Managing and Protecting Hadoop Data and High Availability
- 8. The Role of the NameNode and How HDFS Works
- 9. HDFS Commands, HDFS Permissions and HDFS Storage
- 10. Data Protection, File Formats and Accessing HDFS
- Safeguarding Data
- Data Compression
- Hadoop File Formats
- Criteria for Determining the Right File Format
- File Formats Supported by Hadoop
- The Ideal File Format
- The Hadoop Small Files Problem and Merging Files
- Using a Federated NameNode to Overcome the Small Files Problem
- Using Hadoop Archives to Manage Many Small Files
- Handling the Performance Impact of Small Files
- Using Hadoop WebHDFS and HttpFS
- Summary
- 11. NameNode Operations, High Availability and Federation
- IV: Moving Data, Allocating Resources, Scheduling Jobs and Security
- 12. Moving Data Into and Out of Hadoop
- 13. Resource Allocation in a Hadoop Cluster
- 14. Working with Oozie to Manage Job Workflows
- 15. Securing Hadoop
- V: Monitoring, Optimization and Troubleshooting
- 16. Managing Jobs, Using Hue and Performing Routine Tasks
- Using the YARN Commands to Manage Hadoop Jobs
- Decommissioning and Recommissioning Nodes
- ResourceManager High Availability
- Performing Common Management Tasks
- Managing the MySQL Database
- Backing Up Important Cluster Data
- Using Hue to Administer Your Cluster
- Implementing Specialized HDFS Features
- Summary
- 17. Monitoring, Metrics and Hadoop Logging
- Monitoring Linux Servers
- Hadoop Metrics
- Using Ganglia for Monitoring
- Understanding Hadoop Logging
- Hadoop Log Messages
- Daemon and Application Logs and How to View Them
- How Application Logging Works
- How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run
- How the NodeManager Uses the Local Directories
- Storing Job Logs in HDFS through Log Aggregation
- Working with the Hadoop Daemon Logs
- Using Hadoop’s Web UIs for Monitoring
- Monitoring Other Hadoop Components
- Summary
- 18. Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking
- 19. Configuring and Tuning Apache Spark on YARN
- Configuring Resource Allocation for Spark on YARN
- Allocating CPU
- Allocating Memory
- How Resources are Allocated to Spark
- Limits on the Resource Allocation to Spark Applications
- Allocating Resources to the Driver
- Configuring Resources for the Executors
- How Spark Uses its Memory
- Things to Remember
- Cluster or Client Mode?
- Configuring Spark-Related Network Parameters
- Dynamic Resource Allocation when Running Spark on YARN
- Storage Formats and Compressing Data
- Monitoring Spark Applications
- Tuning Garbage Collection
- Tuning Spark Streaming Applications
- Summary
- 20. Optimizing Spark Applications
- 21. Troubleshooting Hadoop—A Sampler
- A. Installing VirtualBox and Linux and Cloning the Virtual Machines
- Index
- Code Snippets
Product information
- Title: Expert Hadoop® Administration
- Author(s): Sam R. Alapati
- Release date: December 2016
- Publisher(s): Addison-Wesley Professional
- ISBN: 9780134598147