
Learning YARN

Name: Learning YARN
ISBN: 9781784393960

by Akhil Arora, Shrey Mehrotra

August 2015

Intermediate to advanced

278 pages

5h 54m

English

Packt Publishing

Table of Contents
Learning YARN
Credits
About the Authors
Acknowledgments
About the Reviewers
www.PacktPub.com
  Support files, eBooks, discount offers, and more
  Why subscribe?
  Free access for Packt account holders
Preface
What this book covers
What you need for this book

Who this book is for
Conventions
Reader feedback
Customer support
  Downloading the example code
  Errata
  Piracy
  Questions
1. Starting with YARN Basics
Introduction to MapReduce v1
Shortcomings of MapReduce v1
An overview of YARN components
  ResourceManager
  NodeManager
  ApplicationMaster
  Container
The YARN architecture
How YARN satisfies big data needs
Projects powered by YARN
Summary
2. Setting up a Hadoop-YARN Cluster
Starting with the basics
  Supported platforms
  Hardware requirements
  Software requirements
  Basic Linux commands / utilities
  Sudo
  Nano editor
  Source
  Jps
  Netstat
  Man
  Preparing a node for a Hadoop-YARN cluster
  Install Java
  Create a Hadoop dedicated user and group
  Disable firewall or open Hadoop ports
  Configure the domain name resolution
  Install SSH and configure passwordless SSH from the master to all slaves
The Hadoop-YARN single node installation
  Prerequisites
  Installation steps
  Step 1 – Download and extract the Hadoop bundle
  Step 2 – Configure the environment variables
  Step 3 – Configure the Hadoop configuration files
  The core-site.xml file
  The hdfs-site.xml file
  The mapred-site.xml file
  The yarn-site.xml file
  The hadoop-env.sh and yarn-env.sh files
  The slaves file
  Step 4 – Format NameNode
  Step 5 – Start Hadoop daemons
An overview of web user interfaces
Run a sample application
The Hadoop-YARN multi-node installation
  Prerequisites
  Installation steps
  Step 1 – Configure the master node as a single-node Hadoop-YARN installation
  Step 2 – Copy the Hadoop folder to all the slave nodes
  Step 3 – Configure environment variables on slave nodes
  Step 4 – Format NameNode
  Step 5 – Start Hadoop daemons
An overview of the Hortonworks and Cloudera installations
Summary
3. Administering a Hadoop-YARN Cluster
Using the Hadoop-YARN commands
  The user commands
  Jar
  Application
  Command options
  Sample output
  Node
  Command options
  Sample output
  Logs
  Command options
  Classpath
  Version
  Administration commands
  ResourceManager / NodeManager / ProxyServer
  RMAdmin
  Command options
  DaemonLog
  Command options
Configuring the Hadoop-YARN services
  The ResourceManager service
  The NodeManager service
  The Timeline server
  The web application proxy server
  Ports summary
Managing the Hadoop-YARN services
  Managing service logs
  Managing pid files
Monitoring the YARN services
  JMX monitoring
  The ResourceManager JMX beans
  The NodeManager JMX beans
  Ganglia monitoring
  Ganglia daemons
  Integrating Ganglia with Hadoop
Understanding ResourceManager's High Availability
  Architecture
  Failover mechanisms
  Configuring ResourceManager's High Availability
  Define nodes
  The RM state store mechanism
  The failover proxy provider
  Automatic failover
  High Availability admin commands
Monitoring NodeManager's health
  The health checker script
Summary
4. Executing Applications Using YARN
Understanding application execution flow
  Phase 1 – Application initialization and submission
  Phase 2 – Allocate memory and start ApplicationMaster
  Phase 3 – ApplicationMaster registration and resource allocation
  Phase 4 – Launch and monitor containers
  Phase 5 – Application progress report
  Phase 6 – Application completion
Submitting a sample MapReduce application
  Submitting an application to the cluster
  Updates in the ResourceManager web UI
  Understanding the application process
  Tracking application details
  The ApplicationMaster process
  Cluster nodes information
  Node's container list
  YARN child processes
  Application details after completion
Handling failures in YARN
  The container failure
  The NodeManager failure
  The ResourceManager failure
YARN application logging
  Services logs
  Application logs
Summary
5. Understanding YARN Life Cycle Management
An introduction to state management analogy
The ResourceManager's view
  View 1 – Node
  View 2 – Application
  View 3 – An application attempt
  View 4 – Container
The NodeManager's view
  View 1 – Application
  View 2 – Container
  View 3 – A localized resource
Analyzing transitions through logs
  NodeManager registration with ResourceManager
  Application submission
  Container resource allocation
  Resource localization
Summary
6. Migrating from MRv1 to MRv2
Introducing MRv1 and MRv2
High-level changes from MRv1 to MRv2
  The evolution of the MRApplicationMaster service
  Resource capability
  Pluggable shuffle
  Hierarchical queues and fair scheduler
  Task execution as containers
The migration steps from MRv1 to MRv2
  Configuration changes
  The binary / source compatibility
Running and monitoring MRv1 apps on YARN
Summary
7. Writing Your Own YARN Applications
An introduction to the YARN API
  YARNConfiguration
  Load resources
  Final properties
  Variable expansion
  ApplicationSubmissionContext
  ContainerLaunchContext
  Communication protocols
  ApplicationClientProtocol
  ApplicationMasterProtocol
  ContainerManagementProtocol
  ApplicationHistoryProtocol
  YARN client API
Writing your own application
  Step 1 – Create a new project and add Hadoop-YARN JAR files
  Step 2 – Define the ApplicationMaster and client classes
  Define an ApplicationMaster
  Define a YARN client
  Step 3 – Export the project and copy resources
  Step 4 – Run the application using bin or the YARN command
Summary
8. Dive Deep into YARN Components
Understanding ResourceManager
  The client and admin interfaces
  The core interfaces
  The NodeManager interfaces
  The security and token managers
Understanding NodeManager
  Status updates
  State and health management
  Container management
  The security and token managers
The YARN Timeline server
The web application proxy server
YARN Scheduler Load Simulator (SLS)
Handling resource localization in YARN
  Resource localization terminologies
  The resource localization directory structure
Summary
9. Exploring YARN REST Services
Introduction to YARN REST services
  HTTP request and response
  Successful response
  Response with an error
ResourceManager REST APIs
  The cluster summary
  Scheduler details
  Nodes
  Applications
NodeManager REST APIs
  The node summary
  Applications
  Containers
MapReduce ApplicationMaster REST APIs
  ApplicationMaster summary
  Jobs
  Tasks
MapReduce HistoryServer REST APIs
How to access REST services
  RESTClient plugins
  Curl command
  Java API
Summary
10. Scheduling YARN Applications
An introduction to scheduling in YARN
An overview of queues
Types of queues
  CapacityScheduler Queue (CSQueue)
  FairScheduler Queue (FSQueue)
An introduction to schedulers
  Fair scheduler
  Hierarchical queues
  Schedulable
  Scheduling policy
  Configuring a fair scheduler
  CapacityScheduler
  Configuring CapacityScheduler
Summary
11. Enabling Security in YARN
Adding security to a YARN cluster
  Using a dedicated user group for Hadoop-YARN daemons
  Validating permissions to YARN directories
  Enabling the HTTPS protocol
  Enabling authorization using Access Control Lists
  Enabling authentication using Kerberos
Working with ACLs
  Defining an ACL value
  Type of ACLs
  The administration ACL
  The service-level ACL
  The queue ACL
  The application ACL
Other security frameworks
  Apache Ranger
  Apache Knox
Summary
12. Real-time Data Analytics Using YARN
The integration of Spark with YARN
  Running Spark on YARN
The integration of Storm with YARN
  Running Storm on YARN
  Create a Zookeeper quorum
  Download, extract, and prepare the Storm bundle
  Copy Storm ZIP to HDFS
  Configuring the storm.yaml file
  Launching the Storm-YARN cluster
  Managing Storm on YARN
The integration of HAMA and Giraph with YARN
Summary
Index

Content preview from Learning YARN

Chapter 1. Starting with YARN Basics

Apache Hadoop was introduced in early 2006 as a framework for the distributed processing of large datasets stored across clusters of computers, using a simple programming model. It was developed to handle big data in the simplest and most cost-effective way possible. Hadoop consisted of a storage layer, the Hadoop Distributed File System (HDFS), and the MapReduce framework, which managed resource utilization and job execution on a cluster. Because it delivers high-performance parallel data analysis and runs on commodity hardware, Hadoop is used for big data analysis and for batch processing of historical data through MapReduce programming.
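As a rough illustration of the MapReduce programming model mentioned above, the sketch below counts words in plain Python on a single machine. It is not Hadoop code and uses no Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, chosen only to mirror the map, group-by-key, and reduce stages that the framework would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real cluster, each map call would run on the node holding that block of input in HDFS, and the grouped keys would be partitioned across many reducers; here everything runs in one process purely to show the shape of the model.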

With the exponential increase in the usage of social networking ...
