
Expert Hadoop® Administration

Name: Expert Hadoop® Administration
Author: Sam R. Alapati
ISBN: 9780134598147

by Sam R. Alapati

December 2016

Intermediate to advanced

848 pages

26h 22m

English

Addison-Wesley Professional


About This E-Book
Title Page
Copyright Page
Dedication Page
Contents
Foreword
Preface
Who This Book Is For
How This Book Is Structured and What It Covers
Acknowledgments
About the Author
I: Introduction to Hadoop—Architecture and Hadoop Clusters

1. Introduction to Hadoop and Its Environment
Hadoop—An Introduction
Unique Features of Hadoop
Big Data and Hadoop
A Typical Scenario for Using Hadoop
Traditional Database Systems
Data Lake
Big Data, Data Science and Hadoop
Cluster Computing and Hadoop Clusters
Cluster Computing
Hadoop Clusters
Hadoop Components and the Hadoop Ecosphere
What Do Hadoop Administrators Do?
Hadoop Administration—A New Paradigm
What You Need to Know to Administer Hadoop
The Hadoop Administrator’s Toolset
Key Differences between Hadoop 1 and Hadoop 2
Architectural Differences
High-Availability Features
Multiple Processing Engines
Separation of Processing and Scheduling
Resource Allocation in Hadoop 1 and Hadoop 2
Distributed Data Processing: MapReduce and Spark, Hive and Pig
MapReduce
Apache Spark
Apache Hive
Apache Pig
Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
Key Areas of Hadoop Administration
Managing the Cluster Storage
Allocating the Cluster Resources
Scheduling Jobs
Securing Hadoop Data
Summary
2. An Introduction to the Architecture of Hadoop
Distributed Computing and Hadoop
Hadoop Architecture
A Hadoop Cluster
Master and Worker Nodes
Hadoop Services
Data Storage—The Hadoop Distributed File System
HDFS Unique Features
HDFS Architecture
The HDFS File System
NameNode Operations
Data Processing with YARN, the Hadoop Operating System
Architecture of YARN
How the ApplicationMaster Works with the ResourceManager to Allocate Resources
Summary
3. Creating and Configuring a Simple Hadoop Cluster
Hadoop Distributions and Installation Types
Hadoop Distributions
Hadoop Installation Types
Setting Up a Pseudo-Distributed Hadoop Cluster
Meeting the Operating System Requirements
Modifying Kernel Parameters
Setting Up SSH
Java Requirements
Installing the Hadoop Software
Creating the Necessary Hadoop Users
Creating the Necessary Directories
Performing the Initial Hadoop Configuration
Environment Configuration Files
Read-Only Default Configuration Files
Site-Specific Configuration Files
Other Hadoop-Related Configuration Files
Precedence among the Configuration Files
Variable Expansion and Configuration Parameters
Configuring the Hadoop Daemons Environment
Configuring Core Hadoop Properties (with the core-site.xml File)
Configuring MapReduce (with the mapred-site.xml File)
Configuring YARN (with the yarn-site.xml File)
Operating the New Hadoop Cluster
Formatting the Distributed File System
Setting the Environment Variables
Starting the HDFS and YARN Services
Verifying the Service Startup
Shutting Down the Services
Summary
4. Planning for and Creating a Fully Distributed Cluster
Planning Your Hadoop Cluster
General Cluster Planning Considerations
Server Form Factors
Criteria for Choosing the Nodes
Going from a Single Rack to Multiple Racks
Sizing a Hadoop Cluster
General Principles Governing the Choice of CPU, Memory and Storage
Special Treatment for the Master Nodes
Recommendations for Sizing the Servers
Growing a Cluster
Guidelines for Large Clusters
Creating a Multinode Cluster
How the Test Cluster Is Set Up
Modifying the Hadoop Configuration
Changing the HDFS Configuration (hdfs-site.xml file)
Changing the YARN Configuration
Changing the MapReduce Configuration
Starting Up the Cluster
Starting Up and Shutting Down the Cluster with Scripts
Performing a Quick Check of the New Cluster’s File System
Configuring Hadoop Services, Web Interfaces and Ports
Service Configuration and Web Interfaces
Setting Port Numbers for Hadoop Services
Hadoop Clients
Summary
II: Hadoop Application Frameworks
5. Running Applications in a Cluster—The MapReduce Framework (and Hive and Pig)
The MapReduce Framework
The MapReduce Model
How MapReduce Works
MapReduce Job Processing
A Simple MapReduce Program
Understanding Hadoop’s Job Processing—Running a WordCount Program
MapReduce Input and Output Directories
How Hadoop Shows You the Job Details
Hadoop Streaming
Apache Hive
Hive Data Organization
Working with Hive Tables
Loading Data into Hive
Querying with Hive
Apache Pig
Pig Execution Modes
A Simple Pig Example
Summary
6. Running Applications in a Cluster—The Spark Framework
What Is Spark?
Why Spark?
Speed
Ease of Use and Accessibility
General-Purpose Framework
Spark and Hadoop
The Spark Stack
Installing Spark
Spark Examples
Key Spark Files and Directories
Compiling the Spark Binaries
Reducing Spark’s Verbosity
Spark Run Modes
Local Mode
Cluster Mode
Understanding the Cluster Managers
The Standalone Cluster Manager
Spark on Apache Mesos
Spark on YARN
How YARN and Spark Work Together
Setting Up Spark on a Hadoop Cluster
Spark and Data Access
Loading Data from the Linux File System
Loading Data from HDFS
Loading Data from a Relational Database
Summary
7. Running Spark Applications
The Spark Programming Model
Spark Programming and RDDs
Programming Spark
Spark Applications
Basics of RDDs
Creating an RDD
RDD Operations
RDD Persistence
Architecture of a Spark Application
Spark Terminology
Components of a Spark Application
Running Spark Applications Interactively
Spark Shell and Spark Applications
A Bit about the Spark Shell
Using the Spark Shell
Overview of Spark Cluster Execution
Creating and Submitting Spark Applications
Building the Spark Application
Running an Application in the Standalone Spark Cluster
Using spark-submit to Execute Applications
Running Spark Applications on Mesos
Running Spark Applications in a YARN-Managed Hadoop Cluster
Using the JDBC/ODBC Server
Configuring Spark Applications
Spark Configuration Properties
Specifying Configuration when Running spark-submit
Monitoring Spark Applications
Handling Streaming Data with Spark Streaming
How Spark Streaming Works
A Spark Streaming Example—WordCount Again!
Using Spark SQL for Handling Structured Data
DataFrames
HiveContext and SQLContext
Working with Spark SQL
Creating DataFrames
Summary
III: Managing and Protecting Hadoop Data and High Availability
8. The Role of the NameNode and How HDFS Works
HDFS—The Interaction between the NameNode and the DataNodes
Interaction between the Clients and HDFS
NameNode and DataNode Communications
Rack Awareness and Topology
How to Configure Rack Awareness in Your Cluster
Finding Your Cluster’s Rack Information
HDFS Data Replication
HDFS Data Organization and Data Blocks
Data Replication
Block and Replica States
How Clients Read and Write HDFS Data
How Clients Read HDFS Data
How Clients Write Data to HDFS
Understanding HDFS Recovery Processes
Generation Stamp
Lease Recovery
Block Recovery
Pipeline Recovery
Centralized Cache Management in HDFS
Hadoop and OS Page Caching
The Key Principles Behind Centralized Cache Management
How Centralized Cache Management Works
Configuring Caching
Cache Directives
Cache Pools
Using the Cache
Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage)
Performance Characteristics of Storage Types
The Need for Heterogeneous HDFS Storage
Changes in the Storage Architecture
Storage Preferences for Files
Setting Up Archival Storage
Managing Storage Policies
Moving Data Around
Implementing Archival Storage
Summary
9. HDFS Commands, HDFS Permissions and HDFS Storage
Managing HDFS through the HDFS Shell Commands
Using the hdfs dfs Utility to Manage HDFS
Listing HDFS Files and Directories
Creating an HDFS Directory
Removing HDFS Files and Directories
Changing File and Directory Ownership and Groups
Using the dfsadmin Utility to Perform HDFS Operations
The dfsadmin -report Command
Managing HDFS Permissions and Users
HDFS File Permissions
HDFS Users and Super Users
Managing HDFS Storage
Checking HDFS Disk Usage
Allocating HDFS Space Quotas
Rebalancing HDFS Data
Reasons for HDFS Data Imbalance
Running the Balancer Tool to Balance HDFS Data
Using hdfs dfsadmin to Make Things Easier
When to Run the Balancer
Reclaiming HDFS Space
Removing Files and Directories
Decreasing the Replication Factor
Summary
10. Data Protection, File Formats and Accessing HDFS
Safeguarding Data
Using HDFS Trash to Prevent Accidental Data Deletion
Using HDFS Snapshots to Protect Important Data
Ensuring Data Integrity with File System Checks
Data Compression
Common Compression Formats
Evaluating the Various Compression Schemes
Compression at Various Stages for MapReduce
Compression for Spark
Data Serialization
Hadoop File Formats
Criteria for Determining the Right File Format
File Formats Supported by Hadoop
The Ideal File Format
The Hadoop Small Files Problem and Merging Files
Using a Federated NameNode to Overcome the Small Files Problem
Using Hadoop Archives to Manage Many Small Files
Handling the Performance Impact of Small Files
Using Hadoop WebHDFS and HttpFS
WebHDFS—The Hadoop REST API
Using the WebHDFS API
Understanding the WebHDFS Commands
Using HttpFS Gateway to Access HDFS from Behind a Firewall
Summary
11. NameNode Operations, High Availability and Federation
Understanding NameNode Operations
HDFS Metadata
The NameNode Startup Process
How the NameNode and the DataNodes Work Together
The Checkpointing Process
Secondary, Checkpoint, Backup and Standby Nodes
Configuring the Checkpointing Frequency
Managing Checkpoint Performance
The Mechanics of Checkpointing
NameNode Safe Mode Operations
Automatic Safe Mode Operations
Placing the NameNode in Safe Mode
How the NameNode Transitions Through Safe Mode
Backing Up and Recovering the NameNode Metadata
Configuring HDFS High Availability
NameNode HA Architecture (QJM)
Setting Up an HDFS HA Quorum Cluster
Deploying the High-Availability NameNodes
Managing an HA NameNode Setup
HA Manual and Automatic Failover
HDFS Federation
Architecture of a Federated NameNode
Summary
IV: Moving Data, Allocating Resources, Scheduling Jobs and Security
12. Moving Data Into and Out of Hadoop
Introduction to Hadoop Data Transfer Tools
Loading Data into HDFS from the Command Line
Using the -cat Command to Dump a File’s Contents
Testing HDFS Files
Copying and Moving Files from and to HDFS
Using the -get Command to Move Files
Moving Files from and to HDFS
Using the -tail and head Commands
Copying HDFS Data between Clusters with DistCp
How to Use the DistCp Command to Move Data
DistCp Options
Ingesting Data from Relational Databases with Sqoop
Sqoop Architecture
Deploying Sqoop
Using Sqoop to Move Data
Importing Data with Sqoop
Importing Data into Hive
Exporting Data with Sqoop
Ingesting Data from External Sources with Flume
Flume Architecture in a Nutshell
Configuring the Flume Agent
A Simple Flume Example
Using Flume to Move Data to HDFS
A More Complex Flume Example
Ingesting Data with Kafka
Benefits Offered by Kafka
How Kafka Works
Setting Up an Apache Kafka Cluster
Integrating Kafka with Hadoop and Storm
Summary
13. Resource Allocation in a Hadoop Cluster
Resource Allocation in Hadoop
Managing Cluster Workloads
Hadoop’s Resource Schedulers
The FIFO Scheduler
The Capacity Scheduler
Queues and Subqueues
How the Cluster Allocates Resources
Preempting Applications
Enabling the Capacity Scheduler
A Typical Capacity Scheduler
The Fair Scheduler
Queues
Configuring the Fair Scheduler
How Jobs Are Placed into Queues
Application Preemption in the Fair Scheduler
Security and Resource Pools
A Sample fair-scheduler.xml File
Submitting Jobs to the Scheduler
Moving Applications between Queues
Monitoring the Fair Scheduler
Comparing the Capacity Scheduler and the Fair Scheduler
Similarities between the Two Schedulers
Differences between the Two Schedulers
Summary
14. Working with Oozie to Manage Job Workflows
Using Apache Oozie to Schedule Jobs
Oozie Architecture
The Oozie Server
The Oozie Client
The Oozie Database
Deploying Oozie in Your Cluster
Installing and Configuring Oozie
Configuring Hadoop for Oozie
Understanding Oozie Workflows
Workflows, Control Flow, and Nodes
Defining the Workflows with the workflow.xml File
How Oozie Runs an Action
Configuring the Action Nodes
Creating an Oozie Workflow
Configuring the Control Nodes
Configuring the Job
Running an Oozie Workflow Job
Specifying the Job Properties
Deploying Oozie Jobs
Creating Dynamic Workflows
Oozie Coordinators
Time-Based Coordinators
Data-Based Coordinators
Time-and-Data-Based Coordinators
Submitting the Oozie Coordinator from the Command Line
Managing and Administering Oozie
Common Oozie Commands and How to Run Them
Troubleshooting Oozie
Oozie cron Scheduling and Oozie Service Level Agreements
Summary
15. Securing Hadoop
Hadoop Security—An Overview
Authentication, Authorization and Accounting
Hadoop Authentication with Kerberos
Kerberos and How It Works
The Kerberos Authentication Process
Kerberos Trusts
A Special Principal
Adding Kerberos Authorization to your Cluster
Setting Up Kerberos for Hadoop
Securing a Hadoop Cluster with Kerberos
How Kerberos Authenticates Users and Services
Managing a Kerberized Hadoop Cluster
Hadoop Authorization
HDFS Permissions
Service Level Authorization
Role-Based Authorization with Apache Sentry
Auditing Hadoop
Auditing HDFS Operations
Auditing YARN Operations
Securing Hadoop Data
HDFS Transparent Encryption
Encrypting Data in Transition
Other Hadoop-Related Security Initiatives
Securing a Hadoop Infrastructure with Apache Knox Gateway
Apache Ranger for Security Administration
Summary
V: Monitoring, Optimization and Troubleshooting
16. Managing Jobs, Using Hue and Performing Routine Tasks
Using the YARN Commands to Manage Hadoop Jobs
Viewing YARN Applications
Checking the Status of an Application
Killing a Running Application
Checking the Status of the Nodes
Checking YARN Queues
Getting the Application Logs
Yarn Administrative Commands
Decommissioning and Recommissioning Nodes
Including and Excluding Hosts
Decommissioning DataNodes and NodeManagers
Recommissioning Nodes
Things to Remember about Decommissioning and Recommissioning
Adding a New DataNode and/or a NodeManager
ResourceManager High Availability
ResourceManager High-Availability Architecture
Setting Up ResourceManager High Availability
ResourceManager Failover
Using the ResourceManager High-Availability Commands
Performing Common Management Tasks
Moving the NameNode to a Different Host
Managing High-Availability NameNodes
Using a Shutdown/Startup Script to Manage your Cluster
Balancing HDFS
Balancing the Storage on the DataNodes
Managing the MySQL Database
Configuring a MySQL Database
Configuring MySQL High Availability
Backing Up Important Cluster Data
Backing Up HDFS Metadata
Backing Up the Metastore Databases
Using Hue to Administer Your Cluster
Allowing Your Users to Use Hue
Installing Hue
Configuring Your Cluster to Work with Hue
Managing Hue
Working with Hue
Implementing Specialized HDFS Features
Deploying HDFS and YARN in a Multihomed Network
Short-Circuit Local Reads
Mountable HDFS
Using an NFS Gateway for Mounting HDFS to a Local File System
Summary
17. Monitoring, Metrics and Hadoop Logging
Monitoring Linux Servers
Basics of Linux System Monitoring
Monitoring Tools for Linux Systems
Hadoop Metrics
Hadoop Metric Types
Using the Hadoop Metrics
Capturing Metrics to a File System
Using Ganglia for Monitoring
Ganglia Architecture
Setting Up the Ganglia and Hadoop Integration
Setting Up the Hadoop Metrics
Understanding Hadoop Logging
Hadoop Log Messages
Daemon and Application Logs and How to View Them
How Application Logging Works
How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run
How the NodeManager Uses the Local Directories
Storing Job Logs in HDFS through Log Aggregation
Working with the Hadoop Daemon Logs
Using Hadoop’s Web UIs for Monitoring
Monitoring Jobs with the ResourceManager Web UI
The JobHistoryServer Web UI
Monitoring with the NameNode Web UI
Monitoring Other Hadoop Components
Monitoring Hive
Monitoring Spark
Summary
18. Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking
How to Allocate YARN Memory and CPU
Allocating Memory
Configuring the Number of CPU Cores
Relationship between Memory and CPU Vcores
Configuring Efficient Performance
Speculative Execution
Reducing the I/O Load on the System
Tuning Map and Reduce Tasks—What the Administrator Can Do
Tuning the Map Tasks
Input and Output
Tuning the Reduce Tasks
Tuning the MapReduce Shuffle Process
Optimizing Pig and Hive Jobs
Optimizing Hive Jobs
Optimizing Pig Jobs
Benchmarking Your Cluster
Using TestDFSIO for Testing I/O Performance
Benchmarking with TeraSort
Using Hadoop’s Rumen and GridMix for Benchmarking
Hadoop Counters
File System Counters
Job Counters
MapReduce Framework Counters
Custom Java Counters
Limiting the Number of Counters
Optimizing MapReduce
Map-Only versus Map and Reduce Jobs
How Combiners Improve MapReduce Performance
Using a Partitioner to Improve Performance
Compressing Data During the MapReduce Process
Too Many Mappers or Reducers?
Summary
19. Configuring and Tuning Apache Spark on YARN
Configuring Resource Allocation for Spark on YARN
Allocating CPU
Allocating Memory
How Resources are Allocated to Spark
Limits on the Resource Allocation to Spark Applications
Allocating Resources to the Driver
Configuring Resources for the Executors
How Spark Uses its Memory
Things to Remember
Cluster or Client Mode?
Configuring Spark-Related Network Parameters
Dynamic Resource Allocation when Running Spark on YARN
Dynamic and Static Resource Allocation
How Spark Manages Dynamic Resource Allocation
Enabling Dynamic Resource Allocation
Storage Formats and Compressing Data
Storage Formats
File Sizes
Compression
Monitoring Spark Applications
Using the Spark Web UI to Understand Performance
Spark System and the Metrics REST API
The Spark History Server on YARN
Tracking Jobs from the Command Line
Tuning Garbage Collection
The Mechanics of Garbage Collection
How to Collect GC Statistics
Tuning Spark Streaming Applications
Reducing Batch Processing Time
Setting the Right Batch Interval
Tuning Memory and Garbage Collection
Summary
20. Optimizing Spark Applications
Revisiting the Spark Execution Model
The Spark Execution Model
Shuffle Operations and How to Minimize Them
A WordCount Example to Our Rescue Again
Impact of a Shuffle Operation
Configuring the Shuffle Parameters
Partitioning and Parallelism (Number of Tasks)
Level of Parallelism
Problems with Too Few Tasks
Setting the Default Number of Partitions
How to Increase the Number of Partitions
Using the Repartition and Coalesce Operators to Change the Number of Partitions in an RDD
Two Types of Partitioners
Data Partitioning and How It Can Avoid a Shuffle
Optimizing Data Serialization and Compression
Data Serialization
Configuring Compression
Understanding Spark’s SQL Query Optimizer
Understanding the Optimizer Steps
Spark’s Speculative Execution Feature
The Importance of Data Locality
Caching Data
Fault-Tolerance Due to Caching
How to Specify Caching
Summary
21. Troubleshooting Hadoop—A Sampler
Space-Related Issues
Dealing with a 100 Percent Full Linux File System
HDFS Space Issues
Local and Log Directories Out of Free Space
Disk Volume Failure Toleration
Handling YARN Jobs That Are Stuck
JVM Memory-Allocation and Garbage-Collection Strategies
Understanding JVM Garbage Collection
Optimizing Garbage Collection
Analyzing Memory Usage
Out of Memory Errors
ApplicationMaster Memory Issues
Handling Different Types of Failures
Handling Daemon Failures
Starting Failures for Hadoop Daemons
Task and Job Failures
Troubleshooting Spark Jobs
Spark’s Fault Tolerance Mechanism
Killing Spark Jobs
Maximum Attempts for a Job
Maximum Failures per Job
Debugging Spark Applications
Viewing Logs with Log Aggregation
Viewing Logs When Log Aggregation Is Not Enabled
Reviewing the Launch Environment
Summary
A. Installing VirtualBox and Linux and Cloning the Virtual Machines
Installing Oracle VirtualBox
Installing Oracle Enterprise Linux
Cloning the Linux Server
Index
Code Snippets
