Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture

ISBN: 9780133812350

by George J. Trujillo Jr., Charles Kim, Steven Jones, Rommel Garcia, Justin Murray

July 2015

Intermediate to advanced

480 pages

13h 43m

English

VMware Press

About This eBook
Title Page
Copyright Page
We Want to Hear from You!
Reader Services
Dedication Page
About the Authors
Contributor
About the Technical Editor
Acknowledgments

Contents at a Glance
Contents
Foreword
Preface
Motivation for Writing This Book; Prerequisites; Who Should Read This Book; How to Use This Book
Part I: Introduction to Hadoop
Chapter 1. Understanding the Big Data World
The Data Revolution; Traditional Data Systems; Semi-Structured and Unstructured Data; Causation and Correlation; Data Challenges; The Modern Data Architecture; Organizational Transformation; Industry Transformation; Summary
Chapter 2. Hadoop Fundamental Concepts
Types of Data in Hadoop; Use Cases; What Is Hadoop?; Hadoop Distributions; Hadoop Frameworks; NoSQL Databases; What Is NoSQL?; A Hadoop Cluster; Hadoop Software Processes; Hadoop Hardware Profiles; Roles in the Hadoop Environment; Summary
Chapter 3. YARN and HDFS
A Hadoop Cluster Is Distributed; Hadoop Directory Layouts; Hadoop Operating System Users; The Hadoop Distributed File System; YARN Logging; The NameNode; The DataNode; Block Placement; NameNode Configurations and Managing Metadata; Rack Awareness; Block Management; The Balancer; Maintaining Data Integrity in the Cluster; Quotas and Trash; YARN and the YARN Processing Model; Running Applications on YARN; Resource Schedulers; Benchmarking; TeraSort Benchmarking Suite; Summary
Chapter 4. The Modern Data Platform
Designing a Hadoop Cluster; Enterprise Data Movement; Summary
Chapter 5. Data Ingestion
Extraction, Loading, and Transformation (ELT); Sqoop: Data Movement with SQL Sources; Flume: Streaming Data; Oozie: Scheduling and Workflow; Falcon: Data Lifecycle Management; Kafka: Real-time Data Streaming; Summary
Chapter 6. Hadoop SQL Engines
Where SQL Was Born; SQL in Hadoop; Hadoop SQL Engines; Selecting the SQL Tool For Hadoop; Now Getting Groovy with Hive and Pig; Hive; HCatalog; Pig; Summary
Chapter 7. Multitenancy in Hadoop
Securing the Access; Authentication; Auditing; Authorization; Data Protection; Isolating the Data; Isolating the Process; Summary
Part II: Introduction to Virtualization
Chapter 8. Virtualization Fundamentals
Why Virtualize Hadoop?; Introduction to Virtualization; Summary; References
Chapter 9. Best Practices for Virtualizing Hadoop
Running Virtualized Hadoop with Purpose and Discipline; The Discipline of Purpose Starts with a Clear Target; Virtualizing Different Tiers of Hadoop; Industry Best Practices; Summary
Part III: Virtualizing Hadoop
Chapter 10. Virtualizing Hadoop
How Are Hadoop Ecosystems Going to Be Managed?; Building an Enterprise Hadoop Platform That Is Agile and Flexible; Clarification of Terms; The Journey from Bare-Metal to Virtualization; Why Consider Virtualizing Hadoop?; Benefits of Virtualizing Hadoop; Virtualized Hadoop Can Run as Fast or Faster Than Native; Coordination and Cross-Purpose Specialization Is the Future; Barriers Can Be Organizational; Virtualization Is Not an All or Nothing Option; Rapid Provisioning and Improving Quality of Development and Test Environments; Improve High Availability with Virtualization; Use Virtualization to Leverage Hadoop Workloads; Hadoop in the Cloud; Big Data Extensions; The Path to Virtualization; The Software-Defined Data Center; Virtualizing the Network; vRealize Suite; Summary; References
Chapter 11. Virtualizing Hadoop Master Servers
Virtualizing Servers in a Hadoop Cluster; Virtualizing the Environment Around Hadoop; Virtualizing the Master Hadoop Servers; Virtualizing Without the SAN; Summary
Chapter 12. Virtualizing the Hadoop Worker Nodes
A Brief Introduction to the Worker Nodes in Hadoop; Deployment Models for Hadoop Clusters; The Combined Model; The Separated Model; Network Effects of the Data-Compute Separation; The Shared-Storage Approach to the Data-Compute Separated Model; Local Disks for the Application’s Temporary Data; The Shared Storage Architecture Model Using Network-Attached Storage (NAS); Deployment Model Summary; Best Practices for Virtualizing Hadoop Workers; Disk I/O; The Hadoop Virtualization Extensions (HVE); Summary; References; Resources
Chapter 13. Deploying Hadoop as a Service in the Private Cloud
The Cloud Context; Stakeholders for Hadoop; Overview of the Solution Architecture; Summary; References
Chapter 14. Understanding the Installation of Hadoop
Map the Right Solutions to the Right Use Case; Thoughts About Installing Hadoop; Configuring Repositories; Installing HDP 2.2; Environment Preparation; Setting Up the Hadoop Configuration; Starting HDFS and YARN; Start YARN; Verifying MapReduce Functionality; Installing and Configuring Hive; Installing and Configuring MySQL Database; Installing and Configuring Hive and HCatalog; Summary
Chapter 15. Configuring Linux for Hadoop
Supported Linux Platforms; Different Deployment Models; Linux Golden Templates; Building a Linux Enterprise Hadoop Platform; Selecting the Linux Distribution; Optimal Linux Kernel Parameters and System Settings; epoll; Disable Swap Space; Disable Security During Install; IO Scheduler Tuning; Check Transparent Huge Pages Configuration; Limits.conf; Partition Alignment for RDMs; File System Considerations; Lazy Count Parameter for XFS; Mount Options; I/O Scheduler; Disk Read and Write Options; Storage Benchmarking; Java Version; Set Up NTP; Enable Jumbo Frames; Additional Network Considerations; Summary
Appendix A. Hadoop Cluster Creation: A Prerequisite Checklist
Appendix B. Big Data/Hadoop on VMware vSphere Reference Materials
Deployment Guides; Reference Architectures; Customer Case Studies; Performance; vSphere Big Data Extensions (BDE); Other vSphere Features and Big Data
Index
Code Snippets

Overview

Plan and Implement Hadoop Virtualization for Maximum Performance, Scalability, and Business Agility

Enterprises running Hadoop must absorb rapid changes in big data ecosystems, frameworks, products, and workloads. Virtualized approaches can offer important advantages in speed, flexibility, and elasticity. Now, a world-class team of enterprise virtualization and big data experts guides you through the choices, considerations, and tradeoffs surrounding Hadoop virtualization. The authors help you decide whether to virtualize Hadoop, deploy Hadoop in the cloud, or integrate conventional and virtualized approaches in a blended solution.

First, Virtualizing Hadoop reviews big data and Hadoop from the standpoint of the virtualization specialist. The authors demystify MapReduce, YARN, and HDFS and guide you through each stage of Hadoop data management. Next, they turn the tables, introducing big data experts to modern virtualization concepts and best practices.

Finally, they bring Hadoop and virtualization together, guiding you through the decisions you’ll face in planning, deploying, provisioning, and managing virtualized Hadoop. From security to multitenancy to day-to-day management, you’ll find reliable answers for choosing your best Hadoop strategy and executing it.

Coverage includes the following:

• Reviewing the frameworks, products, distributions, use cases, and roles associated with Hadoop

• Understanding YARN resource management, HDFS storage, and I/O

• Designing data ingestion, movement, and organization for modern enterprise data platforms

• Defining SQL engine strategies to meet strict SLAs

• Considering security, data isolation, and scheduling for multitenant environments

• Deploying Hadoop as a service in the cloud

• Reviewing the essential concepts, capabilities, and terminology of virtualization

• Applying current best practices, guidelines, and key metrics for Hadoop virtualization

• Managing multiple Hadoop frameworks and products as one unified system

• Virtualizing master and worker nodes to maximize availability and performance

• Installing and configuring Linux for a Hadoop environment
