
Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture

ISBN: 9780133812350

by George J. Trujillo Jr., Charles Kim, Steven Jones, Rommel Garcia, Justin Murray

July 2015

Intermediate to advanced

480 pages

13h 43m

English

VMware Press


About This eBook
Title Page
Copyright Page
We Want to Hear from You!
Reader Services
Dedication Page
About the Authors
Contributor
About the Technical Editor
Acknowledgments

Contents at a Glance
Contents
Foreword
Preface
Motivation for Writing This Book; Prerequisites; Who Should Read This Book; How to Use This Book
Part I: Introduction to Hadoop
Chapter 1. Understanding the Big Data World
The Data Revolution; Traditional Data Systems; Semi-Structured and Unstructured Data; Causation and Correlation; Data Challenges; The Modern Data Architecture; Organizational Transformation; Industry Transformation; Summary
Chapter 2. Hadoop Fundamental Concepts
Types of Data in Hadoop; Use Cases; What Is Hadoop?; Hadoop Distributions; Hadoop Frameworks; NoSQL Databases; What Is NoSQL?; A Hadoop Cluster; Hadoop Software Processes; Hadoop Hardware Profiles; Roles in the Hadoop Environment; Summary
Chapter 3. YARN and HDFS
A Hadoop Cluster Is Distributed; Hadoop Directory Layouts; Hadoop Operating System Users; The Hadoop Distributed File System; YARN Logging; The NameNode; The DataNode; Block Placement; NameNode Configurations and Managing Metadata; Rack Awareness; Block Management; The Balancer; Maintaining Data Integrity in the Cluster; Quotas and Trash; YARN and the YARN Processing Model; Running Applications on YARN; Resource Schedulers; Benchmarking; TeraSort Benchmarking Suite; Summary
Chapter 4. The Modern Data Platform
Designing a Hadoop Cluster; Enterprise Data Movement; Summary
Chapter 5. Data Ingestion
Extraction, Loading, and Transformation (ELT); Sqoop: Data Movement with SQL Sources; Flume: Streaming Data; Oozie: Scheduling and Workflow; Falcon: Data Lifecycle Management; Kafka: Real-time Data Streaming; Summary
Chapter 6. Hadoop SQL Engines
Where SQL Was Born; SQL in Hadoop; Hadoop SQL Engines; Selecting the SQL Tool For Hadoop; Now Getting Groovy with Hive and Pig; Hive; HCatalog; Pig; Summary
Chapter 7. Multitenancy in Hadoop
Securing the Access; Authentication; Auditing; Authorization; Data Protection; Isolating the Data; Isolating the Process; Summary
Part II: Introduction to Virtualization
Chapter 8. Virtualization Fundamentals
Why Virtualize Hadoop?; Introduction to Virtualization; Summary; References
Chapter 9. Best Practices for Virtualizing Hadoop
Running Virtualized Hadoop with Purpose and Discipline; The Discipline of Purpose Starts with a Clear Target; Virtualizing Different Tiers of Hadoop; Industry Best Practices; Summary
Part III: Virtualizing Hadoop
Chapter 10. Virtualizing Hadoop
How Are Hadoop Ecosystems Going to Be Managed?; Building an Enterprise Hadoop Platform That Is Agile and Flexible; Clarification of Terms; The Journey from Bare-Metal to Virtualization; Why Consider Virtualizing Hadoop?; Benefits of Virtualizing Hadoop; Virtualized Hadoop Can Run as Fast or Faster Than Native; Coordination and Cross-Purpose Specialization Is the Future; Barriers Can Be Organizational; Virtualization Is Not an All or Nothing Option; Rapid Provisioning and Improving Quality of Development and Test Environments; Improve High Availability with Virtualization; Use Virtualization to Leverage Hadoop Workloads; Hadoop in the Cloud; Big Data Extensions; The Path to Virtualization; The Software-Defined Data Center; Virtualizing the Network; vRealize Suite; Summary; References
Chapter 11. Virtualizing Hadoop Master Servers
Virtualizing Servers in a Hadoop Cluster; Virtualizing the Environment Around Hadoop; Virtualizing the Master Hadoop Servers; Virtualizing Without the SAN; Summary
Chapter 12. Virtualizing the Hadoop Worker Nodes
A Brief Introduction to the Worker Nodes in Hadoop; Deployment Models for Hadoop Clusters; The Combined Model; The Separated Model; Network Effects of the Data-Compute Separation; The Shared-Storage Approach to the Data-Compute Separated Model; Local Disks for the Application’s Temporary Data; The Shared Storage Architecture Model Using Network-Attached Storage (NAS); Deployment Model Summary; Best Practices for Virtualizing Hadoop Workers; Disk I/O; The Hadoop Virtualization Extensions (HVE); Summary; References; Resources
Chapter 13. Deploying Hadoop as a Service in the Private Cloud
The Cloud Context; Stakeholders for Hadoop; Overview of the Solution Architecture; Summary; References
Chapter 14. Understanding the Installation of Hadoop
Map the Right Solutions to the Right Use Case; Thoughts About Installing Hadoop; Configuring Repositories; Installing HDP 2.2; Environment Preparation; Setting Up the Hadoop Configuration; Starting HDFS and YARN; Start YARN; Verifying MapReduce Functionality; Installing and Configuring Hive; Installing and Configuring MySQL Database; Installing and Configuring Hive and HCatalog; Summary
Chapter 15. Configuring Linux for Hadoop
Supported Linux Platforms; Different Deployment Models; Linux Golden Templates; Building a Linux Enterprise Hadoop Platform; Selecting the Linux Distribution; Optimal Linux Kernel Parameters and System Settings; epoll; Disable Swap Space; Disable Security During Install; IO Scheduler Tuning; Check Transparent Huge Pages Configuration; Limits.conf; Partition Alignment for RDMs; File System Considerations; Lazy Count Parameter for XFS; Mount Options; I/O Scheduler; Disk Read and Write Options; Storage Benchmarking; Java Version; Set Up NTP; Enable Jumbo Frames; Additional Network Considerations; Summary
Appendix A. Hadoop Cluster Creation: A Prerequisite Checklist
Appendix B. Big Data/Hadoop on VMware vSphere Reference Materials
Deployment Guides; Reference Architectures; Customer Case Studies; Performance; vSphere Big Data Extensions (BDE); Other vSphere Features and Big Data
Index
Code Snippets

Content preview from Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture

Chapter 6. Hadoop SQL Engines

Data is the new oil. No: Data is the new soil.

—David McCandless

One of the biggest decisions in the design of a Hadoop ecosystem is selecting the SQL engines for the use cases. You have to ask yourself, for different types of applications and projects, should we use Hive on Tez, Impala, Spark SQL, Phoenix for HBase, and so on? The decision gets harder as each new release adds functionality that overlaps other SQL engines. In this chapter we discuss Hadoop SQL engines and two of the primary tools that use these engines, Hive and Pig.

Where SQL Was Born

In the early days of computing, everything was file based and only geeks could parse and process such data. With RDBMSs, SQL became the universal language of data ...



