Architecting Modern Data Platforms

Book Description

There’s a lot of information about big data technologies, but splicing these technologies into an end-to-end enterprise data platform is a daunting task not widely covered. With this practical book, you’ll learn how to build big data infrastructure both on-premises and in the cloud and successfully architect a modern data platform.

Ideal for enterprise architects, IT managers, application architects, and data engineers, this book shows you how to overcome the many challenges that emerge during Hadoop projects. You’ll explore the vast landscape of tools available in the Hadoop and big data realm in a thorough technical primer before diving into:

  • Infrastructure: Look at all component layers in a modern data platform, from the server to the data center, to establish a solid foundation for data in your enterprise
  • Platform: Understand aspects of deployment, operation, security, high availability, and disaster recovery, along with everything you need to know to integrate your platform with the rest of your enterprise IT
  • Taking Hadoop to the cloud: Learn the important architectural aspects of running a big data platform in the cloud while maintaining enterprise security and high availability

Table of Contents

  1. Foreword
  2. Preface
    1. Some Misconceptions
    2. Some General Trends
      1. Horizontal Scaling
      2. Adoption of Open Source
      3. Embracing Cloud Compute
      4. Decoupled Compute and Storage
    3. What Is This Book About?
    4. Who Should Read This Book?
    5. The Road Ahead
    6. Conventions Used in This Book
    7. O’Reilly Safari
    8. How to Contact Us
    9. Acknowledgments
  3. 1. Big Data Technology Primer
    1. A Tour of the Landscape
      1. Core Components
      2. Computational Frameworks
      3. Analytical SQL Engines
      4. Storage Engines
      5. Ingestion
      6. Orchestration
    2. Summary
  4. I. Infrastructure
  5. 2. Clusters
    1. Reasons for Multiple Clusters
      1. Multiple Clusters for Resiliency
      2. Multiple Clusters for Software Development
      3. Multiple Clusters for Workload Isolation
      4. Multiple Clusters for Legal Separation
      5. Multiple Clusters and Independent Storage and Compute
    2. Multitenancy
      1. Requirements for Multitenancy
    3. Sizing Clusters
      1. Sizing by Storage
      2. Sizing by Ingest Rate
      3. Sizing by Workload
    4. Cluster Growth
      1. The Drivers of Cluster Growth
      2. Implementing Cluster Growth
    5. Data Replication
      1. Replication for Software Development
      2. Replication and Workload Isolation
    6. Summary
  6. 3. Compute and Storage
    1. Computer Architecture for Hadoop
      1. Commodity Servers
      2. Server CPUs and RAM
      3. Nonuniform Memory Access
      4. CPU Specifications
      5. RAM
    2. Commoditized Storage Meets the Enterprise
      1. Modularity of Compute and Storage
      2. Everything Is Java
      3. Replication or Erasure Coding?
      4. Alternatives
    3. Hadoop and the Linux Storage Stack
      1. User Space
      2. Important System Calls
      3. The Linux Page Cache
      4. Short-Circuit and Zero-Copy Reads
      5. Filesystems
    4. Erasure Coding Versus Replication
      1. Discussion
      2. Guidance
    5. Low-Level Storage
      1. Storage Controllers
      2. Disk Layer
    6. Server Form Factors
      1. Form Factor Comparison
      2. Guidance
    7. Workload Profiles
    8. Cluster Configurations and Node Types
      1. Master Nodes
      2. Worker Nodes
      3. Utility Nodes
      4. Edge Nodes
      5. Small Cluster Configurations
      6. Medium Cluster Configurations
      7. Large Cluster Configurations
    9. Summary
  7. 4. Networking
    1. How Services Use a Network
      1. Remote Procedure Calls (RPCs)
      2. Data Transfers
      3. Monitoring
      4. Backup
      5. Consensus
    2. Network Architectures
      1. Small Cluster Architectures
      2. Medium Cluster Architectures
      3. Large Cluster Architectures
    3. Network Integration
      1. Reusing an Existing Network
      2. Creating an Additional Network
    4. Network Design Considerations
      1. Layer 1 Recommendations
      2. Layer 2 Recommendations
      3. Layer 3 Recommendations
    5. Summary
  8. 5. Organizational Challenges
    1. Who Runs It?
    2. Is It Infrastructure, Middleware, or an Application?
    3. Case Study: A Typical Business Intelligence Project
      1. The Traditional Approach
      2. Typical Team Setup
      3. Compartmentalization of IT
      4. Revised Team Setup for Hadoop in the Enterprise
      5. Solution Overview with Hadoop
      6. New Team Setup
      7. Split Responsibilities
      8. Do I Need DevOps?
      9. Do I Need a Center of Excellence/Competence?
    4. Summary
  9. 6. Datacenter Considerations
    1. Why Does It Matter?
    2. Basic Datacenter Concepts
      1. Cooling
      2. Power
      3. Network
      4. Rack Awareness and Rack Failures
      5. Failure Domain Alignment
    3. Space and Racking Constraints
    4. Ingest and Intercluster Connectivity
      1. Software
      2. Hardware
    5. Replacements and Repair
      1. Operational Procedures
    6. Typical Pitfalls
      1. Networking
      2. Cluster Spanning
    7. Summary
  10. II. Platform
  11. 7. Provisioning Clusters
    1. Operating Systems
      1. OS Choices
      2. OS Configuration for Hadoop
      3. Automated Configuration Example
    2. Service Databases
      1. Required Databases
      2. Database Integration Options
      3. Database Considerations
    3. Hadoop Deployment
      1. Hadoop Distributions
      2. Installation Choices
      3. Distribution Architecture
      4. Installation Process
    4. Summary
  12. 8. Platform Validation
    1. Testing Methodology
    2. Useful Tools
    3. Hardware Validation
      1. CPU
      2. Disks
      3. Network
    4. Hadoop Validation
      1. HDFS Validation
      2. General Validation
    5. Validating Other Components
      1. Operations Validation
    6. Summary
  13. 9. Security
    1. In-Flight Encryption
      1. TLS Encryption
      2. SASL Quality of Protection
      3. Enabling in-Flight Encryption
    2. Authentication
      1. Kerberos
      2. LDAP Authentication
      3. Delegation Tokens
      4. Impersonation
    3. Authorization
      1. Group Resolution
      2. Superusers and Supergroups
      3. Hadoop Service Level Authorization
      4. Centralized Security Management
      5. HDFS
      6. YARN
      7. ZooKeeper
      8. Hive
      9. Impala
      10. HBase
      11. Solr
      12. Kudu
      13. Oozie
      14. Hue
      15. Kafka
      16. Sentry
    4. At-Rest Encryption
      1. Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server
      2. HDFS Transparent Data Encryption
      3. Encrypting Temporary Files
    5. Summary
  14. 10. Integration with Identity Management Providers
    1. Integration Areas
    2. Integration Scenarios
      1. Scenario 1: Writing a File to HDFS
      2. Scenario 2: Submitting a Hive Query
      3. Scenario 3: Running a Spark Job
    3. Integration Providers
    4. LDAP Integration
      1. Background
      2. LDAP Security
      3. Load Balancing
      4. Application Integration
      5. Linux Integration
    5. Kerberos Integration
      1. Kerberos Clients
      2. KDC Integration
    6. Certificate Management
      1. Signing Certificates
      2. Converting Certificates
      3. Wildcard Certificates
      4. Automation
    7. Summary
  15. 11. Accessing and Interacting with Clusters
    1. Access Mechanisms
      1. Programmatic Access
      2. Command-Line Access
      3. Web UIs
    2. Access Topologies
      1. Interaction Patterns
      2. Proxy Access
      3. Load Balancing
      4. Edge Node Interactions
    3. Access Security
      1. Administration Gateways
    4. Workbenches
      1. Hue
      2. Notebooks
    5. Landing Zones
    6. Summary
  16. 12. High Availability
    1. High Availability Defined
      1. Lateral/Service HA
      2. Vertical/Systemic HA
    2. Measuring Availability
      1. Percentages
      2. Percentiles
    3. Operating for HA
      1. Monitoring
      2. Playbooks and Postmortems
    4. HA Building Blocks
      1. Quorums
      2. Load Balancing
      3. Database HA
      4. Ancillary Services
    5. General Considerations
      1. Separation of Master and Worker Processes
      2. Separation of Identical Service Roles
      3. Master Servers in Separate Failure Domains
      4. Balanced Master Configurations
      5. Optimized Server Configurations
    6. High Availability of Cluster Services
      1. ZooKeeper
      2. HDFS
      3. YARN
      4. HBase
      5. KMS
      6. Hive
      7. Impala
      8. Solr
      9. Kafka
      10. Oozie
      11. Hue
      12. Other Services
      13. Autoconfiguration
    7. Summary
  17. 13. Backup and Disaster Recovery
    1. Context
      1. Many Distributed Systems
      2. Policies and Objectives
      3. Failure Scenarios
      4. Suitable Data Sources
      5. Strategies
      6. Data Types
      7. Consistency
      8. Validation
      9. Summary
    2. Data Replication
      1. HBase
      2. Cluster Management Tools
      3. Kafka
      4. Summary
    3. Hadoop Cluster Backups
      1. Subsystems
      2. Case Study: Automating Backups with Oozie
    4. Restore
    5. Summary
  18. III. Taking Hadoop to the Cloud
  19. 14. Basics of Virtualization for Hadoop
    1. Compute Virtualization
      1. Virtual Machine Distribution
      2. Anti-Affinity Groups
    2. Storage Virtualization
      1. Virtualizing Local Storage
      2. SANs
      3. Object Storage and Network-Attached Storage
    3. Network Virtualization
    4. Cluster Life Cycle Models
    5. Summary
  20. 15. Solutions for Private Clouds
    1. OpenStack
      1. Automation and Integration
      2. Life Cycle and Storage
      3. Isolation
      4. Summary
    2. OpenShift
      1. Automation
      2. Life Cycle and Storage
      3. Isolation
      4. Summary
    3. VMware and Pivotal Cloud Foundry
    4. Do It Yourself?
      1. Automation
      2. Isolation
      3. Life Cycle Model
      4. Summary
    5. Object Storage for Private Clouds
      1. EMC Isilon
      2. Ceph
    6. Summary
  21. 16. Solutions in the Public Cloud
    1. Key Things to Know
    2. Cloud Providers
      1. AWS
      2. Microsoft Azure
      3. Google Cloud Platform
    3. Implementing Clusters
      1. Instances
      2. Storage and Life Cycle Models
      3. Network Architecture
      4. High Availability
    4. Summary
  22. 17. Automated Provisioning
    1. Long-Lived Clusters
      1. Configuration and Templating
      2. Deployment Phases
      3. Vendor Solutions
      4. One-Click Deployments
      5. Homegrown Automation
      6. Hooking Into a Provisioning Life Cycle
      7. Scaling Up and Down
      8. Deploying with Security
    2. Transient Clusters
    3. Sharing Metadata Services
    4. Summary
  23. 18. Security in the Cloud
    1. Assessing the Risk
    2. Risk Model
      1. Environmental Risks
      2. Deployment Risks
    3. Identity Provider Options for Hadoop
      1. Option A: Cloud-Only Self-Contained ID Services
      2. Option B: Cloud-Only Shared ID Services
      3. Option C: On-Premises ID Services
    4. Object Storage Security and Hadoop
      1. Identity and Access Management
      2. Amazon Simple Storage Service
      3. GCP Cloud Storage
      4. Microsoft Azure
    5. Auditing
    6. Encryption for Data at Rest
      1. Requirements for Key Material
      2. Options for Encryption in the Cloud
      3. On-Premises Key Persistence
      4. Encryption via the Cloud Provider
      5. Encryption Feature and Interoperability Summary
      6. Recommendations and Summary for Cloud Encryption
    7. Encrypting Data in Flight in the Cloud
    8. Perimeter Controls and Firewalling
      1. GCP
      2. AWS
      3. Azure
    9. Summary
  24. A. Backup Onboarding Checklist
    1. Backup Onboarding Checklist
      1. Backup
    2. Services
      1. Cloudera Manager
      2. HDFS
      3. HBase
      4. Hive/Impala
      5. Sqoop
      6. Oozie
      7. Hue
      8. Sentry
  25. Index