Database Reliability Engineering

Book Description

The infrastructure-as-code revolution in IT is also affecting database administration. With this practical book, developers, system administrators, and junior to mid-level DBAs will learn how the modern practice of site reliability engineering applies to the craft of database architecture and operations. Authors Laine Campbell and Charity Majors provide a framework for professionals looking to join the ranks of today’s database reliability engineers (DBREs).

You’ll begin by exploring core operational concepts that DBREs need to master. Then you’ll examine a wide range of database persistence options, including how to implement key technologies to provide resilient, scalable, and performant data storage and retrieval. With a firm foundation in database reliability engineering, you’ll be ready to dive into the architecture and operations of any modern database.

This book covers:

  • Service-level requirements and risk management
  • Building and evolving an architecture for operational visibility
  • Infrastructure engineering and infrastructure management
  • How to facilitate the release management process
  • Data storage, indexing, and replication
  • Identifying datastore characteristics and best use cases
  • Datastore architectural components and data-driven architectures

Table of Contents

  1. Foreword
  2. Preface
    1. Why We Wrote This Book
    2. Who This Book Is For
    3. How This Book Is Organized
    4. Conventions Used in This Book
    5. O’Reilly Safari
    6. How to Contact Us
  3. 1. Introducing Database Reliability Engineering
    1. Guiding Principles of the DBRE
      1. Protect the Data
      2. Self-Service for Scale
      3. Elimination of Toil
      4. Databases Are Not Special Snowflakes
      5. Eliminate the Barriers Between Software and Operations
    2. Operations Core Overview
    3. Hierarchy of Needs
      1. Survival and Safety
      2. Love and Belonging
      3. Esteem
      4. Self-actualization
    4. Wrapping Up
  4. 2. Service-Level Management
    1. Why Do I Need Service-Level Objectives?
    2. Service-Level Indicators
      1. Latency
      2. Availability
      3. Throughput
      4. Durability
      5. Cost or Efficiency
    3. Defining Service Objectives
      1. Latency Indicators
      2. Availability Indicators
      3. Throughput Indicators
    4. Monitoring and Reporting on SLOs
      1. Monitoring Availability
      2. Monitoring Latency
      3. Monitoring Throughput
      4. Monitoring Cost and Efficiency
    5. Wrapping Up
  5. 3. Risk Management
    1. Risk Considerations
      1. Unknown Factors and Complexity
      2. Availability of Resources
      3. Human Factors
      4. Group Factors
    2. What Do We Do?
    3. What Not to Do
    4. A Working Process: Bootstrapping
      1. Service Risk Evaluation
      2. Architectural Inventory
      3. Prioritization
        1. Control and Decision Making
    5. Ongoing Iterations
    6. Wrapping Up
  6. 4. Operational Visibility
    1. The New Rules of Operational Visibility
      1. Treat OpViz Systems Like BI Systems
      2. Distributed Ephemeral Environments Trending to the Norm
      3. Store at High Resolutions for Key Metrics
      4. Keep Your Architecture Simple
    2. An OpViz Framework
    3. Data In
      1. Telemetry/Metrics
      2. Events
      3. Logs
    4. Data Out
    5. Bootstrapping Your Monitoring
      1. Is the Data Safe?
      2. Is the Service Up?
      3. Are the Consumers in Pain?
    6. Instrumenting the Application
      1. Distributed Tracing
      2. Events and Logs
    7. Instrumenting the Server or Instance
      1. Events and Logs
    8. Instrumenting the Datastore
    9. Datastore Connection Layer
      1. Utilization
      2. Saturation
      3. Errors
    10. Internal Database Visibility
      1. Throughput and Latency Metrics
      2. Commits, Redo, and Journaling
      3. Replication State
      4. Memory Structures
      5. Locking and Concurrency
    11. Database Objects
    12. Database Queries
    13. Database Asserts and Events
    14. Wrapping Up
  7. 5. Infrastructure Engineering
    1. Hosts
      1. Physical Servers
      2. Operating a System and Kernel
      3. Storage Area Networks
      4. Benefits of Physical Servers
      5. Cons of Physical Servers
    2. Virtualization
      1. Hypervisor
      2. Concurrency
      3. Storage
      4. Use Cases
    3. Containers
    4. Database as a Service
      1. Challenges of DBaaS
      2. The DBRE and the DBaaS
    5. Wrapping Up
  8. 6. Infrastructure Management
    1. Version Control
    2. Configuration Definition
    3. Building from Configuration
    4. Maintaining Configuration
      1. Enforcement of Configuration Definitions
    5. Infrastructure Definition and Orchestration
      1. Monolithic Infrastructure Definitions
      2. Separating Vertically
      3. Separated Tiers (Horizontal Definitions)
    6. Acceptance Testing and Compliance
    7. Service Catalog
    8. Bringing It All Together
    9. Development Environments
    10. Wrapping Up
  9. 7. Backup and Recovery
    1. Core Concepts
      1. Physical versus Logical
      2. Online versus Offline
      3. Full, Incremental, and Differential
    2. Considerations for Recovery
    3. Recovery Scenarios
      1. Planned Recovery Scenarios
      2. Unplanned Scenarios
      3. Scenario Scope
      4. Scenario Impact
    4. Anatomy of a Recovery Strategy
      1. Building Block 1: Detection
      2. Building Block 2: Tiered Storage
      3. Building Block 3: A Varied Toolbox
      4. Building Block 4: Testing
    5. A Recovery Strategy Defined
      1. Online, Fast Storage with Full and Incremental Backups
      2. Online, Slow Storage with Full and Incremental Backups
      3. Offline Storage
      4. Object Storage
    6. Wrapping Up
  10. 8. Release Management
    1. Education and Collaboration
      1. Become a Funnel
      2. Foster Conversations
      3. Domain-Specific Knowledge
      4. Collaboration
    2. Integration
      1. Prerequisites
    3. Testing
      1. Test-Friendly Development Practices
      2. Post-Commit Testing
      3. Full Dataset Testing
      4. Downstream Tests
      5. Operational Tests
    4. Deployment
      1. Migrations and Versioning
      2. Impact Analysis
      3. Migration Patterns
      4. Manual or Automated
    5. Wrapping Up
  11. 9. Security
    1. The Purpose of Security
      1. Protecting Data from Theft
      2. Protecting from Purposeful Damage
      3. Protecting from Accidental Damage
      4. Protecting Data from Exposure
      5. Compliance and Auditing Standards
    2. Database Security as a Function
      1. Education and Collaboration
      2. Self-Service
      3. Integration and Testing
      4. Operational Visibility
    3. Vulnerabilities and Exploits
      1. STRIDE
      2. DREAD
      3. Basic Precautions
      4. Denial of Service
      5. SQL Injection
      6. Network and Authentication Protocols
    4. Encryption of Data
      1. Financial Data
      2. Personal Health Data
      3. Private Individual Data
      4. Military or Government Data
      5. Confidential/Sensitive Business Data
      6. Data in Transit
      7. Data in the Database
      8. Data in the Filesystem
    5. Wrapping Up
  12. 10. Data Storage, Indexing, and Replication
    1. Data Structure Storage
      1. Database Row Storage
      2. Sorted-String Tables and Log-Structured Merge Trees
      3. Indexing
      4. Logs and Databases
    2. Data Replication
      1. Single-Leader
      2. Multi-Leader Replication
    3. Wrapping Up
  13. 11. Datastore Field Guide
    1. Conceptual Attributes of a Datastore
      1. The Data Model
      2. Transactions
      3. BASE
    2. Internal Attributes of a Datastore
      1. Storage
      2. The Ubiquitous CAP Theorem Section
      3. Consistency Latency Trade-offs
      4. Availability
    3. Wrapping Up
  14. 12. A Data Architecture Sampler
    1. Architectural Components
      1. Frontend Datastores
      2. Data Access Layer
      3. Database Proxies
      4. Event and Message Systems
      5. Caches and Memory Stores
    2. Data Architectures
      1. Lambda and Kappa
      2. Event Sourcing
      3. CQRS
    3. Wrapping Up
  15. 13. Making the Case for DBRE
    1. A Culture of Database Reliability
      1. Breaking Down Barriers
      2. Data-Driven Decision Making
      3. Data Integrity and Recoverability
    2. Wrapping Up
  16. Index