O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Database Internals

Book Description

When it comes to choosing, using, and maintaining a database, understanding its internals is essential. But with so many distributed databases and tools available today, it’s often difficult to understand what each one offers and how they differ. With this practical guide, Alex Petrov guides developers through the concepts behind modern database and storage engine internals.

Throughout the book, you’ll explore relevant material gleaned from numerous books, papers, blog posts, and the source code of several open source databases. These resources are listed at the end of parts one and two. You’ll discover that the most significant distinctions among many modern databases reside in subsystems that determine how storage is organized and how data is distributed.

This book examines:

  • Storage engines: Explore storage classification and taxonomy, and dive into B-Tree-based and immutable Log Structured storage engines, with differences and use-cases for each
  • Storage building blocks: Learn how database files are organized to build efficient storage, using auxiliary data structures such as Page Cache, Buffer Pool and Write-Ahead Log
  • Distributed systems: Learn step-by-step how nodes and processes connect and build complex communication patterns
  • Database clusters: Which consistency models are commonly used by modern databases and how distributed storage systems achieve consistency

Table of Contents

  1. Preface
    1. How to Contact Us
  2. I. Storage Engines
  3. 1. Introduction and Overview
    1. DBMS Architecture
    2. Memory- Versus Disk-Based DBMS
      1. Durability in Memory-Based Stores
    3. Column- Versus Row-Oriented DBMS
      1. Row-Oriented Data Layout
      2. Column-Oriented Data Layout
      3. Distinctions and Optimizations
      4. Wide Column Stores
    4. Data Files and Index Files
      1. Data Files
      2. Index Files
      3. Primary Index as an Indirection
    5. Buffering, Immutability, and Ordering
    6. Summary
  4. 2. B-Tree Basics
    1. Binary Search Trees
      1. Tree Balancing
      2. Trees for Disk-Based Storage
    2. Disk-Based Structures
      1. Hard Disk Drives
      2. Solid State Drives
      3. On-Disk Structures
    3. Ubiquitous B-Trees
      1. B-Tree Hierarchy
      2. Separator Keys
      3. B-Tree Lookup Complexity
      4. B-Tree Lookup Algorithm
      5. Counting Keys
      6. B-Tree Node Splits
      7. B-Tree Node Merges
    4. Summary
  5. 3. File Formats
    1. Motivation
    2. Binary Encoding
      1. Primitive Types
      2. Strings and Variable-Size Data
      3. Bit-Packed Data: Booleans, Enums, and Flags
    3. General Principles
    4. Page Structure
    5. Slotted Pages
    6. Cell Layout
    7. Combining Cells into Slotted Pages
    8. Managing Variable-Size Data
    9. Versioning
    10. Checksumming
    11. Summary
  6. 4. Implementing B-Trees
    1. Page Header
      1. Magic Numbers
      2. Sibling Links
      3. Rightmost Pointers
      4. Node High Keys
      5. Overflow Pages
    2. Binary Search
      1. Binary Search with Indirection Pointers
    3. Propagating Splits and Merges
      1. Breadcrumbs
    4. Rebalancing
    5. Right-Only Appends
      1. Bulk Loading
    6. Compression
    7. Vacuum and Maintenance
      1. Fragmentation Caused by Updates and Deletes
      2. Page Defragmentation
    8. Summary
  7. 5. Transaction Processing and Recovery
    1. Buffer Management
      1. Caching Semantics
      2. Cache Eviction
      3. Locking Pages in Cache
      4. Page Replacement
    2. Recovery
      1. Log Semantics
      2. Operation Versus Data Log
      3. Steal and Force Policies
      4. ARIES
    3. Concurrency Control
      1. Serializability
      2. Transaction Isolation
      3. Read and Write Anomalies
      4. Isolation Levels
      5. Optimistic Concurrency Control
      6. Multiversion Concurrency Control
      7. Pessimistic Concurrency Control
      8. Lock-Based Concurrency Control
    4. Summary
  8. 6. B-Tree Variants
    1. Copy-on-Write
      1. Implementing Copy-on-Write: LMDB
    2. Abstracting Node Updates
    3. Lazy B-Trees
      1. WiredTiger
      2. Lazy-Adaptive Tree
    4. FD-Trees
      1. Fractional Cascading
      2. Logarithmic Runs
    5. Bw-Trees
      1. Update Chains
      2. Taming Concurrency with Compare-and-Swap
      3. Structural Modification Operations
      4. Consolidation and Garbage Collection
    6. Cache-Oblivious B-Trees
      1. van Emde Boas Layout
    7. Summary
  9. 7. Log-Structured Storage
    1. LSM Trees
      1. LSM Tree Structure
      2. Updates and Deletes
      3. LSM Tree Lookups
      4. Merge-Iteration
      5. Reconciliation
      6. Maintenance in LSM Trees
    2. Read, Write, and Space Amplification
      1. RUM Conjecture
    3. Implementation Details
      1. Sorted String Tables
      2. Bloom Filters
      3. Skiplist
      4. Disk Access
      5. Compression
    4. Unordered LSM Storage
      1. Bitcask
      2. WiscKey
    5. Concurrency in LSM Trees
    6. Log Stacking
      1. Flash Translation Layer
      2. Filesystem Logging
    7. LLAMA and Mindful Stacking
      1. Open-Channel SSDs
    8. Summary
  10. Part I Conclusion
  11. II. Distributed Systems
  12. 8. Introduction and Overview
    1. Concurrent Execution
      1. Shared State in a Distributed System
    2. Fallacies of Distributed Computing
      1. Processing
      2. Clocks and Time
      3. State Consistency
      4. Local and Remote Execution
      5. Need to Handle Failures
      6. Network Partitions and Partial Failures
      7. Cascading Failures
    3. Distributed Systems Abstractions
      1. Links
    4. Two Generals’ Problem
    5. FLP Impossibility
    6. System Synchrony
    7. Failure Models
      1. Crash Faults
      2. Omission Faults
      3. Arbitrary Faults
      4. Handling Failures
      5. Summary
  13. 9. Failure Detection
    1. Heartbeats and Pings
      1. Timeout-Free Failure Detector
      2. Outsourced Heartbeats
    2. Phi-Accural Failure Detector
    3. Gossip and Failure Detection
    4. Reversing Failure Detection Problem Statement
    5. Summary
  14. 10. Leader Election
    1. Bully Algorithm
    2. Next-In-Line Failover
    3. Candidate/Ordinary Optimization
    4. Invitation Algorithm
    5. Ring Algorithm
    6. Summary
  15. 11. Replication and Consistency
    1. Achieving Availability
    2. Infamous CAP
      1. Use CAP Carefully
      2. Harvest and Yield
    3. Shared Memory
    4. Ordering
    5. Consistency Models
      1. Strict Consistency
      2. Linearizability
      3. Sequential Consistency
      4. Causal Consistency
    6. Session Models
    7. Eventual Consistency
    8. Tunable Consistency
    9. Witness Replicas
    10. Strong Eventual Consistency and CRDTs
    11. Summary
  16. 12. Anti-Entropy and Dissemination
    1. Read Repair
    2. Digest Reads
    3. Hinted Handoff
    4. Merkle Trees
    5. Bitmap Version Vectors
    6. Gossip Dissemination
      1. Gossip Mechanics
      2. Overlay Networks
      3. Hybrid Gossip
      4. Partial Views
    7. Summary
  17. 13. Distributed Transactions
    1. Making Operations Appear Atomic
    2. Two-Phase Commit
      1. Cohort Failures in 2PC
      2. Coordinator Failures in 2PC
    3. Three-Phase Commit
      1. Coordinator Failures in 3PC
    4. Distributed Transactions with Calvin
    5. Distributed Transactions with Spanner
    6. Database Partitioning
      1. Consistent Hashing
    7. Distributed Transactions with Percolator
    8. Coordination Avoidance
    9. Summary
  18. 14. Consensus
    1. Broadcast
    2. Atomic Broadcast
      1. Virtual Synchrony
      2. Zookeeper Atomic Broadcast (ZAB)
    3. Paxos
      1. Paxos Algorithm
      2. Quorums in Paxos
      3. Failure Scenarios
      4. Multi-Paxos
      5. Fast Paxos
      6. Egalitarian Paxos
      7. Flexible Paxos
      8. Generalized Solution to Consensus
    4. Raft
      1. Leader Role in Raft
      2. Failure Scenarios
    5. Byzantine Consensus
      1. PBFT Algorithm
      2. Recovery and Checkpointing
    6. Summary
  19. Part II Conclusion
  20. A. Bibliography
  21. Index