Database Internals

by Alex Petrov

October 2019

Intermediate to advanced

370 pages

10h 23m

English

O'Reilly Media, Inc.

Part I Conclusion

In Part I, we’ve been talking about storage engines. We started with high-level database system architecture and classification, then learned how to implement on-disk storage structures and how they fit into the full picture alongside other components.

We’ve seen several storage structures, starting with B-Trees. The structures discussed do not represent the entire field, and there are many other interesting developments. However, these examples still illustrate the three properties we identified at the beginning of this part: buffering, immutability, and ordering. These properties are useful for describing, memorizing, and expressing different aspects of storage structures.

Figure I-1 summarizes the discussed storage structures and shows which of these properties each one uses.

Adding in-memory buffers always has a positive impact on write amplification. In in-place update structures like WiredTiger and LA-Trees, in-memory buffering amortizes the cost of multiple writes to the same page by combining them into a single page write.
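The amortization effect can be illustrated with a small counting model. This is only a sketch under simplifying assumptions, not how WiredTiger or LA-Trees are implemented: each update targets a single page, the buffer holds a fixed number of pending updates, and a flush writes every dirty page exactly once. The function names and the `buffer_capacity` parameter are hypothetical and introduced here for illustration.

```python
def unbuffered_page_writes(updates):
    """Without buffering, every update rewrites its target page immediately."""
    return len(updates)


def buffered_page_writes(updates, buffer_capacity):
    """Count page writes when updates are held in memory before flushing.

    `updates` is a sequence of page IDs, one per logical update.
    `buffer_capacity` (an assumption of this sketch) is the number of
    pending updates that triggers a flush. On flush, each dirty page is
    written once, however many updates it received.
    """
    page_writes = 0
    dirty_pages = set()
    pending = 0
    for page_id in updates:
        dirty_pages.add(page_id)
        pending += 1
        if pending == buffer_capacity:
            page_writes += len(dirty_pages)  # one write per distinct page
            dirty_pages.clear()
            pending = 0
    page_writes += len(dirty_pages)  # final flush of whatever remains
    return page_writes


# Eight logical updates that touch only three distinct pages.
updates = [1, 1, 2, 1, 2, 3, 3, 1]

print(unbuffered_page_writes(updates))   # 8
print(buffered_page_writes(updates, 4))  # 5
print(buffered_page_writes(updates, 8))  # 3
```

In this model, a larger buffer lets more same-page updates collapse into one page write, so the ratio of physical page writes to logical updates drops as the buffer grows.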

In immutable structures, such as multicomponent LSM Trees and FD-Trees, buffering has a similar positive effect, but at the cost of future rewrites when data moves from one immutable level to the next. In other words, using immutability may lead to deferred write amplification. At the same time, using immutability has a positive impact on concurrency and space amplification, since ...