Designing Data-Intensive Applications, 2nd Edition

by Martin Kleppmann, Chris Riccomini

February 2026

Intermediate to advanced

672 pages

21h 56m

English

O'Reilly Media, Inc.

Preface
  Who Should Read This Book?
  What’s New in the Second Edition?
  References and Further Reading
  Conventions Used in This Book
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Trade-Offs in Data Systems Architecture
  Operational Versus Analytical Systems
  Characterizing Transaction Processing and Analytics
  Data Warehousing
  Systems of Record and Derived Data
  Cloud Versus Self-Hosting
  Pros and Cons of Cloud Services
  Cloud Native System Architecture
  Operations in the Cloud Era
  Distributed Versus Single-Node Systems
  Problems with Distributed Systems
  Microservices and Serverless
  Cloud Computing Versus Supercomputing
  Data Systems, Law, and Society
  Summary
2. Defining Nonfunctional Requirements
  Case Study: Social Network Home Timelines
  Representing Users, Posts, and Follows
  Materializing and Updating Timelines
  Describing Performance
  Latency and Response Time
  Average, Median, and Percentiles
  Use of Response Time Metrics
  Reliability and Fault Tolerance
  Fault Tolerance
  Hardware and Software Faults
  Humans and Reliability
  Scalability
  Understanding Load
  Shared-Memory, Shared-Disk, and Shared-Nothing Architectures
  Principles for Scalability
  Maintainability
  Operability: Making Life Easy for Operations
  Simplicity: Managing Complexity
  Evolvability: Making Change Easy
  Summary
3. Data Models and Query Languages
  Relational Versus Document Models
  The Object-Relational Mismatch
  Normalization, Denormalization, and Joins
  Many-to-One and Many-to-Many Relationships
  Stars and Snowflakes: Schemas for Analytics
  When to Use Which Model
  Graph-Like Data Models
  Property Graphs
  The Cypher Query Language
  Graph Queries in SQL
  Triple Stores and SPARQL
  Datalog: Recursive Relational Queries
  GraphQL
  Event Sourcing and CQRS
  DataFrames, Matrices, and Arrays
  Summary
4. Storage and Retrieval
  Storage and Indexing for OLTP
  Log-Structured Storage
  B-Trees
  Comparing B-Trees and LSM-Trees
  Multicolumn and Secondary Indexes
  Storing Values Within the Index
  Keeping Everything in Memory
  Data Storage for Analytics
  Cloud Data Warehouses
  Column-Oriented Storage
  Query Execution: Compilation and Vectorization
  Materialized Views and Data Cubes
  Multidimensional and Full-Text Indexes
  Full-Text Search
  Vector Embeddings
  Summary
5. Encoding and Evolution
  Formats for Encoding Data
  Language-Specific Formats
  JSON, XML, and Binary Variants
  Protocol Buffers
  Avro
  The Merits of Schemas
  Modes of Dataflow
  Dataflow Through Databases
  Dataflow Through Services: REST and RPC
  Durable Execution and Workflows
  Event-Driven Architectures
  Summary
6. Replication
  Single-Leader Replication
  Synchronous Versus Asynchronous Replication
  Setting Up New Followers
  Handling Node Outages
  Implementation of Replication Logs
  Problems with Replication Lag
  Solutions for Replication Lag
  Multi-Leader Replication
  Geographically Distributed Operation
  Sync Engines and Local-First Software
  Dealing with Conflicting Writes
  Leaderless Replication
  Writing to the Database When a Node Is Down
  Single-Leader Versus Leaderless Replication Performance
  Multi-Region Operation
  Detecting Concurrent Writes
  Summary
7. Sharding
  Pros and Cons of Sharding
  Sharding for Multitenancy
  Sharding of Key-Value Data
  Sharding by Key Range
  Sharding by Hash of Key
  Skewed Workloads and Relieving Hot Spots
  Operations: Automatic Versus Manual Rebalancing
  Request Routing
  Sharding and Secondary Indexes
  Local Secondary Indexes
  Global Secondary Indexes
  Summary
8. Transactions
  What Exactly Is a Transaction?
  The Meaning of ACID
  Single-Object and Multi-Object Operations
  Weak Isolation Levels
  Read Committed
  Snapshot Isolation and Repeatable Read
  Preventing Lost Updates
  Write Skew and Phantoms
  Serializability
  Actual Serial Execution
  Two-Phase Locking
  Serializable Snapshot Isolation
  Distributed Transactions
  Two-Phase Commit
  Distributed Transactions Across Different Systems
  Database-Internal Distributed Transactions
  Exactly-Once Message Processing Revisited
  Summary
9. The Trouble with Distributed Systems
  Faults and Partial Failures
  Unreliable Networks
  The Limitations of TCP
  Network Faults in Practice
  Fault Detection
  Timeouts and Unbounded Delays
  Synchronous Versus Asynchronous Networks
  Unreliable Clocks
  Monotonic Versus Time-of-Day Clocks
  Clock Synchronization and Accuracy
  Relying on Synchronized Clocks
  Process Pauses
  Knowledge, Truth, and Lies
  The Majority Rules
  Distributed Locks and Leases
  Byzantine Faults
  System Model and Reality
  Formal Methods and Randomized Testing
  Summary
10. Consistency and Consensus
  Linearizability
  What Makes a System Linearizable?
  Relying on Linearizability
  Implementing Linearizable Systems
  The Cost of Linearizability
  ID Generators and Logical Clocks
  Logical Clocks
  Linearizable ID Generators
  Consensus
  The Many Faces of Consensus
  Consensus in Practice
  Coordination Services
  Summary
11. Batch Processing
  Batch Processing with Unix Tools
  Simple Log Analysis
  Chain of Commands Versus Custom Program
  Sorting Versus In-Memory Aggregation
  Batch Processing in Distributed Systems
  Distributed Filesystems
  Object Stores
  Distributed Job Orchestration
  Batch Processing Models
  MapReduce
  Dataflow Engines
  Shuffling Data
  Joins and Grouping
  Query Languages
  DataFrames
  Batch Use Cases
  Extract–Transform–Load
  Analytics
  Machine Learning
  Serving Derived Data
  Summary
12. Stream Processing
  Transmitting Event Streams
  Messaging Systems
  Log-Based Message Brokers
  Databases and Streams
  Keeping Systems in Sync
  Change Data Capture
  State, Streams, and Immutability
  Processing Streams
  Uses of Stream Processing
  Reasoning About Time
  Stream Joins
  Fault Tolerance
  Summary
13. A Philosophy of Streaming Systems
  Data Integration
  Combining Specialized Tools by Deriving Data
  Batch and Stream Processing
  Unbundling Databases
  Composing Data Storage Technologies
  Designing Applications Around Dataflow
  Observing Derived State
  Aiming for Correctness
  The End-to-End Argument for Databases
  Enforcing Constraints
  Timeliness and Integrity
  Trust, but Verify
  Summary
14. Doing the Right Thing
  Predictive Analytics
  Bias and Discrimination
  Responsibility and Accountability
  Feedback Loops
  Privacy and Tracking
  Surveillance
  Consent and Freedom of Choice
  Privacy and Use of Data
  Data as Assets and Power
  Remembering the Industrial Revolution
  Legislation and Self-Regulation
  Summary
Glossary
Index
About the Authors

Content preview from Designing Data-Intensive Applications, 2nd Edition

Chapter 9. The Trouble with Distributed Systems

They’re funny things, Accidents. You never have them till you’re having them.

A.A. Milne, The House at Pooh Corner (1928)

As discussed in “Reliability and Fault Tolerance”, making a system reliable means ensuring that the system as a whole continues working, even when things go wrong (i.e., when there is a fault). However, anticipating all the possible faults and handling them is not that easy. As a developer, it is very tempting to focus mostly on the happy path (after all, most of the time things work fine!) and to neglect faults, since they introduce a lot of edge cases.

If you want your system to be reliable in the presence of faults, you have to radically change your mindset and focus on what could go wrong, even though it may be unlikely. It doesn’t matter whether there is only a one-in-a-million chance; in a large enough system, one-in-a-million events happen every day. Experienced systems operators will tell you that anything that can go wrong will go wrong.
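
To make the one-in-a-million arithmetic concrete, here is a minimal sketch. It assumes that each operation fails independently with the same probability, which real systems only approximate. The function names and the daily operation counts are hypothetical and chosen purely for illustration.

```python
import math


def prob_at_least_one(p: float, n: int) -> float:
    """Chance that an event with per-trial probability p occurs
    at least once in n independent trials."""
    # Equivalent to 1 - (1 - p)**n, computed via log1p/expm1 so the
    # result stays accurate when p is very small.
    return -math.expm1(n * math.log1p(-p))


def expected_occurrences(p: float, n: int) -> float:
    """Expected number of occurrences across n trials."""
    return p * n


if __name__ == "__main__":
    p = 1e-6  # the "one-in-a-million" chance from the text
    # Hypothetical daily operation counts (assumptions, not from the book).
    for ops_per_day in (1_000_000, 10_000_000, 1_000_000_000):
        print(
            ops_per_day,
            expected_occurrences(p, ops_per_day),
            prob_at_least_one(p, ops_per_day),
        )
```

Under these assumptions, a system that performs a million operations a day has roughly a 63% chance of hitting the rare event on any given day, and a system that performs a billion operations a day should expect to hit it about a thousand times.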

Working with distributed systems is also fundamentally different from writing software on a single computer, the main difference being that things can go wrong in lots of new and exciting ways [1, 2]. In this chapter, you will get a taste of the problems that arise in practice and an understanding of what you can and cannot rely on.

To understand the challenges we are up against, we will turn our pessimism up to the maximum and explore the many types of things that may ...

Publisher Resources

ISBN: 9781098119058
