Designing Data-Intensive Applications

Name: Designing Data-Intensive Applications
Author: Martin Kleppmann
ISBN: 9781491903100

March 2017

Intermediate to advanced

616 pages

19h 39m

English

O'Reilly Media, Inc.

Audiobook available

Preface
Who Should Read This Book? · Scope of This Book · Outline of This Book · References and Further Reading · O’Reilly Online Learning · How to Contact Us · Acknowledgments
I. Foundations of Data Systems
1. Reliable, Scalable, and Maintainable Applications
Thinking About Data Systems · Reliability · Hardware Faults · Software Errors · Human Errors · How Important Is Reliability? · Scalability · Describing Load · Describing Performance · Approaches for Coping with Load · Maintainability · Operability: Making Life Easy for Operations · Simplicity: Managing Complexity · Evolvability: Making Change Easy · Summary
2. Data Models and Query Languages
Relational Model Versus Document Model · The Birth of NoSQL · The Object-Relational Mismatch · Many-to-One and Many-to-Many Relationships · Are Document Databases Repeating History? · Relational Versus Document Databases Today · Query Languages for Data · Declarative Queries on the Web · MapReduce Querying · Graph-Like Data Models · Property Graphs · The Cypher Query Language · Graph Queries in SQL · Triple-Stores and SPARQL · The Foundation: Datalog · Summary
3. Storage and Retrieval
Data Structures That Power Your Database · Hash Indexes · SSTables and LSM-Trees · B-Trees · Comparing B-Trees and LSM-Trees · Other Indexing Structures · Transaction Processing or Analytics? · Data Warehousing · Stars and Snowflakes: Schemas for Analytics · Column-Oriented Storage · Column Compression · Sort Order in Column Storage · Writing to Column-Oriented Storage · Aggregation: Data Cubes and Materialized Views · Summary
4. Encoding and Evolution
Formats for Encoding Data · Language-Specific Formats · JSON, XML, and Binary Variants · Thrift and Protocol Buffers · Avro · The Merits of Schemas · Modes of Dataflow · Dataflow Through Databases · Dataflow Through Services: REST and RPC · Message-Passing Dataflow · Summary
II. Distributed Data
5. Replication
Leaders and Followers · Synchronous Versus Asynchronous Replication · Setting Up New Followers · Handling Node Outages · Implementation of Replication Logs · Problems with Replication Lag · Reading Your Own Writes · Monotonic Reads · Consistent Prefix Reads · Solutions for Replication Lag · Multi-Leader Replication · Use Cases for Multi-Leader Replication · Handling Write Conflicts · Multi-Leader Replication Topologies · Leaderless Replication · Writing to the Database When a Node Is Down · Limitations of Quorum Consistency · Sloppy Quorums and Hinted Handoff · Detecting Concurrent Writes · Summary
6. Partitioning
Partitioning and Replication · Partitioning of Key-Value Data · Partitioning by Key Range · Partitioning by Hash of Key · Skewed Workloads and Relieving Hot Spots · Partitioning and Secondary Indexes · Partitioning Secondary Indexes by Document · Partitioning Secondary Indexes by Term · Rebalancing Partitions · Strategies for Rebalancing · Operations: Automatic or Manual Rebalancing · Request Routing · Parallel Query Execution · Summary
7. Transactions
The Slippery Concept of a Transaction · The Meaning of ACID · Single-Object and Multi-Object Operations · Weak Isolation Levels · Read Committed · Snapshot Isolation and Repeatable Read · Preventing Lost Updates · Write Skew and Phantoms · Serializability · Actual Serial Execution · Two-Phase Locking (2PL) · Serializable Snapshot Isolation (SSI) · Summary

8. The Trouble with Distributed Systems
Faults and Partial Failures · Cloud Computing and Supercomputing · Unreliable Networks · Network Faults in Practice · Detecting Faults · Timeouts and Unbounded Delays · Synchronous Versus Asynchronous Networks · Unreliable Clocks · Monotonic Versus Time-of-Day Clocks · Clock Synchronization and Accuracy · Relying on Synchronized Clocks · Process Pauses · Knowledge, Truth, and Lies · The Truth Is Defined by the Majority · Byzantine Faults · System Model and Reality · Summary
9. Consistency and Consensus
Consistency Guarantees · Linearizability · What Makes a System Linearizable? · Relying on Linearizability · Implementing Linearizable Systems · The Cost of Linearizability · Ordering Guarantees · Ordering and Causality · Sequence Number Ordering · Total Order Broadcast · Distributed Transactions and Consensus · Atomic Commit and Two-Phase Commit (2PC) · Distributed Transactions in Practice · Fault-Tolerant Consensus · Membership and Coordination Services · Summary
III. Derived Data
10. Batch Processing
Batch Processing with Unix Tools · Simple Log Analysis · The Unix Philosophy · MapReduce and Distributed Filesystems · MapReduce Job Execution · Reduce-Side Joins and Grouping · Map-Side Joins · The Output of Batch Workflows · Comparing Hadoop to Distributed Databases · Beyond MapReduce · Materialization of Intermediate State · Graphs and Iterative Processing · High-Level APIs and Languages · Summary
11. Stream Processing
Transmitting Event Streams · Messaging Systems · Partitioned Logs · Databases and Streams · Keeping Systems in Sync · Change Data Capture · Event Sourcing · State, Streams, and Immutability · Processing Streams · Uses of Stream Processing · Reasoning About Time · Stream Joins · Fault Tolerance · Summary
12. The Future of Data Systems
Data Integration · Combining Specialized Tools by Deriving Data · Batch and Stream Processing · Unbundling Databases · Composing Data Storage Technologies · Designing Applications Around Dataflow · Observing Derived State · Aiming for Correctness · The End-to-End Argument for Databases · Enforcing Constraints · Timeliness and Integrity · Trust, but Verify · Doing the Right Thing · Predictive Analytics · Privacy and Tracking · Summary
Glossary
Index

Content preview from Designing Data-Intensive Applications

Part III. Derived Data

In Parts I and II of this book, we assembled from the ground up all the major considerations that go into a distributed database, from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, this discussion assumed that there was only one database in the application.

In reality, data systems are often more complex. In a large application you often need to be able to access and process data in many different ways, and there is no one database that can satisfy all those different needs simultaneously. Applications thus commonly use a combination of several different datastores, indexes, caches, analytics systems, etc., and implement mechanisms for moving data from one store to another.

In this final part of the book, we will examine the issues around integrating multiple different data systems, potentially with different data models and optimized for different access patterns, into one coherent application architecture. This aspect of system-building is often overlooked by vendors who claim that their product can satisfy all your needs. In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application.

Systems of Record and Derived Data

At a high level, systems that store and process data can be grouped into two broad categories:

Systems of record: A system of record, also known as source of truth, holds the authoritative version ...

Publisher Resources

ISBN: 9781491903063

Errata Page
