Designing Data-Intensive Applications

Name: Designing Data-Intensive Applications
Author: Martin Kleppmann
ISBN: 9781491903100

by Martin Kleppmann

March 2017

Intermediate to advanced

616 pages

19h 39m

English

O'Reilly Media, Inc.

Audiobook available

Preface
Who Should Read This Book?; Scope of This Book; Outline of This Book; References and Further Reading; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. Foundations of Data Systems
1. Reliable, Scalable, and Maintainable Applications
Thinking About Data Systems; Reliability; Hardware Faults; Software Errors; Human Errors; How Important Is Reliability?; Scalability; Describing Load; Describing Performance; Approaches for Coping with Load; Maintainability; Operability: Making Life Easy for Operations; Simplicity: Managing Complexity; Evolvability: Making Change Easy; Summary
2. Data Models and Query Languages
Relational Model Versus Document Model; The Birth of NoSQL; The Object-Relational Mismatch; Many-to-One and Many-to-Many Relationships; Are Document Databases Repeating History?; Relational Versus Document Databases Today; Query Languages for Data; Declarative Queries on the Web; MapReduce Querying; Graph-Like Data Models; Property Graphs; The Cypher Query Language; Graph Queries in SQL; Triple-Stores and SPARQL; The Foundation: Datalog; Summary
3. Storage and Retrieval
Data Structures That Power Your Database; Hash Indexes; SSTables and LSM-Trees; B-Trees; Comparing B-Trees and LSM-Trees; Other Indexing Structures; Transaction Processing or Analytics?; Data Warehousing; Stars and Snowflakes: Schemas for Analytics; Column-Oriented Storage; Column Compression; Sort Order in Column Storage; Writing to Column-Oriented Storage; Aggregation: Data Cubes and Materialized Views; Summary
4. Encoding and Evolution
Formats for Encoding Data; Language-Specific Formats; JSON, XML, and Binary Variants; Thrift and Protocol Buffers; Avro; The Merits of Schemas; Modes of Dataflow; Dataflow Through Databases; Dataflow Through Services: REST and RPC; Message-Passing Dataflow; Summary
II. Distributed Data
5. Replication
Leaders and Followers; Synchronous Versus Asynchronous Replication; Setting Up New Followers; Handling Node Outages; Implementation of Replication Logs; Problems with Replication Lag; Reading Your Own Writes; Monotonic Reads; Consistent Prefix Reads; Solutions for Replication Lag; Multi-Leader Replication; Use Cases for Multi-Leader Replication; Handling Write Conflicts; Multi-Leader Replication Topologies; Leaderless Replication; Writing to the Database When a Node Is Down; Limitations of Quorum Consistency; Sloppy Quorums and Hinted Handoff; Detecting Concurrent Writes; Summary
6. Partitioning
Partitioning and Replication; Partitioning of Key-Value Data; Partitioning by Key Range; Partitioning by Hash of Key; Skewed Workloads and Relieving Hot Spots; Partitioning and Secondary Indexes; Partitioning Secondary Indexes by Document; Partitioning Secondary Indexes by Term; Rebalancing Partitions; Strategies for Rebalancing; Operations: Automatic or Manual Rebalancing; Request Routing; Parallel Query Execution; Summary
7. Transactions
The Slippery Concept of a Transaction; The Meaning of ACID; Single-Object and Multi-Object Operations; Weak Isolation Levels; Read Committed; Snapshot Isolation and Repeatable Read; Preventing Lost Updates; Write Skew and Phantoms; Serializability; Actual Serial Execution; Two-Phase Locking (2PL); Serializable Snapshot Isolation (SSI); Summary

8. The Trouble with Distributed Systems
Faults and Partial Failures; Cloud Computing and Supercomputing; Unreliable Networks; Network Faults in Practice; Detecting Faults; Timeouts and Unbounded Delays; Synchronous Versus Asynchronous Networks; Unreliable Clocks; Monotonic Versus Time-of-Day Clocks; Clock Synchronization and Accuracy; Relying on Synchronized Clocks; Process Pauses; Knowledge, Truth, and Lies; The Truth Is Defined by the Majority; Byzantine Faults; System Model and Reality; Summary
9. Consistency and Consensus
Consistency Guarantees; Linearizability; What Makes a System Linearizable?; Relying on Linearizability; Implementing Linearizable Systems; The Cost of Linearizability; Ordering Guarantees; Ordering and Causality; Sequence Number Ordering; Total Order Broadcast; Distributed Transactions and Consensus; Atomic Commit and Two-Phase Commit (2PC); Distributed Transactions in Practice; Fault-Tolerant Consensus; Membership and Coordination Services; Summary
III. Derived Data
10. Batch Processing
Batch Processing with Unix Tools; Simple Log Analysis; The Unix Philosophy; MapReduce and Distributed Filesystems; MapReduce Job Execution; Reduce-Side Joins and Grouping; Map-Side Joins; The Output of Batch Workflows; Comparing Hadoop to Distributed Databases; Beyond MapReduce; Materialization of Intermediate State; Graphs and Iterative Processing; High-Level APIs and Languages; Summary
11. Stream Processing
Transmitting Event Streams; Messaging Systems; Partitioned Logs; Databases and Streams; Keeping Systems in Sync; Change Data Capture; Event Sourcing; State, Streams, and Immutability; Processing Streams; Uses of Stream Processing; Reasoning About Time; Stream Joins; Fault Tolerance; Summary
12. The Future of Data Systems
Data Integration; Combining Specialized Tools by Deriving Data; Batch and Stream Processing; Unbundling Databases; Composing Data Storage Technologies; Designing Applications Around Dataflow; Observing Derived State; Aiming for Correctness; The End-to-End Argument for Databases; Enforcing Constraints; Timeliness and Integrity; Trust, but Verify; Doing the Right Thing; Predictive Analytics; Privacy and Tracking; Summary
Glossary
Index

Content preview from Designing Data-Intensive Applications

Part II. Distributed Data

For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.

Richard Feynman, Rogers Commission Report (1986)

In Part I of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in Part II, we move up a level and ask: what happens if multiple machines are involved in storage and retrieval of data?

There are various reasons why you might want to distribute a database across multiple machines:

Scalability: If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.
Fault tolerance/high availability: If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over.
Latency: If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. That avoids the users having to wait for network packets to travel halfway around the world.
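To put a rough number on the latency point, here is a minimal back-of-the-envelope sketch. It assumes signals travel through optical fiber at about 200,000 km/s (roughly two-thirds of the speed of light in a vacuum) and counts only propagation delay, ignoring routing, queueing, and processing. The function name and the two distances are illustrative choices, not figures from the book.

```python
# Rough lower bound on network round-trip time over optical fiber.
# Assumption (not from the book): light in fiber travels at roughly
# two-thirds of its vacuum speed, about 200,000 km/s.
FIBER_SPEED_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Return the propagation-only round-trip time in milliseconds."""
    one_way_s = distance_km / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * 1000.0


# Hypothetical distances: a user halfway around the world (~20,000 km)
# versus a user served from a nearby datacenter (~500 km).
far_ms = min_round_trip_ms(20_000)
near_ms = min_round_trip_ms(500)
print(f"far: {far_ms:.0f} ms, near: {near_ms:.0f} ms")
```

Under these assumptions, the distant user pays on the order of 200 ms per round trip before any server work happens, while the nearby user pays only a few milliseconds. Real-world latencies are higher, since cables do not follow the shortest path and each hop adds delay.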

Scaling to Higher Load

If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine (sometimes called vertical scaling or scaling up). Many CPUs, many RAM chips, and many disks can be joined together ...



Publisher Resources

ISBN: 9781491903063
Errata Page
