Designing Data-Intensive Applications

Name: Designing Data-Intensive Applications
Author: Martin Kleppmann
ISBN: 9781491903100

March 2017

Intermediate to advanced

616 pages

19h 39m

English

O'Reilly Media, Inc.

Preface
  Who Should Read This Book?
  Scope of This Book
  Outline of This Book
  References and Further Reading
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
I. Foundations of Data Systems
1. Reliable, Scalable, and Maintainable Applications
  Thinking About Data Systems
  Reliability
  Hardware Faults
  Software Errors
  Human Errors
  How Important Is Reliability?
  Scalability
  Describing Load
  Describing Performance
  Approaches for Coping with Load
  Maintainability
  Operability: Making Life Easy for Operations
  Simplicity: Managing Complexity
  Evolvability: Making Change Easy
  Summary
2. Data Models and Query Languages
  Relational Model Versus Document Model
  The Birth of NoSQL
  The Object-Relational Mismatch
  Many-to-One and Many-to-Many Relationships
  Are Document Databases Repeating History?
  Relational Versus Document Databases Today
  Query Languages for Data
  Declarative Queries on the Web
  MapReduce Querying
  Graph-Like Data Models
  Property Graphs
  The Cypher Query Language
  Graph Queries in SQL
  Triple-Stores and SPARQL
  The Foundation: Datalog
  Summary
3. Storage and Retrieval
  Data Structures That Power Your Database
  Hash Indexes
  SSTables and LSM-Trees
  B-Trees
  Comparing B-Trees and LSM-Trees
  Other Indexing Structures
  Transaction Processing or Analytics?
  Data Warehousing
  Stars and Snowflakes: Schemas for Analytics
  Column-Oriented Storage
  Column Compression
  Sort Order in Column Storage
  Writing to Column-Oriented Storage
  Aggregation: Data Cubes and Materialized Views
  Summary
4. Encoding and Evolution
  Formats for Encoding Data
  Language-Specific Formats
  JSON, XML, and Binary Variants
  Thrift and Protocol Buffers
  Avro
  The Merits of Schemas
  Modes of Dataflow
  Dataflow Through Databases
  Dataflow Through Services: REST and RPC
  Message-Passing Dataflow
  Summary
II. Distributed Data
5. Replication
  Leaders and Followers
  Synchronous Versus Asynchronous Replication
  Setting Up New Followers
  Handling Node Outages
  Implementation of Replication Logs
  Problems with Replication Lag
  Reading Your Own Writes
  Monotonic Reads
  Consistent Prefix Reads
  Solutions for Replication Lag
  Multi-Leader Replication
  Use Cases for Multi-Leader Replication
  Handling Write Conflicts
  Multi-Leader Replication Topologies
  Leaderless Replication
  Writing to the Database When a Node Is Down
  Limitations of Quorum Consistency
  Sloppy Quorums and Hinted Handoff
  Detecting Concurrent Writes
  Summary
6. Partitioning
  Partitioning and Replication
  Partitioning of Key-Value Data
  Partitioning by Key Range
  Partitioning by Hash of Key
  Skewed Workloads and Relieving Hot Spots
  Partitioning and Secondary Indexes
  Partitioning Secondary Indexes by Document
  Partitioning Secondary Indexes by Term
  Rebalancing Partitions
  Strategies for Rebalancing
  Operations: Automatic or Manual Rebalancing
  Request Routing
  Parallel Query Execution
  Summary
7. Transactions
  The Slippery Concept of a Transaction
  The Meaning of ACID
  Single-Object and Multi-Object Operations
  Weak Isolation Levels
  Read Committed
  Snapshot Isolation and Repeatable Read
  Preventing Lost Updates
  Write Skew and Phantoms
  Serializability
  Actual Serial Execution
  Two-Phase Locking (2PL)
  Serializable Snapshot Isolation (SSI)
  Summary

8. The Trouble with Distributed Systems
  Faults and Partial Failures
  Cloud Computing and Supercomputing
  Unreliable Networks
  Network Faults in Practice
  Detecting Faults
  Timeouts and Unbounded Delays
  Synchronous Versus Asynchronous Networks
  Unreliable Clocks
  Monotonic Versus Time-of-Day Clocks
  Clock Synchronization and Accuracy
  Relying on Synchronized Clocks
  Process Pauses
  Knowledge, Truth, and Lies
  The Truth Is Defined by the Majority
  Byzantine Faults
  System Model and Reality
  Summary
9. Consistency and Consensus
  Consistency Guarantees
  Linearizability
  What Makes a System Linearizable?
  Relying on Linearizability
  Implementing Linearizable Systems
  The Cost of Linearizability
  Ordering Guarantees
  Ordering and Causality
  Sequence Number Ordering
  Total Order Broadcast
  Distributed Transactions and Consensus
  Atomic Commit and Two-Phase Commit (2PC)
  Distributed Transactions in Practice
  Fault-Tolerant Consensus
  Membership and Coordination Services
  Summary
III. Derived Data
10. Batch Processing
  Batch Processing with Unix Tools
  Simple Log Analysis
  The Unix Philosophy
  MapReduce and Distributed Filesystems
  MapReduce Job Execution
  Reduce-Side Joins and Grouping
  Map-Side Joins
  The Output of Batch Workflows
  Comparing Hadoop to Distributed Databases
  Beyond MapReduce
  Materialization of Intermediate State
  Graphs and Iterative Processing
  High-Level APIs and Languages
  Summary
11. Stream Processing
  Transmitting Event Streams
  Messaging Systems
  Partitioned Logs
  Databases and Streams
  Keeping Systems in Sync
  Change Data Capture
  Event Sourcing
  State, Streams, and Immutability
  Processing Streams
  Uses of Stream Processing
  Reasoning About Time
  Stream Joins
  Fault Tolerance
  Summary
12. The Future of Data Systems
  Data Integration
  Combining Specialized Tools by Deriving Data
  Batch and Stream Processing
  Unbundling Databases
  Composing Data Storage Technologies
  Designing Applications Around Dataflow
  Observing Derived State
  Aiming for Correctness
  The End-to-End Argument for Databases
  Enforcing Constraints
  Timeliness and Integrity
  Trust, but Verify
  Doing the Right Thing
  Predictive Analytics
  Privacy and Tracking
  Summary
Glossary
Index

Content preview from Designing Data-Intensive Applications

Part I. Foundations of Data Systems

The first four chapters go through the fundamental ideas that apply to all data systems, whether running on a single machine or distributed across a cluster of machines:

Chapter 1 introduces the terminology and approach that we’re going to use throughout this book. It examines what we actually mean by words like reliability, scalability, and maintainability, and how we can try to achieve these goals.
Chapter 2 compares several different data models and query languages—the most visible distinguishing factor between databases from a developer’s point of view. We will see how different models are appropriate to different situations.
Chapter 3 turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance.
Chapter 4 compares various formats for data encoding (serialization) and especially examines how they fare in an environment where application requirements change and schemas need to adapt over time.

Later, Part II will turn to the particular issues of distributed data systems.

Publisher Resources

ISBN: 9781491903063