
Designing Data-Intensive Applications

Name: Designing Data-Intensive Applications
Author: Martin Kleppmann
ISBN: 9781491903100

by Martin Kleppmann

March 2017

Intermediate to advanced

616 pages

19h 39m

English

O'Reilly Media, Inc.

Audiobook available


Preface
Who Should Read This Book?; Scope of This Book; Outline of This Book; References and Further Reading; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. Foundations of Data Systems
1. Reliable, Scalable, and Maintainable Applications
Thinking About Data Systems; Reliability; Hardware Faults; Software Errors; Human Errors; How Important Is Reliability?; Scalability; Describing Load; Describing Performance; Approaches for Coping with Load; Maintainability; Operability: Making Life Easy for Operations; Simplicity: Managing Complexity; Evolvability: Making Change Easy; Summary
2. Data Models and Query Languages
Relational Model Versus Document Model; The Birth of NoSQL; The Object-Relational Mismatch; Many-to-One and Many-to-Many Relationships; Are Document Databases Repeating History?; Relational Versus Document Databases Today; Query Languages for Data; Declarative Queries on the Web; MapReduce Querying; Graph-Like Data Models; Property Graphs; The Cypher Query Language; Graph Queries in SQL; Triple-Stores and SPARQL; The Foundation: Datalog; Summary
3. Storage and Retrieval
Data Structures That Power Your Database; Hash Indexes; SSTables and LSM-Trees; B-Trees; Comparing B-Trees and LSM-Trees; Other Indexing Structures; Transaction Processing or Analytics?; Data Warehousing; Stars and Snowflakes: Schemas for Analytics; Column-Oriented Storage; Column Compression; Sort Order in Column Storage; Writing to Column-Oriented Storage; Aggregation: Data Cubes and Materialized Views; Summary
4. Encoding and Evolution
Formats for Encoding Data; Language-Specific Formats; JSON, XML, and Binary Variants; Thrift and Protocol Buffers; Avro; The Merits of Schemas; Modes of Dataflow; Dataflow Through Databases; Dataflow Through Services: REST and RPC; Message-Passing Dataflow; Summary
II. Distributed Data
5. Replication
Leaders and Followers; Synchronous Versus Asynchronous Replication; Setting Up New Followers; Handling Node Outages; Implementation of Replication Logs; Problems with Replication Lag; Reading Your Own Writes; Monotonic Reads; Consistent Prefix Reads; Solutions for Replication Lag; Multi-Leader Replication; Use Cases for Multi-Leader Replication; Handling Write Conflicts; Multi-Leader Replication Topologies; Leaderless Replication; Writing to the Database When a Node Is Down; Limitations of Quorum Consistency; Sloppy Quorums and Hinted Handoff; Detecting Concurrent Writes; Summary
6. Partitioning
Partitioning and Replication; Partitioning of Key-Value Data; Partitioning by Key Range; Partitioning by Hash of Key; Skewed Workloads and Relieving Hot Spots; Partitioning and Secondary Indexes; Partitioning Secondary Indexes by Document; Partitioning Secondary Indexes by Term; Rebalancing Partitions; Strategies for Rebalancing; Operations: Automatic or Manual Rebalancing; Request Routing; Parallel Query Execution; Summary
7. Transactions
The Slippery Concept of a Transaction; The Meaning of ACID; Single-Object and Multi-Object Operations; Weak Isolation Levels; Read Committed; Snapshot Isolation and Repeatable Read; Preventing Lost Updates; Write Skew and Phantoms; Serializability; Actual Serial Execution; Two-Phase Locking (2PL); Serializable Snapshot Isolation (SSI); Summary
8. The Trouble with Distributed Systems
Faults and Partial Failures; Cloud Computing and Supercomputing; Unreliable Networks; Network Faults in Practice; Detecting Faults; Timeouts and Unbounded Delays; Synchronous Versus Asynchronous Networks; Unreliable Clocks; Monotonic Versus Time-of-Day Clocks; Clock Synchronization and Accuracy; Relying on Synchronized Clocks; Process Pauses; Knowledge, Truth, and Lies; The Truth Is Defined by the Majority; Byzantine Faults; System Model and Reality; Summary
9. Consistency and Consensus
Consistency Guarantees; Linearizability; What Makes a System Linearizable?; Relying on Linearizability; Implementing Linearizable Systems; The Cost of Linearizability; Ordering Guarantees; Ordering and Causality; Sequence Number Ordering; Total Order Broadcast; Distributed Transactions and Consensus; Atomic Commit and Two-Phase Commit (2PC); Distributed Transactions in Practice; Fault-Tolerant Consensus; Membership and Coordination Services; Summary
III. Derived Data
10. Batch Processing
Batch Processing with Unix Tools; Simple Log Analysis; The Unix Philosophy; MapReduce and Distributed Filesystems; MapReduce Job Execution; Reduce-Side Joins and Grouping; Map-Side Joins; The Output of Batch Workflows; Comparing Hadoop to Distributed Databases; Beyond MapReduce; Materialization of Intermediate State; Graphs and Iterative Processing; High-Level APIs and Languages; Summary
11. Stream Processing
Transmitting Event Streams; Messaging Systems; Partitioned Logs; Databases and Streams; Keeping Systems in Sync; Change Data Capture; Event Sourcing; State, Streams, and Immutability; Processing Streams; Uses of Stream Processing; Reasoning About Time; Stream Joins; Fault Tolerance; Summary
12. The Future of Data Systems
Data Integration; Combining Specialized Tools by Deriving Data; Batch and Stream Processing; Unbundling Databases; Composing Data Storage Technologies; Designing Applications Around Dataflow; Observing Derived State; Aiming for Correctness; The End-to-End Argument for Databases; Enforcing Constraints; Timeliness and Integrity; Trust, but Verify; Doing the Right Thing; Predictive Analytics; Privacy and Tracking; Summary
Glossary
Index

Content preview from Designing Data-Intensive Applications

Preface

If you have worked in software engineering in recent years, especially in server-side and backend systems, you have probably been bombarded with a plethora of buzzwords relating to storage and processing of data. NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time!

In the last decade we have seen many interesting developments in databases, in distributed systems, and in the ways we build applications on top of them. There are various driving forces for these developments:

Internet companies such as Google, Microsoft, Amazon, Facebook, LinkedIn, Netflix, and Twitter are handling huge volumes of data and traffic, forcing them to create new tools that enable them to efficiently handle such scale.
Businesses need to be agile, test hypotheses cheaply, and respond quickly to new market insights by keeping development cycles short and data models flexible.
Free and open source software has become very successful and is now preferred to commercial or bespoke in-house software in many environments.
CPU clock speeds are barely increasing, but multi-core processors are standard, and networks are getting faster. This means parallelism is only going to increase.
Even if you work on a small team, you can now build systems that are distributed across many machines and even multiple geographic regions, thanks to infrastructure as a service (IaaS) such as Amazon Web Services.
Many services are now expected to be ...


Designing Data-Intensive Applications, 2nd Edition

Publisher Resources

ISBN: 9781491903063

Errata Page
