
Designing Data-Intensive Applications

Author: Martin Kleppmann
ISBN: 9781491903100


March 2017

Intermediate to advanced

616 pages

19h 39m

English

O'Reilly Media, Inc.

Audiobook available


Preface
Who Should Read This Book?; Scope of This Book; Outline of This Book; References and Further Reading; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. Foundations of Data Systems
1. Reliable, Scalable, and Maintainable Applications
Thinking About Data Systems; Reliability; Hardware Faults; Software Errors; Human Errors; How Important Is Reliability?; Scalability; Describing Load; Describing Performance; Approaches for Coping with Load; Maintainability; Operability: Making Life Easy for Operations; Simplicity: Managing Complexity; Evolvability: Making Change Easy; Summary
2. Data Models and Query Languages
Relational Model Versus Document Model; The Birth of NoSQL; The Object-Relational Mismatch; Many-to-One and Many-to-Many Relationships; Are Document Databases Repeating History?; Relational Versus Document Databases Today; Query Languages for Data; Declarative Queries on the Web; MapReduce Querying; Graph-Like Data Models; Property Graphs; The Cypher Query Language; Graph Queries in SQL; Triple-Stores and SPARQL; The Foundation: Datalog; Summary
3. Storage and Retrieval
Data Structures That Power Your Database; Hash Indexes; SSTables and LSM-Trees; B-Trees; Comparing B-Trees and LSM-Trees; Other Indexing Structures; Transaction Processing or Analytics?; Data Warehousing; Stars and Snowflakes: Schemas for Analytics; Column-Oriented Storage; Column Compression; Sort Order in Column Storage; Writing to Column-Oriented Storage; Aggregation: Data Cubes and Materialized Views; Summary
4. Encoding and Evolution
Formats for Encoding Data; Language-Specific Formats; JSON, XML, and Binary Variants; Thrift and Protocol Buffers; Avro; The Merits of Schemas; Modes of Dataflow; Dataflow Through Databases; Dataflow Through Services: REST and RPC; Message-Passing Dataflow; Summary
II. Distributed Data
5. Replication
Leaders and Followers; Synchronous Versus Asynchronous Replication; Setting Up New Followers; Handling Node Outages; Implementation of Replication Logs; Problems with Replication Lag; Reading Your Own Writes; Monotonic Reads; Consistent Prefix Reads; Solutions for Replication Lag; Multi-Leader Replication; Use Cases for Multi-Leader Replication; Handling Write Conflicts; Multi-Leader Replication Topologies; Leaderless Replication; Writing to the Database When a Node Is Down; Limitations of Quorum Consistency; Sloppy Quorums and Hinted Handoff; Detecting Concurrent Writes; Summary
6. Partitioning
Partitioning and Replication; Partitioning of Key-Value Data; Partitioning by Key Range; Partitioning by Hash of Key; Skewed Workloads and Relieving Hot Spots; Partitioning and Secondary Indexes; Partitioning Secondary Indexes by Document; Partitioning Secondary Indexes by Term; Rebalancing Partitions; Strategies for Rebalancing; Operations: Automatic or Manual Rebalancing; Request Routing; Parallel Query Execution; Summary
7. Transactions
The Slippery Concept of a Transaction; The Meaning of ACID; Single-Object and Multi-Object Operations; Weak Isolation Levels; Read Committed; Snapshot Isolation and Repeatable Read; Preventing Lost Updates; Write Skew and Phantoms; Serializability; Actual Serial Execution; Two-Phase Locking (2PL); Serializable Snapshot Isolation (SSI); Summary

8. The Trouble with Distributed Systems
Faults and Partial Failures; Cloud Computing and Supercomputing; Unreliable Networks; Network Faults in Practice; Detecting Faults; Timeouts and Unbounded Delays; Synchronous Versus Asynchronous Networks; Unreliable Clocks; Monotonic Versus Time-of-Day Clocks; Clock Synchronization and Accuracy; Relying on Synchronized Clocks; Process Pauses; Knowledge, Truth, and Lies; The Truth Is Defined by the Majority; Byzantine Faults; System Model and Reality; Summary
9. Consistency and Consensus
Consistency Guarantees; Linearizability; What Makes a System Linearizable?; Relying on Linearizability; Implementing Linearizable Systems; The Cost of Linearizability; Ordering Guarantees; Ordering and Causality; Sequence Number Ordering; Total Order Broadcast; Distributed Transactions and Consensus; Atomic Commit and Two-Phase Commit (2PC); Distributed Transactions in Practice; Fault-Tolerant Consensus; Membership and Coordination Services; Summary
III. Derived Data
10. Batch Processing
Batch Processing with Unix Tools; Simple Log Analysis; The Unix Philosophy; MapReduce and Distributed Filesystems; MapReduce Job Execution; Reduce-Side Joins and Grouping; Map-Side Joins; The Output of Batch Workflows; Comparing Hadoop to Distributed Databases; Beyond MapReduce; Materialization of Intermediate State; Graphs and Iterative Processing; High-Level APIs and Languages; Summary
11. Stream Processing
Transmitting Event Streams; Messaging Systems; Partitioned Logs; Databases and Streams; Keeping Systems in Sync; Change Data Capture; Event Sourcing; State, Streams, and Immutability; Processing Streams; Uses of Stream Processing; Reasoning About Time; Stream Joins; Fault Tolerance; Summary
12. The Future of Data Systems
Data Integration; Combining Specialized Tools by Deriving Data; Batch and Stream Processing; Unbundling Databases; Composing Data Storage Technologies; Designing Applications Around Dataflow; Observing Derived State; Aiming for Correctness; The End-to-End Argument for Databases; Enforcing Constraints; Timeliness and Integrity; Trust, but Verify; Doing the Right Thing; Predictive Analytics; Privacy and Tracking; Summary
Glossary
Index

Content preview from Designing Data-Intensive Applications

Chapter 11. Stream Processing

A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.

John Gall, Systemantics (1975)

In Chapter 10 we discussed batch processing—techniques that read a set of files as input and produce a new set of output files. The output is a form of derived data; that is, a dataset that can be recreated by running the batch process again if necessary. We saw how this simple but powerful idea can be used to create search indexes, recommendation systems, analytics, and more.

However, one big assumption remained throughout Chapter 10: namely, that the input is bounded—i.e., of a known and finite size—so the batch process knows when it has finished reading its input. For example, the sorting operation that is central to MapReduce must read its entire input before it can start producing output: it could happen that the very last input record is the one with the lowest key, and thus needs to be the very first output record, so starting the output early is not an option.
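
The point about sorting can be illustrated with a minimal Python sketch. It is not taken from the book; the function names `sorted_output` and `running_max` are hypothetical and chosen only for illustration. The first function has to read its whole input before it can return anything, while the second emits a result after every record and so also works on input that never ends.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def sorted_output(records: Iterable[int]) -> List[int]:
    """Batch style: the whole (bounded) input is consumed before any output exists."""
    return sorted(records)


def running_max(records: Iterable[int]) -> Iterator[int]:
    """Stream style: yields the largest value seen so far after each record."""
    best = None
    for record in records:
        if best is None or record > best:
            best = record
        yield best


if __name__ == "__main__":
    data = [5, 3, 8, 1]  # the last record has the lowest key
    # The first sorted output record is the last input record,
    # so no output could have been produced early.
    print(sorted_output(data))
    # The running maximum is available after every single record.
    print(list(running_max(data)))
    # It also works on an unbounded input; we just stop reading after 3 results.
    print(list(islice(running_max(count(1)), 3)))
```

Calling `sorted_output(count(1))` would never return, because the input has no end. That is the bounded-input assumption the paragraph describes.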

In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ...


Designing Data-Intensive Applications, 2nd Edition

Publisher Resources

ISBN: 9781491903063
