Errata

Errata for Designing Data-Intensive Applications


The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.


Version	Location	Description	Submitted By	Date submitted	Date corrected
	ch04 "Dynamically Generated Schemas", 2nd paragraph	In the text below: [...] problems with textual formats (JSON, CSV, SQL) "SQL" is obviously not a textual format. In the context, the author was probably referring to "XML". The resulting fixed line would be: [...] problems with textual formats (JSON, CSV, XML) Note from the Author or Editor: Erratum is correct, I have corrected the text in Atlas	Punleuk Oum	Apr 30, 2018	Jun 01, 2018
	ch04 "code generation and dynamically typed languages", 3rd paragraph	"[...] code generation is an unnecessarily obstacle to getting to the data." -> "[...] code generation is an unnecessary obstacle to getting to the data." Note from the Author or Editor: I have made this change in Atlas	Punleuk Oum	Apr 30, 2018	Jun 01, 2018
	ch 6 references	Reference [11] Andrew Wang: “Windows Azure Storage,” umbrant.com, February 4, 2016. should link to https://www.umbrant.com/2016/02/04/windows-azure-storage/ Note from the Author or Editor: URL of the blog post has changed. We're updating it to https://www.umbrant.com/2016/02/04/windows-azure-storage/	David Waller	Oct 01, 2018	Nov 21, 2018
	Ch 11 references	Reference [18] Jay Kreps, Neha Narkhede, and Jun Rao: “Kafka: A Distributed Messaging System for Log Processing,” is no longer available at that URL. Suggested alternative: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf Note from the Author or Editor: I have updated the URL in Atlas and on https://github.com/ept/ddia-references	David Waller	Nov 30, 2018	Mar 15, 2019
	Page Chapter 3 p. 71, 75	At various places in Chapter 3, the book talks about appending to file being very efficient as compared to in-place update only. p. 71: "... appending to a file is generally very efficient." p. 71 "It's hard to beat the performance of simply appending to a file, ..." p. 75 "Appending ... are sequential write operations." However, the chapter fails to explain the reason why appending is fast. I was thinking that both appending and updating in-place would need single disk-seek. So, both should take same time. On further research, I found that appending to file is efficient because OS would buffer multiple appends and then later write a lot of data at once. It would be helpful to mention this point. Also, good to mention that such buffering would impact durability of data that is in buffer, but not yet written to disk. Note from the Author or Editor: Correct: when there are many append operations, their data is consecutive in the file, and so they can be written out sequentially in fewer I/O operations than if the writes are scattered around many different locations on disk. We will clarify this in the second edition.	Nehchal Jindal	Mar 03, 2023
PDF	Page x Top	New types of database [system] (“NoSQL”) have been getting lots of attention, but message queues, caches, search indexes, frameworks s/b systems Note from the Author or Editor: Fixed in next Early Release update.	Anonymous	Aug 10, 2015	Mar 01, 2017
	Chapter 1	In this Chapter 1, we will start by exploring the fundamentals of what we are trying to achieve: reliabile, scalabile and maintainable data systems reliabile -> reliable scalabile -> scalable Note from the Author or Editor: Fixed in next Early Release update.	Sascha Gottfried	Sep 23, 2015	Mar 01, 2017
	Ch 4 Chapter 4, section CODE GENERATION AND DYNAMICALLY TYPED LANGUAGES	compilation is written with a typo as compliation. Code generation is often frowned upon in these languages, since they otherwise avoid an explicit compliation step. Note from the Author or Editor: Fixed in next early release update.	Philippe Derome	May 23, 2016	Mar 01, 2017
	Ch 4	The following choice of words feels awkward (unpack): In the rest of this chapter we will unpack some of the most common ways how data flows between processes: It would seem that reveal, show, or describe would be a more common fit than unpack. Note from the Author or Editor: Fixed in next early release update.	Philippe Derome	May 23, 2016	Mar 01, 2017
	5 Chapter 8	In the sub-section "The truth is defined by the majority" of section "Knowledge, Truth and Lies", a typo in the paragraph below figure 8-5: However, the storage server rembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33. "rembers " -> "remembers"? Note from the Author or Editor: Fixed in next early release update.	Anonymous	Nov 10, 2016	Mar 01, 2017
PDF	Page 7 2nd Paragraph	"for example, randomly killing individual processes without warning — is known as chaos monkey " I don't think that it is correct to use 'chaos monkey' as an umbrella term in this context; chaos monkey is a software application developed to perform that task, not a term with that meaning. Note from the Author or Editor: Fixed in next Early Release update.	blankenshipz	Jun 30, 2015	Mar 01, 2017
PDF	Page 15 Box 'Percentiles in Practice', First paragraph, last sentence	"As it takes just one slow call to make the entire end-user request slow, rare slow calls to the backend become much more frequent at the end-user request level" should probably be reworded. 1. Saying "the backend" can be misleading, and is technically inaccurate. It's multiple backends. It's only with multiple backends that this increase in frequency of slow end-user requests makes sense. 2. It would be more technically accurate and better to say that it is the collective frequency of any slow call to any backend that increases. The slow calls themselves do not exactly increase in frequency.	Jeffrey 'jf' Lim	Dec 19, 2015	Mar 01, 2017
PDF	Page 26 2nd paragraph	"data model" should be pluralized in 'There are many different kinds of data model' (should be 'There are many different kinds of data models')	Jeffrey 'jf' Lim	Dec 19, 2015	Mar 01, 2017
Printed	Page 50 1st paragraph	"Besides these, there are also imperative graph query languages such as Gremlin..." I believe that Gremlin supports both imperative and declarative traversals. The wikipedia page is actually a useful reference here: https://en.wikipedia.org/wiki/Gremlin_(programming_language) Note from the Author or Editor: Correct, the declarative features seem to have been added to Gremlin since I last looked at it. I will reword this sentence to avoid the confusion.	Jeff Carpenter	Dec 19, 2017	Mar 16, 2018
	53 Designing Data-Intensive Applications Chapter 2. The Battle of the Data Models	... where the the database ... better written with just one 'the' Note from the Author or Editor: Fixed in next Early Release update.	Klaus Ita	Aug 06, 2015	Mar 01, 2017
PDF	Page 53 last paragraph	The following sentence has a typo: "In a graph database, there are is no such restriction: any vertex can have an edge to any other vertex." It should be: "In a graph database, there is no such restriction: any vertex can have an edge to any other vertex."	Slavcho Slavchev	Jan 23, 2016	Mar 01, 2017
PDF	Page 73 last paragraph of this page	"The merging and compaction of frozen segments can be done in a background thread...continue to serve read and write requests as normal, using the old segment files" what does frozen mean is a little vague. How to write old segment files when it has been frozen or close? Note from the Author or Editor: Rephrasing this sentence to be clearer in the next update.	yuxh	May 04, 2019	Aug 09, 2019
PDF	Page 76 2nd-3rd paragraphs	TLDR for the below comments: The second and third paragraphs downplay the differences with Bitcask, which was pretty confusing to me at first. Unless I'm mistaken, this sentence ("We also require that each key only appears once within each merged segment file (the compaction process already ensures that).") would be more accurately/helpfully written ("We also require that each key only appears once within each segment file. Incoming keys are consolidated by a tree structure that we will discuss shortly.") This resolves my first confusion, because I thought you were implying that segment files could have multiple entries per key until they were merged. Also, the sentence "At first glance, that requirement seems to break our ability to use sequential writes, but we’ll get to that in a moment" might be more helpfully written "This means that we cannot append new keys directly to the segments (as in Bitcask) because we can only have one entry per key/segment. However, creation of new segment files is still performed using sequential writes, as we will show in a moment." This resolves my second confusion, because I thought you were implying that you would show how all new data could still be written directly to the segments. Note from the Author or Editor: Thanks for the suggested wording improvements. In the next update of the book I will tweak the wording to avoid this confusion.	Stephen Dewey	Sep 10, 2018	Nov 21, 2018
	82 3rd paragraph from the bottom	Swagger is mentioned as a RESTful APIs description language, however this information is not exactly correct nor full. Swagger is an old name (since Nov 5 2015, however still used informally). The current official name is "OpenAPI"(https://www.openapis.org). Swagger is an API documentation tool and even though it is designed for RESTful APIs, it is also used as an interactive documentation tool for other types of HTTP APIs. Moreover Swagger/Open API is not the only tool for API documentation and design. Other popular tools include: * RAML (http://raml.org/) * API Blueprint (https://apiblueprint.org) Note from the Author or Editor: Making an appropriate change in QC1 review, to be included in QC2.	Andrzej Jarzyna	Feb 03, 2017	Mar 01, 2017
PDF	Page 87 3rd paragraph	"increasinly" should be "increasingly" Note from the Author or Editor: Fixed in next Early Release update.	Greg Nofi	Nov 05, 2015	Mar 01, 2017
	Page 107 of 802 The first paragraph under "B-Trees" heading	This paragraph (and elsewhere in the book) uses the term "log-structured indexes". I initially found this term a bit confusing, as log-structured storage engines often use an in-memory index - ie the index isn't log-structured itself, it is only applied to a log-structured database. IIUC, you perhaps mean "indices used by log-structured storage engines" instead. Note from the Author or Editor: Thanks for highlighting this. I actually meant not the index used internally by a log-structured storage engine, but rather a database index that is implemented using an LSM-tree (as opposed to B-tree) approach. I will reword this in a future revision to avoid this point of confusion.	Apurva Chitnis	Nov 23, 2021
PDF	Page 107 line 5	The text reads "From time to time to time". I think this is a typo for "From time to time". Note from the Author or Editor: Remove spurious "to time".	DaeMyung Kang	Dec 12, 2014	Mar 01, 2017
PDF	Page 137 Under heading	There are two common ways how data is distributed across multiple nodes: /del "how" Note from the Author or Editor: Fixed in next Early Release update.	Anonymous	Aug 10, 2015	Mar 01, 2017
PDF	Page 138 "Distributed actor frameworks" section	The "Distributed Actor Frameworks" section is missing important background information. It doesn't really describe why we would want to use such a framework, and it doesn't explain how the frameworks can still be useful despite the potential for lost messages. To make this section useful, I think it would be worth adding a paragraph or two to address these points. Note from the Author or Editor: [No change in this edition] We have noted this suggestion and will take it into account when preparing a second edition of the book.	Stephen Dewey	Sep 17, 2018	Nov 21, 2018
PDF	Page 190-191 Final paragraph ("However, if you want to allow...")	It would help to address how tombstones help with deletes during concurrent writes (not just how it helps with cleaning up siblings after the fact). In the shopping cart example, if the 4th write was "delete milk, delete eggs, add ham" and a tombstone was added indicating that milk and eggs were deleted at version 4, you would still have milk and eggs coming back in the next write at version 5 (based on version 3). The question then is whether the database assumes that milk and eggs were only included in version 5 because they were part of version 3 (in which case it could delete them now) or whether the database assumes the user was reaffirming that they wanted milk and eggs (in which case the new write should overwrite the tombstone). It doesn't seem like there's an easy answer because there isn't enough information to really know what the intent was. Note from the Author or Editor: [No change in this edition] We have noted this suggestion and will take it into account when preparing a second edition of the book.	Stephen Dewey	Oct 15, 2018	Nov 21, 2018
	Page 195 Ref [28]	The referenced blog post by Robert Hodges has been published on April 30, 2012 but the text reads March instead. Note from the Author or Editor: Correct. Eagle-eyed observation!	Lucio Assis	Apr 12, 2023
Printed	Page 202 2nd paragraph	After figure 6-2, the text states that Volume 12 of the pictured encyclopedia (Trudeau - Zywiec) contains "words starting with T, U, V, X, Y, and Z." However, assuming that the encyclopedia uses the English alphabet, it would also contain words starting with W.	Milo Price	Dec 28, 2017	Mar 16, 2018
Printed	Page 203 5th paragraph	Book states "Cassandra and MongoDB use MD5", Cassandra uses murmur3 hashing though. Note from the Author or Editor: Cassandra prior to version 1.2 used MD5, and version 1.2 switched to using Murmur3 by default. I will clarify this in the text.	Ulf Gitschthaler	Jun 26, 2017	Mar 16, 2018
ePub	Page 222 2nd paragraph	"each partitions maintains..." should be "each partition maintains..." Note from the Author or Editor: Fixed in next Early Release update.	Anonymous	Oct 29, 2015	Mar 01, 2017
Printed	Page 227 Citation 19	Re: SSDs losing power in just weeks in unusual temps. The citation does say this, but itself refers to a presentation slide that JEDEC has called misunderstood: https://www.jedec.org/news/pressreleases/jedec-update-solid-state-drive-standard While the point is certainly valid that SSDs can lose data in storage, the very short time frames given are talking about EOL'ed enterprise drives. Perhaps a footnote would help for expanding on this alarming statistic. Excellent book by the way, really enjoying it! Note from the Author or Editor: Thank you for pointing this out; I will update the wording to clarify this point in the next update of the book.	Corey Sciuto	Aug 26, 2018	Nov 21, 2018
PDF	Page 241 full page prior to "Indexes and snapshot isolation"	If you have the time, I'm wondering if you can shed some light on this. I found the discussion in this section to be very confusing. I think the source of my confusion is that you haven't explained how, if at all, uncommitted rows are kept separate from the list of committed rows in the object version lists that you have shown. Reading between the lines I think the answer is that they are NOT kept separate at all, so an uncommitted write goes immediately into the same list as committed transactions list. Is that true? That would explain why in rule #1 on page 241 you say writes by transactions which were running at the beginning a snapshot transaction are ignored "even if" any of those writes commits. You say "even if" because a transaction could also see the uncommitted writes of earlier transactions that are still running. It needs the list of "transactions that were running when I started" to know that either of the following is true: 1) this transaction is still running and was running when I started, 2) this transaction has committed or aborted (but isn't cleaned up yet) and was running when I started. In both cases it ignores the row. #2 is also confusing because it seems like a superfluous rule after #1 and #3. Does a transaction need some mechanism to determine "rows from aborted transactions" in addition to rules 1+3? The only way I can think to resolve this is that the assumption in my second paragraph is correct (uncommitted rows are not kept separate) and that additionally, it takes some time to clear aborted uncommitted rows from the object version list (and to unmark objects as deleted which were deleted by aborted transactions). Therefore the transaction needs some second list of "aborted transactions which were not cleared when I started" so it can know to ignore them. Is this true? 
The two paragraphs on page 239 from "To implement snapshot isolation" ending in "for an entire transaction" are also pretty confusing. In the first paragraph you say that it's a generalization of the earlier mechanism (which wasn't fully explained, you just said that "any writes by a transaction only become visible to others when that transaction commits"). But then in the second paragraph you imply that MVCC is effectively a distinct mechanism. But the bigger confusion is with that final sentence ("A typical approach"). Why would it make sense to ever base MVCC on a single query? Even read committed is done at the transaction level, not the query level. Thanks in advance for any clarification you can provide. Note from the Author or Editor: [No change in this edition] We have noted this suggestion and will take it into account when preparing a second edition of the book.	Stephen Dewey	Sep 28, 2018	Nov 21, 2018
PDF	Page 242 first two paragraphs	Similar to my earlier question, I think a key missing piece of information here is where this alternative approach places uncommitted writes. It is really hard to understand how this approach is meant to work without knowing that. Also, I think you probably meant to put these two paragraphs in their own section, because they don't have anything to do with the last header ("Indexes and snapshot isolation"). Note from the Author or Editor: [No change in this edition] We have noted this suggestion and will take it into account when preparing a second edition of the book. The section structure is correct. The last two paragraphs describe the copy-on-write approach to maintaining B-tree indexes, which can help with implementing snapshot isolation by using an old B-tree root as the snapshot from which a transaction reads. We will try to make this clearer in the second edition.	Stephen Dewey	Sep 28, 2018	Nov 21, 2018
Printed	Page 249 Entire page	Page 349 appears instead of page 249 on page 249. There is no page 249 content to be found in the book. Page 349 displays correctly, though is duplicated in two places as a result. Note from the Author or Editor: this was a printer error in the 4th printing, but has been corrected since then (6th printing was March 2019).	Anonymous	Jul 01, 2019	Mar 15, 2019
Printed	Page 253 2nd paragraph	Under the bulleted list outlining developments that caused a rethink: RAM became cheap enough that for many use cases is now feasible to keep.... "is now" should read "it is" or "it's".	Simon McClive	Apr 15, 2017	Mar 16, 2018
PDF	Page 257 second bullet point in the middle	You refer to figure 7-1, but figure 7-1 doesn't portray a case of "reading an old version of an object" as you say. Both reads in that figure happen before any writes occur. Perhaps you meant to refer to a different figure. Also on page 258, remove the "a" from before the word "having". Note from the Author or Editor: Changing the reference to Figure 7-4 instead of Figure 7-1.	Stephen Dewey	Oct 03, 2018	Nov 21, 2018
PDF	Page 281 3rd paragraph	"packed-switched" in "Ethernet and IP are packed-switched protocols" should be "packet-switched" Note from the Author or Editor: Will be fixed in QC1	Krzysztof Sobusiak	Jan 02, 2017	Mar 01, 2017
PDF	Page 288 3rd-to-last paragraph	"These jumps, as well as the fact that they often ignore leap seconds, make time-of-day clocks unsuitable for measuring elapsed time" Based on the reference you linked, it seems the CloudFlare problem was actually that the clock used by its code DID take leap seconds into account, but the application code ignored the fact that this could happen. So perhaps a better phrasing would be: "These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks unsuitable for measuring elapsed time" In other words the problem isn't that time-of-day clocks ignore leap seconds, it's the reverse, that they are affected by them. But then the application code ignores the fact that this can happen. Note from the Author or Editor: I agree with the suggested wording change, and have updated the text in Atlas.	Stephen	Dec 03, 2018	Mar 15, 2019
PDF	Page 293 (Sixth Early Release) Ch8, The leader and the lock, 2nd paragraph	Minor problems with plurals: ...even if a nodes believes that it is... [change 'nodes' to 'node'] ... mean the majority of nodes agrees! ... [change 'agrees' to 'agree'] Note from the Author or Editor: Fixed in next early release update.	Ross	Aug 13, 2016	Mar 01, 2017
PDF	Page 302 (Sixth Early Release) Ch8, Summary, 4th paragraph	The wording feels a bit awkward in - 'The only way how information can flow...' Perhaps drop 'how'? 'The only way information can flow ...' Note from the Author or Editor: Fixed in next early release update.	Ross	Aug 13, 2016	Mar 01, 2017
	Page 305 last paragraph in the middle at the start of the (	it says "i.e. if you have four nodes" but this is an example so it should be "e.g." Note from the Author or Editor: Fixed.	Megan Cutrofello	May 26, 2022
PDF	Page 317 4th paragraph	"...are easier use correctly" should be "...are easier to use correctly" Note from the Author or Editor: Fixed in copyedit	Krzysztof Sobusiak	Jan 05, 2017	Mar 01, 2017
Printed	Page 322 2nd full paragraph, 2nd sentence	Unnecessary repeat of word 'first' in same sentence, keeping the first one and suppressing the second one would do: But first we first need to explore the range of guarantees...	Philippe Derome	Apr 25, 2017	Mar 16, 2018
	Page 358 1st paragraph of "Coordinator failure" section	in the sentence "if any of the prepare requests fail or time out" these verbs should agree with "any" so it should be "if any...fails or times out" (fail -> fails and time -> times), also the next clause needs fail -> fails as well p.s. PLEASE make a form where I can submit multiple errata at once Note from the Author or Editor: Fixed.	Megan Cutrofello	May 26, 2022
	Page 418 Last paragraph	The final paragraph in this section talks about general priority preemption in open-source schedulers (with the caveat "as of this writing", so this erratum mostly notes that a few things have changed). Pod preemption has existed in Kubernetes since ~2019, see "Pod Priority and Preemption" and "Node Pressure Eviction" in the Kubernetes docs. Other open-source schedulers like Nomad have also included task/job-based preemption as a top-level concern since ~2019. Note from the Author or Editor: Correct, this has changed since the first edition came out in 2017. It will be corrected in the second edition.	Taylor Chaparro	Jul 02, 2023
	Page 450 first paragraph	you say "We'll discuss a more sophisticated way of freeing disk space later" without a specific section and I think this might be the only time in the entire book that you do something like this so it stood out to me (it's "Log compaction" on p.456) Note from the Author or Editor: Adding a cross-reference as suggested.	Megan Cutrofello	May 26, 2022
ePub	Page 452 Chapter 8, Figure 8-1	Figure 8-1 seems to be a duplicate of Figure 2-4, and does not match the description of what 8-1 is trying to communicate.	Donald Kjer	Jan 24, 2016	Mar 01, 2017
	Page 474 3rd complete paragraph (4th paragraph including the partial one)	"The difference to batch jobs is..." - this is ok in British English but not really in American English, and "the difference from" is ok in both so I'd change it, "to" -> "from" Note from the Author or Editor: Fixed.	Megan Cutrofello	May 26, 2022
	Page 510 Bullet point under 2nd paragraph	This is regarding the section explaining what cas(x, v_old, v_new) => r means. In the penultimate sentence, it mentions "If x ≠ v_old then the operation should leave the register unchanged and return an error.". Ideally x here is the register and it being equal to v_old or not does not make a difference. What it means is that if the value being held by register x is not equal to v_old, then the register should be left unchanged. Note from the Author or Editor: Correct, I was being imprecise with notation here by conflating the register object with its current value. Changing to "If the value of x is different from v_old, then the operation should leave the register unchanged..."	Ankush Sharma	Sep 16, 2023
	Page 541 2nd complete paragraph (3rd including the partial one)	"the personal data it has collected is one of the assets that get sold" -- "gets" should agree with "one" not with "assets" so get -> gets Note from the Author or Editor: Fixed.	Megan Cutrofello	May 26, 2022
Mobi	Page 580 "Because all joins and data dependencies in a workflow..."	Minor insignificant thing, but thought I'd bring it up. Where you say: "Because all joins and data dependencies in a workflow..." It begins with an extra space, at least on the Kindle store edition.	Jorge Israel Peña	Feb 17, 2021	Mar 26, 2021
Mobi	Page 3131 text	TYPO: "commiting the write" should be "committing the write" Redundancy: "The blocking of readers and writers is implemented by a having a lock on each object in the database." Should be: "implemented by having a lock on " Note from the Author or Editor: Fixed in Atlas.	Anonymous	Dec 16, 2019	Jan 24, 2020
Mobi	Page 6130	Hard to tell because I use kindle, thus I don't see pages but locations in location 6130 and the first paragraph you see a repeated "to" "When a transaction wants to to commit." Note from the Author or Editor: Fixed in next Early Release update.	Wilmer Andres Daza Gomez	Aug 26, 2015	Mar 01, 2017
Mobi	Page 10672 throughout	Notes from Amazon Your book has an external links that do not work. For example at the following locations "1789,2763,2780,2816,5030" and throughout the book. For example "Apache CouchDB 1.6." Documentation. Please update a valid external URL.To ensure future access to reference material, Amazon strongly recommends submitting these types of links to an archive service, and including the archived link in the book. If the link is broken due to forces outside your control, it should be deactivated and “[URL inactive]” should be added following the link text." Note from the Author or Editor: I have gone through all URLs in the book and fixed all broken links as of March 2020.	Anonymous	Jan 22, 2020	Mar 27, 2020
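
Illustration for the page 510 erratum above: the corrected wording distinguishes the register x from the value it currently holds. The sketch below shows that distinction under stated assumptions. The class name `Register`, the `"ok"` and `"error"` return strings, and the use of a lock to make the operation atomic are hypothetical choices made for this example, not taken from the book.

```python
import threading


class Register:
    """A single register supporting cas(v_old, v_new) (illustrative only)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # makes compare and set one atomic step

    def read(self):
        with self._lock:
            return self._value

    def cas(self, v_old, v_new):
        # Compare the *current value* of the register with v_old,
        # not the register object itself.
        with self._lock:
            if self._value != v_old:
                return "error"  # register is left unchanged
            self._value = v_new
            return "ok"


x = Register(1)
first = x.cas(1, 2)   # current value is 1, so the swap succeeds
second = x.cas(1, 3)  # current value is now 2, so the swap is rejected
final = x.read()
```

Here `first` is `"ok"`, `second` is `"error"`, and `final` is `2`, because the rejected call leaves the register unchanged.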