Designing Data-Intensive Applications

Errata for Designing Data-Intensive Applications



The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.




Version Location Description Submitted By Date Submitted Date Corrected
PDF

stolen book? https://github.com/binhnguyennus/designing-data-intensive-applications

Note from the Author or Editor:
Thank you for bringing this to our attention! We have sent a request to GitHub to have it removed.

Anonymous  Jan 30, 2018  Jan 31, 2018
Safari Books Online
ch04
"Dynamically Generated Schemas", 2nd paragraph

In the text below: [...] problems with textual formats (JSON, CSV, SQL) "SQL" is obviously not a textual format. In the context, the author was probably referring to "XML". The resulting fixed line would be: [...] problems with textual formats (JSON, CSV, XML)

Note from the Author or Editor:
Erratum is correct, I have corrected the text in Atlas

Punleuk Oum  Apr 30, 2018  Jun 01, 2018
Safari Books Online
ch04
"code generation and dynamically typed languages", 3rd paragraph

"[...] code generation is an unnecessarily obstacle to getting to the data." -> "[...] code generation is an unnecessary obstacle to getting to the data."

Note from the Author or Editor:
I have made this change in Atlas

Punleuk Oum  Apr 30, 2018  Jun 01, 2018
Safari Books Online
ch 6
references

Reference [11] Andrew Wang: “Windows Azure Storage,” umbrant.com, February 4, 2016. should link to https://www.umbrant.com/2016/02/04/windows-azure-storage/

Note from the Author or Editor:
URL of the blog post has changed. We're updating it to https://www.umbrant.com/2016/02/04/windows-azure-storage/

David Waller  Oct 01, 2018  Nov 21, 2018
Safari Books Online
Ch 11
references

Reference [18] Jay Kreps, Neha Narkhede, and Jun Rao: “Kafka: A Distributed Messaging System for Log Processing,” is no longer available at that URL. Suggested alternative: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf

Note from the Author or Editor:
I have updated the URL in Atlas and on https://github.com/ept/ddia-references

David Waller  Nov 30, 2018  Mar 15, 2019
PDF
Page x
Top

New types of database [system] (“NoSQL”) have been getting lots of attention, but message queues, caches, search indexes, frameworks s/b systems

Note from the Author or Editor:
Fixed in next Early Release update.

Anonymous  Aug 10, 2015  Mar 01, 2017
Safari Books Online
Chapter 1

"In this Chapter 1, we will start by exploring the fundamentals of what we are trying to achieve: reliabile, scalabile and maintainable data systems". "reliabile" -> "reliable"; "scalabile" -> "scalable".

Note from the Author or Editor:
Fixed in next Early Release update.

Sascha Gottfried  Sep 23, 2015  Mar 01, 2017
Safari Books Online
Ch 4
Chapter 4, section CODE GENERATION AND DYNAMICALLY TYPED LANGUAGES

"compilation" is misspelled as "compliation": "Code generation is often frowned upon in these languages, since they otherwise avoid an explicit compliation step."

Note from the Author or Editor:
Fixed in next early release update.

Philippe Derome  May 23, 2016  Mar 01, 2017
Safari Books Online
Ch 4

The following choice of words ("unpack") feels awkward: "In the rest of this chapter we will unpack some of the most common ways how data flows between processes:" It would seem that "reveal", "show", or "describe" would be a more common fit than "unpack".

Note from the Author or Editor:
Fixed in next early release update.

Philippe Derome  May 23, 2016  Mar 01, 2017
Safari Books Online
5
Chapter 8

In the sub-section "The truth is defined by the majority" of section "Knowledge, Truth and Lies", there is a typo in the paragraph below Figure 8-5: "However, the storage server rembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33." "rembers" -> "remembers"?

Note from the Author or Editor:
Fixed in next early release update.

Anonymous  Nov 10, 2016  Mar 01, 2017
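
The sentence quoted in this entry describes a storage server that rejects a write carrying a token lower than one it has already processed. A minimal sketch of that check follows, assuming a single in-memory server; the class name `FencedStorage` and its `write` method are hypothetical and are not taken from the book or from any real storage system.

```python
class FencedStorage:
    """Toy storage server that enforces monotonically increasing fencing tokens."""

    def __init__(self) -> None:
        self.highest_token = -1   # highest token seen so far (hypothetical initial value)
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # Reject any write whose token is lower than one already processed.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


storage = FencedStorage()
accepted_34 = storage.write(34, "file", "from client 2")   # newer lock holder
accepted_33 = storage.write(33, "file", "from client 1")   # stale lock holder
print(accepted_34, accepted_33)
```

In this sketch a write with a token equal to the highest seen is accepted, on the assumption that one lock holder may issue several writes under the same token.
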
PDF
Page 7
2nd Paragraph

"for example, randomly killing individual processes without warning — is known as chaos monkey": I don't think it is correct to use "chaos monkey" as an umbrella term in this context; Chaos Monkey is a software application developed to perform that task, not a term with that meaning.

Note from the Author or Editor:
Fixed in next Early Release update.

blankenshipz  Jun 30, 2015  Mar 01, 2017
PDF
Page 15
Box 'Percentiles in Practice', First paragraph, last sentence

"As it takes just one slow call to make the entire end-user request slow, rare slow calls to the backend become much more frequent at the end-user request level" should probably be reworded. 1. Saying "the backend" can be misleading, and is technically inaccurate. It's *multiple* backends. It's only with multiple backends that this increase in frequency of slow end-user requests makes sense. 2. It would be more technically accurate and better to say that it is the *collective frequency* of any slow call to any backend that increases. The slow calls themselves do not exactly increase in frequency.

Jeffrey 'jf' Lim  Dec 19, 2015  Mar 01, 2017
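
The effect discussed in this entry can be illustrated numerically. The sketch below assumes that each backend call is slow independently with the same probability (a simplifying assumption made here, not a claim from the book); under that assumption, the chance that an end-user request touching n backends hits at least one slow call is 1 - (1 - p)^n. The function name `prob_request_slow` is hypothetical.

```python
def prob_request_slow(p_slow_call: float, n_backend_calls: int) -> float:
    """Probability that at least one of n independent backend calls is slow."""
    return 1.0 - (1.0 - p_slow_call) ** n_backend_calls


# Each individual call is slow 1% of the time; the request fans out to n backends.
for n in (1, 10, 100):
    print(n, round(prob_request_slow(0.01, n), 4))
```

The per-call probability never changes; what grows with the fan-out is the collective chance that some call in the request is slow, which is the distinction the submitter draws.
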
PDF
Page 26
2nd paragraph

"data model" should be pluralized in 'There are many different kinds of data model' (should be 'There are many different kinds of data models')

Jeffrey 'jf' Lim  Dec 19, 2015  Mar 01, 2017
Printed
Page 50
1st paragraph

"Besides these, there are also imperative graph query languages such as Gremlin..." I believe that Gremlin supports both imperative and declarative traversals. The Wikipedia page is actually a useful reference here: https://en.wikipedia.org/wiki/Gremlin_(programming_language)

Note from the Author or Editor:
Correct, the declarative features seem to have been added to Gremlin since I last looked at it. I will reword this sentence to avoid the confusion.

Jeff Carpenter  Dec 19, 2017  Mar 16, 2018
Safari Books Online
53
Designing Data-Intensive Applications Chapter 2. The Battle of the Data Models

"... where the the database ..." would be better written with just one "the".

Note from the Author or Editor:
Fixed in next Early Release update.

Klaus Ita  Aug 06, 2015  Mar 01, 2017
PDF
Page 53
last paragraph

The following sentence has a typo: "In a graph database, there are is no such restriction: any vertex can have an edge to any other vertex." It should be: "In a graph database, there is no such restriction: any vertex can have an edge to any other vertex."

Slavcho Slavchev  Jan 23, 2016  Mar 01, 2017
PDF
Page 73
last paragraph of this page

"The merging and compaction of frozen segments can be done in a background thread...continue to serve read and write requests as normal, using the old segment files": what "frozen" means is a little vague. How can write requests be served using the old segment files when those files have been frozen or closed?

Note from the Author or Editor:
Rephrasing this sentence to be clearer in the next update.

yuxh  May 04, 2019 
PDF
Page 76
2nd-3rd paragraphs

TLDR for the below comments: The second and third paragraphs downplay the differences with Bitcask, which was pretty confusing to me at first. Unless I'm mistaken, this sentence ("We also require that each key only appears once within each merged segment file (the compaction process already ensures that).") would be more accurately/helpfully written ("We also require that each key only appears once within each segment file. Incoming keys are consolidated by a tree structure that we will discuss shortly.") This resolves my first confusion, because I thought you were implying that segment files could have multiple entries per key until they were merged. Also, the sentence "At first glance, that requirement seems to break our ability to use sequential writes, but we’ll get to that in a moment" might be more helpfully written "This means that we cannot append new keys directly to the segments (as in Bitcask) because we can only have one entry per key/segment. However, creation of new segment files is still performed using sequential writes, as we will show in a moment." This resolves my second confusion, because I thought you were implying that you would show how all new data could still be written directly to the segments.

Note from the Author or Editor:
Thanks for the suggested wording improvements. In the next update of the book I will tweak the wording to avoid this confusion.

Stephen Dewey  Sep 10, 2018  Nov 21, 2018
Safari Books Online
82
3rd paragraph from the bottom

Swagger is mentioned as a description language for RESTful APIs; however, this information is neither exactly correct nor complete. Swagger is an old name (since Nov 5, 2015, though still used informally). The current official name is "OpenAPI" (https://www.openapis.org). Swagger is an API documentation tool and, even though it is designed for RESTful APIs, it is also used as an interactive documentation tool for other types of HTTP APIs. Moreover, Swagger/OpenAPI is not the only tool for API documentation and design. Other popular tools include RAML (http://raml.org/) and API Blueprint (https://apiblueprint.org).

Note from the Author or Editor:
Making an appropriate change in QC1 review, to be included in QC2.

Andrzej Jarzyna  Feb 03, 2017  Mar 01, 2017
PDF
Page 87
3rd paragraph

"increasinly" should be "increasingly"

Note from the Author or Editor:
Fixed in next Early Release update.

Greg Nofi  Nov 05, 2015  Mar 01, 2017
PDF
Page 107
line 5

Is this right? "From time to time to time": I think it is a typo for "From time to time".

Note from the Author or Editor:
Remove spurious "to time".

DaeMyung Kang  Dec 12, 2014  Mar 01, 2017
PDF
Page 137
Under heading

"There are two common ways how data is distributed across multiple nodes:" Delete "how".

Note from the Author or Editor:
Fixed in next Early Release update.

Anonymous  Aug 10, 2015  Mar 01, 2017
PDF
Page 138
"Distributed actor frameworks" section

The "Distributed Actor Frameworks" section is missing important background information. It doesn't really describe why we would want to use such a framework, and it doesn't explain how the frameworks can still be useful despite the potential for lost messages. To make this section useful, I think it would be worth adding a paragraph or two to address these points.

Note from the Author or Editor:
[No change in this edition] We have noted this suggestion and will take it into account when preparing a second edition of the book.

Stephen Dewey  Sep 17, 2018  Nov 21, 2018
PDF
Page 190-191
Final paragraph ("However, if you want to allow...")

It would help to address how tombstones help with deletes during concurrent writes (not just how it helps with cleaning up siblings after the fact). In the shopping cart example, if the 4th write was "delete milk, delete eggs, add ham" and a tombstone was added indicating that milk and eggs were deleted at version 4, you would still have milk and eggs coming back in the next write at version 5 (based on version 3). The question then is whether the database assumes that milk and eggs were only included in version 5 because they were part of version 3 (in which case it could delete them now) or whether the database assumes the user was reaffirming that they wanted milk and eggs (in which case the new write should overwrite the tombstone). It doesn't seem like there's an easy answer because there isn't enough information to really know what the intent was.

Note from the Author or Editor:
[No change in this edition] We have noted this suggestion and will take it into account when preparing a second edition of the book.

Stephen Dewey  Oct 15, 2018  Nov 21, 2018
Printed
Page 202
2nd paragraph

After figure 6-2, the text states that Volume 12 of the pictured encyclopedia (Trudeau - Zywiec) contains "words starting with T, U, V, X, Y, and Z." However, assuming that the encyclopedia uses the English alphabet, it would also contain words starting with W.

Milo Price  Dec 28, 2017  Mar 16, 2018
Printed
Page 203
5th paragraph

The book states "Cassandra and MongoDB use MD5"; however, Cassandra uses Murmur3 hashing.

Note from the Author or Editor:
Cassandra prior to version 1.2 used MD5, and version 1.2 switched to using Murmur3 by default. I will clarify this in the text.

Ulf Gitschthaler  Jun 26, 2017  Mar 16, 2018
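
To illustrate what using MD5 to place keys means in this entry, here is a minimal sketch of deriving a partition from an MD5 hash of the key. It is not Cassandra's or MongoDB's actual code: the function names, the choice of the first 8 bytes of the digest, and the equal-sized hash ranges are all assumptions made for illustration. Murmur3 is not in the Python standard library, so only MD5 is shown.

```python
import hashlib


def stable_hash64(key: str) -> int:
    """Map a key to a 64-bit integer using MD5 (stable across processes and machines)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")   # keep the first 8 bytes (assumption)


def partition_for(key: str, num_partitions: int) -> int:
    """Assign the key to one of num_partitions equal, contiguous ranges of the hash space."""
    return stable_hash64(key) * num_partitions // 2**64


for key in ("alice", "bob", "carol"):
    print(key, partition_for(key, 8))
```

Unlike Python's built-in `hash()` on strings, which is randomized per process by default, an MD5-based hash gives the same result everywhere, which is the property a partitioner needs.
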
ePub
Page 222
2nd paragraph

"each partitions maintains..." should be "each partition maintains..."

Note from the Author or Editor:
Fixed in next Early Release update.

Anonymous  Oct 29, 2015  Mar 01, 2017
Printed
Page 227
Citation 19

Re: SSDs losing power in just weeks in unusual temps. The citation does say this, but itself refers to a presentation slide that JEDEC has called misunderstood: https://www.jedec.org/news/pressreleases/jedec-update-solid-state-drive-standard While the point is certainly valid that SSDs can lose data in storage, the very short time frames given are talking about EOL'ed enterprise drives. Perhaps a footnote would help for expanding on this alarming statistic. Excellent book by the way, really enjoying it!

Note from the Author or Editor:
Thank you for pointing this out; I will update the wording to clarify this point in the next update of the book.

Corey Sciuto  Aug 26, 2018  Nov 21, 2018
PDF
Page 241
full page prior to "Indexes and snapshot isolation"

If you have the time, I'm wondering if you can shed some light on this. I found the discussion in this section to be very confusing. I think the source of my confusion is that you haven't explained how, if at all, uncommitted rows are kept separate from the list of committed rows in the object version lists that you have shown. Reading between the lines I think the answer is that they are NOT kept separate at all, so an uncommitted write goes immediately into the same list as committed transactions list. Is that true? That would explain why in rule #1 on page 241 you say writes by transactions which were running at the beginning a snapshot transaction are ignored "even if" any of those writes commits. You say "even if" because a transaction could also see the uncommitted writes of earlier transactions that are still running. It needs the list of "transactions that were running when I started" to know that either of the following is true: 1) this transaction is still running and was running when I started, 2) this transaction has committed or aborted (but isn't cleaned up yet) and was running when I started. In both cases it ignores the row. #2 is also confusing because it seems like a superfluous rule after #1 and #3. Does a transaction need some mechanism to determine "rows from aborted transactions" in addition to rules 1+3? The only way I can think to resolve this is that the assumption in my second paragraph is correct (uncommitted rows are not kept separate) and that additionally, it takes some time to clear aborted uncommitted rows from the object version list (and to unmark objects as deleted which were deleted by aborted transactions). Therefore the transaction needs some second list of "aborted transactions which were not cleared when I started" so it can know to ignore them. Is this true? The two paragraphs on page 239 from "To implement snapshot isolation" ending in "for an entire transaction" are also pretty confusing. 
In the first paragraph you say that it's a generalization of the earlier mechanism (which wasn't fully explained, you just said that "any writes by a transaction only become visible to others when that transaction commits"). But then in the second paragraph you imply that MVCC is effectively a distinct mechanism. But the bigger confusion is with that final sentence ("A typical approach"). Why would it make sense to ever base MVCC on a single query? Even read committed is done at the transaction level, not the query level. Thanks in advance for any clarification you can provide.

Note from the Author or Editor:
[No change in this edition] We have noted this suggestion and will take it into account when preparing a second edition of the book.

Stephen Dewey  Sep 28, 2018  Nov 21, 2018
PDF
Page 242
first two paragraphs

Similar to my earlier question, I think a key missing piece of information here is where this alternative approach places uncommitted writes. It is really hard to understand how this approach is meant to work without knowing that. Also, I think you probably meant to put these two paragraphs in their own section, because they don't have anything to do with the last header ("Indexes and snapshot isolation").

Note from the Author or Editor:
[No change in this edition] We have noted this suggestion and will take it into account when preparing a second edition of the book. The section structure is correct. The last two paragraphs describe the copy-on-write approach to maintaining B-tree indexes, which can help with implementing snapshot isolation by using an old B-tree root as the snapshot from which a transaction reads. We will try to make this clearer in the second edition.

Stephen Dewey  Sep 28, 2018  Nov 21, 2018
Printed
Page 253
2nd paragraph

Under the bulleted list outlining developments that caused a rethink: "RAM became cheap enough that for many use cases is now feasible to keep...." "is now" should read "it is" or "it's".

Simon McClive  Apr 15, 2017  Mar 16, 2018
PDF
Page 257
second bullet point in the middle

You refer to figure 7-1, but figure 7-1 doesn't portray a case of "reading an old version of an object" as you say. Both reads in that figure happen before any writes occur. Perhaps you meant to refer to a different figure. Also on page 258, remove the "a" from before the word "having".

Note from the Author or Editor:
Changing the reference to Figure 7-4 instead of Figure 7-1.

Stephen Dewey  Oct 03, 2018  Nov 21, 2018
PDF
Page 281
3rd paragraph

"packed-switched" in "Ethernet and IP are packed-switched protocols" should be "packet-switched"

Note from the Author or Editor:
Will be fixed in QC1

Krzysztof Sobusiak  Jan 02, 2017  Mar 01, 2017
PDF
Page 288
3rd-to-last paragraph

"These jumps, as well as the fact that they often ignore leap seconds, make time-of-day clocks unsuitable for measuring elapsed time" Based on the reference you linked, it seems the CloudFlare problem was actually that the clock used by its code DID take leap seconds into account, but the application code ignored the fact that this could happen. So perhaps a better phrasing would be: "These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks unsuitable for measuring elapsed time" In other words the problem isn't that time-of-day clocks ignore leap seconds, it's the reverse, that they are affected by them. But then the application code ignores the fact that this can happen.

Note from the Author or Editor:
I agree with the suggested wording change, and have updated the text in Atlas.

Stephen  Dec 03, 2018  Mar 15, 2019
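
The distinction drawn in this entry can be shown with a short sketch: elapsed time is measured with a monotonic clock rather than a time-of-day clock, because the latter can jump (for example when it is adjusted). The helper name `measure_elapsed` is hypothetical; `time.monotonic()` and `time.time()` are standard-library functions.

```python
import time


def measure_elapsed(fn) -> float:
    """Return how long fn() took, in seconds, using a clock that never goes backwards."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start


# A difference of two time.time() readings could come out negative if the
# time-of-day clock is stepped backwards between them; the monotonic
# difference cannot.
elapsed = measure_elapsed(lambda: time.sleep(0.01))
print(f"elapsed: {elapsed:.4f} s")
```
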
PDF
Page 293
(Sixth Early Release) Ch8, The leader and the lock, 2nd paragraph

Minor problems with plurals: ...even if a nodes believes that it is... [change 'nodes' to 'node'] ... mean the majority of nodes agrees! ... [change 'agrees' to 'agree']

Note from the Author or Editor:
Fixed in next early release update.

Ross  Aug 13, 2016  Mar 01, 2017
PDF
Page 302
(Sixth Early Release) Ch8, Summary, 4th paragraph

The wording feels a bit awkward in - 'The only way how information can flow...' Perhaps drop 'how'? 'The only way information can flow ...'

Note from the Author or Editor:
Fixed in next early release update.

Ross  Aug 13, 2016  Mar 01, 2017
PDF
Page 317
4th paragraph

"...are easier use correctly" should be "...are easier to use correctly"

Note from the Author or Editor:
Fixed in copyedit

Krzysztof Sobusiak  Jan 05, 2017  Mar 01, 2017
Printed
Page 322
2nd full paragraph, 2nd sentence

Unnecessary repetition of the word "first" in the same sentence; keeping the first one and dropping the second would do: "But first we first need to explore the range of guarantees..."

Philippe Derome  Apr 25, 2017  Mar 16, 2018
ePub
Page 452
Chapter 8, Figure 8-1

Figure 8-1 seems to be a duplicate of Figure 2-4, and does not match the description of what 8-1 is trying to communicate.

Donald Kjer  Jan 24, 2016  Mar 01, 2017
Mobi
Page 6130

Hard to give a page number because I use a Kindle, which shows locations rather than pages. At location 6130, in the first paragraph, there is a repeated "to": "When a transaction wants to to commit."

Note from the Author or Editor:
Fixed in next Early Release update.

Wilmer Andres Daza Gomez  Aug 26, 2015  Mar 01, 2017