Errata for Stream Processing with Apache Spark

The errata list contains errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
NA
CH2

"the oldest timestamp that we will accept on the data stream"

The word "oldest" here could be misleading. On first reading, I took this to mean "the earliest timestamp that we will accept on the data stream".

"is usually much older than the average timestamp of the latest few elements."

The word "older" here seems potentially confusing.
I'd also wonder what is meant by "latest few elements".

"Once this notion of watermark is defines, the stream processor can classify its output in two categories: Either it is producing output relative to events that are all older than the watermark, in which case the output is final because all of those elements have been observed so far, and no further event older than that will ever be considered, or it is producing an output relative to the data that is before the watermark and a new delayed element newer than the watermark could arrive on the stream at any moment and can change the outcome. In this latter case, we can consider the output as provisional, as newer data can still change the final outcome while in the former case, the result is final and no new data will be able to change it."

This paragraph seems unnecessarily difficult to parse.

"That is, we can only expect the results of a streaming computation based on event-time processing to be meaningful if the watermark allows for the delays that messages of our stream will actually encounter between their creation-time and their order of arrival on the input data stream.In summary, this only works if the watermark allows to catch all delayed events."

It might be worth clarifying this. You've just shown a watermark defined as a fixed delay, and we know that a fixed delay can't guarantee that every delayed message arrives before the watermark has passed it.

Note from the Author or Editor:
Thanks for your suggestions. They have been incorporated in the draft.

Colin Jack  May 05, 2019 
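
The last point in the entry above, that a fixed-delay watermark cannot guarantee that every late message is caught, can be illustrated with a minimal Python sketch. It is a simplified per-event model, not Spark's implementation: the names `Event`, `split_by_watermark`, and `allowed_delay`, and the sample timestamps, are all hypothetical. The watermark is assumed to trail the largest event time seen so far by `allowed_delay`.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_time: int    # when the event was created (seconds); hypothetical field
    arrival_time: int  # when it reached the processor (seconds); hypothetical field


def split_by_watermark(events, allowed_delay):
    """Split events into (accepted, dropped) under a fixed-delay watermark.

    Assumption: the watermark is the largest event time seen so far
    minus `allowed_delay`, re-evaluated before each arriving event.
    """
    max_event_time = None
    accepted, dropped = [], []
    # Process events in the order in which they arrive.
    for event in sorted(events, key=lambda e: e.arrival_time):
        if max_event_time is not None and event.event_time < max_event_time - allowed_delay:
            # The event is older than the watermark: too late to be considered.
            dropped.append(event)
        else:
            accepted.append(event)
            if max_event_time is None or event.event_time > max_event_time:
                max_event_time = event.event_time
    return accepted, dropped


# Sample data: the last event was created at t=95 but arrives after t=110 was seen.
sample_events = [Event(100, 101), Event(110, 111), Event(104, 112), Event(95, 113)]

_, dropped_10 = split_by_watermark(sample_events, allowed_delay=10)
_, dropped_20 = split_by_watermark(sample_events, allowed_delay=20)
print([e.event_time for e in dropped_10])  # the t=95 event falls behind the watermark
print([e.event_time for e in dropped_20])  # a larger delay keeps it
```

With a delay of 10, the event created at t=95 is dropped because the watermark has already advanced to 100. A delay of 20 keeps it, but any fixed value can still be exceeded by a message that is delayed even longer.
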
NA
CH2

"In fact, Structured Streaming pushes this logic by considering a data stream as a virtual table of records, where each row corresponds to an element....Whether streams are viewed as a continuously extended collection or as a table, this approach gives us some insight on the kind of computation we may find interesting"

You haven't actually described what you mean in much detail, and you don't elaborate on it in the first few paragraphs of the "Stateful Streams" section. In addition, the first paragraph of the "Stateful Streams" section feels like it could be broken up a bit.

Note from the Author or Editor:
Suggestion accepted and integrated into a new version.

Colin Jack  May 02, 2019 
NA
CH2, sliding windows section

1. "drastical" should just be "drastic".
2. "and important characteristic" should be "an important characteristic".
3. I'm not sure the first paragraph in this section is particularly clear in general. Omitting it might be better, as the second paragraph explains very clearly what a sliding window is and why you might want one.

Note from the Author or Editor:
Suggestion accepted.

Colin Jack  May 02, 2019 
NA
CH2

"Tumbling Windows are the norm when we require to aggregate our data evenly over a period of time, independently from previous periods."

I think the word "evenly" here could be misleading. I also think the sentence might be clearer if it ended ", with each period independent from previous periods."

Note from the Author or Editor:
Thanks for the suggestion. It has been incorporated in the draft.

Colin Jack  May 02, 2019 
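
The suggested wording in the entry above, "with each period independent from previous periods", can be made concrete with a small Python sketch of tumbling-window counting. This is an illustration under stated assumptions, not the book's or Spark's code: `tumbling_counts` and `window_size` are hypothetical names, and timestamps are assumed to be plain integers in seconds.

```python
from collections import defaultdict


def tumbling_counts(event_times, window_size):
    """Count events per tumbling window, keyed by window start time.

    Each event falls into exactly one window, [start, start + window_size),
    so the count for one period never depends on any other period.
    """
    counts = defaultdict(int)
    for t in event_times:
        # Round the timestamp down to the start of its window.
        window_start = (t // window_size) * window_size
        counts[window_start] += 1
    return dict(sorted(counts.items()))


print(tumbling_counts([1, 4, 9, 10, 15, 27], window_size=10))
# {0: 3, 10: 2, 20: 1}
```

Because the windows do not overlap, no event contributes to two periods. A sliding window, by contrast, would let one event count toward several overlapping periods.
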
NA
CH1

A small thing: the text mentions "Spark core engine", but that isn't the label on the corresponding box in Figure 1-1.

Note from the Author or Editor:
Thanks for the feedback. This has been addressed in a new draft.

Colin Jack  May 02, 2019 
NA
CH1

"can assume to have access to the complete dataset"

Note from the Author or Editor:
Thanks for your feedback, Colin.
I assume that this comment relates to the complexity of the sentence. We are addressing this in an update of the text.

Colin Jack  May 02, 2019 
NA
CH1

"For any data-carrying executor that may crash, the data it contained when it does crash has a copy on another machine.As a result, it is enough to re-launch the task that was running on this executor."

...could be rewritten as...

"All data is replicated so if a data-carrying executor crashes it is enough to relaunch the task that was running on the crashed executor."

Note from the Author or Editor:
Thanks for the feedback. This has been addressed in a new draft.

Colin Jack  May 02, 2019