Errata for High Performance Spark

This errata list records errors, and their corrections, found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version | Location | Description | Submitted by | Date submitted
PDF Page 9 (paragraph 'Spark Components')
legend text (#5) at the bottom of the page

The legend says '... Datasets are DataFrames of Row objects...', but actually it is the other way around: a DataFrame is a Dataset of Row objects. Even the caution on page 29 confirms this: '...alias of DataFrame = Dataset[Row] is broken in Java...'
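For reference, the org.apache.spark.sql package object in Spark 2.x defines the alias the other way around:

type DataFrame = Dataset[Row]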

Alex Alex  Jan 04, 2024 
PDF Page 87
Figure 5-1

I think the error is probably a typo, but it can really confuse readers.

In the illustration of wide dependencies in Figure 5-1, the transformation is a groupByKey. However, Partition 1 has a key with id 3 and Partition 2 also has a key with id 3, even though groupByKey should group all identical keys into the same partition. I think Partition 2 should actually contain keys 4 and 5 instead of 3 and 5: key 4 is not grouped into any partition, so that must be the mistake.
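A minimal sketch of the semantics at issue (the RDD contents here are illustrative, and sc is assumed to be a SparkContext):

val pairs = sc.parallelize(Seq((3, "a"), (4, "b"), (3, "c"), (5, "d")))
// groupByKey shuffles every value for a given key into the same partition,
// so key 3 cannot end up in two different output partitions.
val grouped = pairs.groupByKey()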

Sarin Madarasmi  Feb 13, 2018 
Printed Page 89
Figure 5-1

There is a typo in Figure 5-1: the second partition (from the left) of rdd3 should contain [4,1], not [3,1].

This correction duplicates one submitted against the PDF version.

Deborah Siegel  Mar 18, 2018 
Printed Page 127
Entire chapter

The "Goldilocks" example provides a lot of insight.

However, I think it can be solved more efficiently by observing that if the total number of rows is N, then searching for the values of the elements ranked i1, i2, ..., ik in each column is equivalent to searching for the elements ranked i1, i2, ..., ik, N+i1, N+i2, ..., N+ik, 2N+i1, 2N+i2, ..., 2N+ik in the sorted list of all (column_index, value) pairs.

So the simplest solution would be:
- Make (column_index, value) pairs (as in Goldilocks V2)
- Sort the pairs, then zipWithIndex()
- Filter: targetRanks.contains(indexFromZip % N), where targetRanks is the collection of rank indices i1, i2, ..., ik and % is the usual modulo operator

It requires one count, one sort, one zipWithIndex, and one filter, with no requirements on the partitioning. I think this will be one of the simplest and fastest solutions; a sketch follows below.
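A minimal Scala sketch of this approach (the names findRankStatistics, rows, and targetRanks are hypothetical; it assumes the input is an RDD[Array[Double]] and that targetRanks holds 0-based within-column rank indices):

import org.apache.spark.rdd.RDD

def findRankStatistics(rows: RDD[Array[Double]], targetRanks: Set[Long]): RDD[(Int, Double)] = {
  val n = rows.count() // total number of rows, N
  // (column_index, value) pairs, as in Goldilocks V2
  val pairs = rows.flatMap(_.zipWithIndex.map(_.swap))
  // After sorting by (column_index, value), the element with within-column
  // rank r in column c sits at global index c * N + r.
  pairs.sortBy(identity)
    .zipWithIndex()
    .filter { case (_, indexFromZip) => targetRanks.contains(indexFromZip % n) }
    .map { case (pair, _) => pair }
}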

If there are invalid elements in the columns, so that not all columns have N valid values, it is straightforward to extend the solution: count the valid values per column, build a cumulative list of these counts as start offsets, and replace the modulo division above by the offset to the corresponding entry in the cumulative list.

[I tried to submit this to the Google Groups email address a month ago, but it seems to no longer be active.]

Claus Brenner  May 02, 2018 
PDF Page 228
Example 9-13

The code of the example is for training, not for predicting.

It should be replaced with:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.{Vector => SparkVector}
import org.apache.spark.rdd.RDD

def predict(model: LogisticRegressionModel, rdd: RDD[SparkVector]): RDD[Double] = {
  model.predict(rdd)
}

Jongyoung Park  Nov 26, 2017 
PDF Page 259
4th paragraph

reduceFunk -> reduceFunc

Jongyoung Park  Dec 24, 2017 
PDF Page 276
table at bottom

In the rows for 'spark.executor.memory' and 'spark.executor.cores', the contents of the Meaning and Default Value columns should be swapped.
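(For reference, the Spark configuration documentation gives spark.executor.memory a default of 1g, meaning the amount of memory to use per executor process, and spark.executor.cores a default of 1 in YARN mode, or all available cores on the worker in standalone mode, meaning the number of cores to use on each executor.)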

Jongyoung Park  Jan 07, 2018 
PDF Page 297
Example A-6

The example code is the same as that of Example A-5.

Jongyoung Park  Jan 21, 2018 
PDF Page 301
first tip

The closing parenthesis is missing.

Jongyoung Park  Jan 24, 2018 
PDF Page 301
The paragraph right below the first tip

log4j.xm -> log4j.xml

Jongyoung Park  Jan 24, 2018