Errata for High Performance Spark

This errata list records errors, and their corrections, found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version | Location | Description | Submitted by | Date submitted
PDF Page 9 (paragraph 'Spark Components')
legend text (#5) at the bottom of the page

The legend says '... Datasets are DataFrames of Row objects...', but actually it is the other way around: a DataFrame is a Dataset of Row objects. Even the caution on page 29 confirms this: '...alias of DataFrame = Dataset[Row] is broken in Java...'
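For reference, the org.apache.spark.sql package object in Spark 2.x defines the alias the other way around:

type DataFrame = Dataset[Row]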

Alex Alex  Jan 04, 2024 
PDF Page 87
Figure 5-1

I think the error is probably a typo, but it can really confuse readers.

In the illustration of wide dependencies in Figure 5-1, the transformation is a groupByKey. However, Partition 1 has a key with id 3 and Partition 2 also has a key with id 3, even though groupByKey should group all identical keys into the same partition. I think Partition 2 should actually contain keys 4 and 5 instead of 3 and 5: key 4 is not grouped into any partition, so that must be the mistake.
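A minimal sketch of the semantics at issue (the RDD contents here are illustrative, and sc is assumed to be a SparkContext):

val pairs = sc.parallelize(Seq((3, "a"), (4, "b"), (3, "c"), (5, "d")))
// groupByKey shuffles every value for a given key into the same partition,
// so key 3 cannot end up in two different output partitions.
val grouped = pairs.groupByKey()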

Sarin Madarasmi  Feb 13, 2018 
Printed Page 89
Figure 5-1

There is a typo in Figure 5-1: the second partition (from the left) of rdd3 should contain [4,1], not [3,1].

This correction duplicates one submitted against the PDF version.

Deborah Siegel  Mar 18, 2018 
Printed Page 127
Entire chapter

The "Goldilocks" example provides a lot of insight.

However, I think it can be solved more efficiently by observing that if the total number of rows is N, then searching for the values of the elements ranked i1, i2, ..., ik in each column is equivalent to searching for the elements ranked i1, i2, ..., ik, N+i1, N+i2, ..., N+ik, 2N+i1, 2N+i2, ..., 2N+ik in the sorted list of all (column_index, value) pairs.

So the simplest solution would be:
- Make (column_index, value) pairs (as in Goldilocks V2)
- Sort the pairs, then zipWithIndex()
- Filter: targetRanks.contains(indexFromZip % N), where targetRanks is the collection of rank indices i1, i2, ..., ik and % is the usual modulo operator

It requires one count, one sort, one zipWithIndex, and one filter, with no requirements on the partitioning. I think this will be one of the simplest and fastest solutions; a sketch follows below.
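A minimal Scala sketch of this approach (the names findRankStatistics, rows, and targetRanks are hypothetical; it assumes the input is an RDD[Array[Double]] and that targetRanks holds 0-based within-column rank indices):

import org.apache.spark.rdd.RDD

def findRankStatistics(rows: RDD[Array[Double]], targetRanks: Set[Long]): RDD[(Int, Double)] = {
  val n = rows.count() // total number of rows, N
  // (column_index, value) pairs, as in Goldilocks V2
  val pairs = rows.flatMap(_.zipWithIndex.map(_.swap))
  // After sorting by (column_index, value), the element with within-column
  // rank r in column c sits at global index c * N + r.
  pairs.sortBy(identity)
    .zipWithIndex()
    .filter { case (_, indexFromZip) => targetRanks.contains(indexFromZip % n) }
    .map { case (pair, _) => pair }
}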

If there are invalid elements in the columns, so that not all columns have N valid values, it is straightforward to extend the solution: count the valid values per column, build a cumulative list of these counts as start offsets, and replace the modulo division above by the offset to the corresponding entry in the cumulative list.

[I tried to submit this to the Google Groups email address a month ago, but it seems to no longer be active.]

Claus Brenner  May 02, 2018 
PDF Page 228
Example 9-13

The code of the example is for training, not for predicting.

It should be replaced with:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.{Vector => SparkVector}
import org.apache.spark.rdd.RDD

def predict(model: LogisticRegressionModel, rdd: RDD[SparkVector]): RDD[Double] = {
  model.predict(rdd)
}

Jongyoung Park  Nov 26, 2017 
PDF Page 259
4th paragraph

reduceFunk -> reduceFunc

Jongyoung Park  Dec 24, 2017 
PDF Page 276
table at bottom

In the rows for 'spark.executor.memory' and 'spark.executor.cores', the contents of the Meaning and Default Value columns should be swapped.
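(For reference, the Spark configuration documentation gives spark.executor.memory a default of 1g, meaning the amount of memory to use per executor process, and spark.executor.cores a default of 1 in YARN mode, or all available cores on the worker in standalone mode, meaning the number of cores to use on each executor.)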

Jongyoung Park  Jan 07, 2018 
PDF Page 297
Example A-6

The example code is the same as that of Example A-5.

Jongyoung Park  Jan 21, 2018 
PDF Page 301
first tip

The closing parenthesis is missing.

Jongyoung Park  Jan 24, 2018 
PDF Page 301
The paragraph right below the first tip

log4j.xm -> log4j.xml

Jongyoung Park  Jan 24, 2018