Errata for High Performance Spark

The errata list records errors found after the product was released, together with their corrections. If an error was corrected in a later version or reprint, the date of the correction is shown in the "Date corrected" column.

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version	Location	Description	Submitted By	Date submitted	Date corrected
Other Digital Version	22% Figure 5-1	This figure displays a set of partitions across three Spark transformations: 1) rdd1, 2) rdd2 = rdd1.map(x => (x, 1)), 3) rdd3 = rdd2.groupByKey. The error is that the label to the left of the image refers to rdd3 as "rdd1 child of rdd2". (Kindle version, location 2344 of 10460) Note from the Author or Editor: The bottom left of Figure 5-1 needs to be updated to "rdd3 child of rdd2". I haven't fixed this myself since it's in a figure, but hopefully the production team can fix it if/when we do an update.	Pablo Rodriguez Bertorello	Jul 02, 2017	Oct 20, 2017
PDF	Page 39 table 3-3	In Table 3-3, 'gt' in the last row should be 'geq'. Note from the Author or Editor: Thank you, I've fixed this in atlas.	Jongyoung Park	Jul 28, 2017	Oct 20, 2017
PDF	Page 116 1st paragraph	In the sentence "checkpointing or off_heap persistence or checkpointing", one of the two occurrences of 'checkpointing' should be removed. Note from the Author or Editor: Thank you, I've fixed this in atlas.	Jongyoung Park	Aug 19, 2017	Oct 20, 2017
PDF	Page 121 2nd line in 'LRU caching'	Intead -> Instead Note from the Author or Editor: Thank you, I've fixed this in atlas.	Jongyoung Park	Aug 20, 2017	Oct 20, 2017
PDF	Page 130 2nd paragraph from bottom	'of of' must be 'of' Note from the Author or Editor: I've fixed this in atlas, thank you.	Jongyoung Park	Aug 26, 2017	Oct 20, 2017
PDF	Page 131 TIP	IMO, "an ordering an an object" should be "an ordering of an object". Note from the Author or Editor: Thank you, I've fixed this in atlas.	Jongyoung Park	Aug 26, 2017	Oct 20, 2017
PDF	Page 161 last paragraph	"(value, column index pairs)" should be "(value, column index) pairs". Note from the Author or Editor: Thank you, I've fixed this in the development copy in atlas.	Jongyoung Park	Sep 07, 2017	Oct 20, 2017
PDF	Page 187 "Installing PySpark" section	1. In the second paragraph, the last right parenthesis appears to be extraneous. 2. The first 'Its' in the third paragraph must be 'It's' or 'It is'.	Jongyoung Park	Sep 18, 2017	Oct 20, 2017
Other Digital Version	2091 Example 4-4	The author says "you can prevent the shuffle [...] and persisting the RDD before the join." However, in Example 4-4, the RDD is not persisted before the join. In addition, the author does not explain the difference between persisting and not persisting: does it really affect the performance of the join? (Kindle version, location 2091 of 10460) Note from the Author or Editor: Thanks! We've already changed the text for this in atlas and it should be included in the next update.	Yong-Siang Shih	Jul 15, 2017	Oct 20, 2017
Other Digital Version	2141 Example 4-5	Although a broadcast variable of smallRDDLocal is created, the original smallRDDLocal is used. This seems like a mistake, as the official documentation points out: "After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once." Note from the Author or Editor: Thank you, that's correct. I've updated the example on GitHub and it will show up in the updated e-book whenever we next get a chance for a refresh :)	Yong-Siang Shih	Jul 15, 2017	Oct 20, 2017
Other Digital Version	2912 TIP of Example 5-14	The tip says: "calling distinct will cause a shuffle if the partitioner is not known." However, the distinct function is implemented as map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1). Even if a partitioner is known, the map operation does not preserve the partitioner, so might a shuffle be unavoidable? (Kindle version, location 2912 of 10460) Note from the Author or Editor: This is true, a shuffle will occur in either case. However, if the partitioner is known in advance, the reduce step will be able to remove all duplicates prior to the shuffle. I've clarified the text for this in our repo (although it may be a while before this makes it into the Kindle version).	Yong-Siang Shih	Jul 15, 2017	Oct 20, 2017
Other Digital Version	3209 Example 5-23	The author claims that by persisting rddA, the "sort stage" will occur only once. This is incorrect. In fact, the "sorted" RDD should be persisted instead. Also, it should be persisted before the count action rather than after it. (Kindle version, location 3209 of 10460) Note from the Author or Editor: The persistence is indeed incorrect in Example 5-23; it should be applied to the "sorted" RDD before count is called. I've updated this in the repo, but it may take a while before the update makes it through.	Yong-Siang Shih	Jul 15, 2017	Oct 20, 2017