Errata

Errata for Spark: The Definitive Guide

The errata list records errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version	Location	Description	Submitted By	Date submitted	Date corrected
Printed	Page 155 Last paragraph, 3rd sentence	"format is optional because by default, Spark will use the arquet format." should read "format is optional because by default, Spark will use the parquet format.". Note from the Author or Editor: This fix is correct!	Anonymous	Jan 19, 2019
Printed	Page 194 last set of SQL code on page	SELECT * FROM flights WHERE origin_country_name IN (SELECT dest_country_name FROM flights GROUP BY dest_country_name ORDER BY sum(count) DESC LIMIT 5) should actually be: SELECT * FROM flights WHERE origin_country_name IN (SELECT dest_country_name FROM flights GROUP BY dest_country_name ORDER BY sum(count) DESC) LIMIT 5 i.e. right parenthesis is in wrong place. Note from the Author or Editor: Please fix this, as described!	Jonathan Wharton	Jan 15, 2019
Printed	Page 131 first line of python code	The piece of code should clear nulls, but the .na has not been included. The line: dfNoNull = dfWithDate.drop() should be: dfNoNull = dfWithDate.na.drop() Note from the Author or Editor: I think this might have been fixed already, but if not, please fix it.	Jonathan Wharton	Jan 12, 2019
	129 SQL query of sumDistinct	The sumDistinct example in SQL format requires correction. It should be: SELECT sum(DISTINCT Quantity) FROM dfTable Note from the Author or Editor: This is correct, however it's on page 122. It should be fixed there. Searching for "SELECT sum(Quantity) FROM dfTable" will show you the right location.	Amit Kumar	Nov 15, 2018
Printed	Page 122 sumDistinct code block	The SQL statement for sumDistinct is not correct as the DISTINCT keyword is missing, it should be scala> spark.sql("""SELECT sum(DISTINCT Quantity) FROM dfTable""").show() +----------------------+ \|sum(DISTINCT Quantity)\| +----------------------+ \| 29310\| +----------------------+ Note from the Author or Editor: You're correct, we need to add the DISTINCT keyword to that SQL statement under the sumDistinct heading. It should state SELECT SUM(DISTINCT Quantity) FROM dfTable -- 29310 instead of SELECT SUM(Quantity) FROM dfTable -- 29310	Tom Geudens	Apr 24, 2018
Printed	Page 90 second code block	The describe method will actually compute statistics on almost any column, not just numeric ones. The df.describe.show() also shows results for Country and Description (string), but not for the InvoiceDate (timestamp). This is also reflected if you select these columns: scala> df.select("Description").describe().show() +-------+--------------------+ \|summary\| Description\| +-------+--------------------+ \| count\| 3098\| \| mean\| null\| \| stddev\| null\| \| min\| 4 PURPLE FLOCK D...\| \| max\|ZINC WILLIE WINKI...\| +-------+--------------------+ scala> df.select("InvoiceDate").describe().show() +-------+ \|summary\| +-------+ \| count\| \| mean\| \| stddev\| \| min\| \| max\| +-------+ Note from the Author or Editor: This may have been a more recent change because what was displayed was what was shown for me when I ran the code. In the paragraph before, let's change "all numeric columns" to just say "relevant columns". Also, after the following sentence "This will take all numeric columns and calculate the count, mean, standard deviation, min, and max." Let's add: "This schema may change over time as new types are supported, don't depend too heavily on this schema (or behavior)."	Tom Geudens	Apr 17, 2018
Printed	Page 74 Changing a Column's Type (cast)	The count-column is actually already of the LongType (which you show on page 60). So it may make more sense to cast("integer"). Note from the Author or Editor: Nice catch. I think to make this even more clear, we should change the code block and that paragraph. Let's change: For instance, let’s convert our count column from an `Integer` to a `String`: df.withColumn("count2", col("count").cast("string")) -- in SQL SELECT *, cast(count as string) AS count2 FROM dfTable	Tom Geudens	Apr 15, 2018
Printed	Page 20 3rd paragraph (Lazy Evaluation section)	In the start of the paragraph "Lazy evaulation" the word "evaluation" has a typo. Note from the Author or Editor: Yup, it does! Please correct the spelling.	Sertan Şentürk	Apr 13, 2018
Printed	Page 37 middle code block	The code blocks for both Scala and Python define a purchaseByCustomerPerHour. Which is very specific, but the window function used states window(col("InvoiceDate"), "1 day"). Now I'm not a specialist on the Spark function-set yet, but based on what I read there I would say it should be PerDay and not PerHour ? Also, using col("InvoiceDate") in one example and $"InvoiceDate" in the next without explanation is confusing (sure, they both probably mean the same, but this is page 37 ... we're not specialists yet). Note from the Author or Editor: InvoiceDate is a timestamp column so per hour is correct (but completely understand where you're coming from). As for the dollar signs, you're right - we talk about those in a later chapter but should probably properly introduce them. Sorry about that. We'll change them to col("InvoiceDate") to help with a bit more clarity at this point.	Tom Geudens	Apr 10, 2018
Printed	Page 35 Scala code at the bottom	The code is missing a "sort descending". It is implied this was present at some point, both from the import and from the results on the next page (which you only get if you apply a sort), but it is no longer in either the Scala or the Python code. The code should be this: staticDataFrame .selectExpr( "CustomerID", "(UnitPrice * Quantity) as total_cost", "InvoiceDate") .groupBy( col("CustomerID"), window(col("InvoiceDate"), "1 day")) .sum("total_cost") .withColumnRenamed("sum(total_cost)","daily_total") .sort(desc("daily_total")) .show(5) Note from the Author or Editor: Re-reading, I'm not sure exactly where the sort should be. I see your point but don't think it's 100% necessary for the point that we're getting across. I think we should just remove the sort from the import statements. import org.apache.spark.sql.functions.{window, column, desc, col} should become import org.apache.spark.sql.functions.{window, col}	Tom Geudens	Apr 10, 2018
Printed	Page 518 1st paragraph, 3rd sentence	'[...] combine motif finding with DataFarme queries [...]' should read '[...] combine motif finding with DataFrame queries [...]' Note from the Author or Editor: You are correct! We will make this change!	Elias Strehle	Apr 04, 2018
Printed	Page 462 Subsection 'Multilabel Classification', 4th sentence	'Another example of multilabel classification is identifying the number of objects that appear in an image.' This is not true: Predicting the number of objects is neither a multilabel problem (since only one number is predicted for an image) nor a classification problem (since there are infinitely many possible values). The sentence could be replaced by the following: 'Another example of multilabel classification is identifying the objects that appear in an image.' Note from the Author or Editor: The wording is a bit imprecise and I agree with your proposed correction.	Elias Strehle	Apr 03, 2018
Printed	Page 437 Subsection 'Advanced bucketing techniques', 1st sentence	'descriubed' should read 'described' Note from the Author or Editor: It should!	Elias Strehle	Apr 03, 2018
Printed	Page 381 General note	'[...] output of the dream [...]' is a lovely metaphor, but should probably read '[...] output of the stream [...]' Note from the Author or Editor: I almost want to leave it because it makes me smile. But yes, we should change this.	Elias Strehle	Apr 03, 2018
Printed	Page 378 Section 'Arbitrary Stateful Processing', 1st sentence	'The first section if this chapter [...]' should read 'The first section of this chapter [...]' Note from the Author or Editor: Indeed!	Elias Strehle	Apr 03, 2018
Printed	Page 372 2nd paragraph, code block	The code block ' spark.sql("SELECT * FROM events_per_window").printSchema() SELECT * FROM events_per_window ' contains two minor errors: 1) It should be '.show()' instead of '.printSchema()' to be consistent with the 3rd paragraph. 2) For Python, the code should reference 'pyevents_per_window' instead of 'events_per_window'. Note from the Author or Editor: Yes it should be ".show" I agree with 1). However, for 2), we had to reduce the number of code blocks. It is fine as is and we hope readers will change it accordingly.	Elias Strehle	Apr 03, 2018
Printed	Page 342 1st paragraph, 4th sentence	'[...] (all of its the windowing operators [...]' should read '[...] (all of its windowing operators [...]' or '[...] (all of the windowing operators [...]' Note from the Author or Editor: Let's change to "all of its windowing operators"	Elias Strehle	Apr 03, 2018
Printed	Page 339 1st paragraph, 2nd sentence	'[...] require deep expertise to be develop and maintain.' should read '[...] require deep expertise to be developed and maintained.' Note from the Author or Editor: Yes, let's make this change.	Elias Strehle	Apr 03, 2018
Printed	Page 336 Subsection 'Real-time decision making', 2nd sentence	The last word 'fradulent' should read 'fraudulent' Note from the Author or Editor: Yes, it should!	Elias Strehle	Apr 03, 2018
Printed	Page 276 1st and 2nd paragraph	The code block should be below the 2nd paragraph, not above, so the last sentence 'The example that follows [...]' becomes correct Note from the Author or Editor: Please change this to "The previous example configures..."	Elias Strehle	Mar 29, 2018
Printed	Page 272 3rd paragraph, 1st sentence	'When submitting applciations, [...]' should read 'When submitting applications, [...]' Note from the Author or Editor: Yes, please fix.	Elias Strehle	Mar 29, 2018
Printed	Page 257 Info box, 6th sentence	'communtiy' should read 'community' Note from the Author or Editor: Yes, please fix.	Elias Strehle	Mar 29, 2018
Printed	Page 256 2nd paragraph, last word	'Appication' should read 'Application' Note from the Author or Editor: Yes, it should!	Elias Strehle	Mar 29, 2018
Printed	Page 245 1st paragraph of subsection 'Custom Accumulators'	'In this example, you we will add [...]' should contain either 'you' or 'we', not both Note from the Author or Editor: Thanks for this feedback, we'll make the change!	Elias Strehle	Mar 28, 2018
Printed	Page 229 First paragraph in section 'Understanding Aggregation Implementations'	'We'll do these in the context of a key, but the same basic principles apply to the groupBy and reduce methods' should read 'We'll do these in the context of a key, but the same basic principles apply to the groupByValue and reduceValue methods' Note from the Author or Editor: This is probably a fair criticism if we're referring explicitly to the "method calls" instead of just the "method of implementation". We should clean these up to make sure they're consistent and your rewrite is probably a great start.	Elias Strehle	Mar 28, 2018
Printed	Page 212 Last sentence	'You get the both of best worlds.' I think the incorrect is order. Note from the Author or Editor: Absolutely. :)	Elias Strehle	Mar 28, 2018
Printed	Page 102 Last paragraph, 5th sentence	'When we declare [...] not having a null time [...]' should read 'When we declare [...] not having a null type [...]' Note from the Author or Editor: We should make this change.	Elias Strehle	Mar 28, 2018
Printed	Page 98 1st sentence after code block	'Although Spark will do read dates or times on a best-effort basis' should read 'Spark will read dates or times on a best-effort basis' Note from the Author or Editor: Read/do should be "parse" in the future. This is good feedback.	Elias Strehle	Mar 28, 2018
Printed	Page 97 7th line in code block	The 7th line in the '# in python' code block at the top of the page contains an undefined variable 'c'. This should be 'color_string' instead: '.alias("is_" + color_string)' Note from the Author or Editor: Yes, you are correct! We should make this change.	Elias Strehle	Mar 28, 2018
Printed	Page 44 Last two lines	'The only difference will by syntax' should read 'The only difference will be syntax' Note from the Author or Editor: Yes, this is correct. We should change this.	Elias Strehle	Mar 28, 2018
Printed	Page 402 3rd Paragraph	The sentence, "O'Reilly should we link to or mention any specific ones?" is left in the text. Note from the Author or Editor: Yes, we should remove this sentence.	Anonymous	Mar 27, 2018
PDF,	Page 19 Chapter 1, paragraph 3	Last word of the paragraph contains a typo: "langauge". It should be "language": "Spark Core consists of two APIs. The Unstructured and Structured APIs. The Unstructured API is Spark’s lower level set of APIs including Resilient Distributed Datasets (RDDs), Accumulators, and Broadcast variables. The Structured API consists of DataFrames, Datasets, Spark SQL and is the interface that most users should use. The difference between the two is that one is optimized to work with structured data in a spreadsheet-like interface while the other is meant for manipulation of raw java objects. Outside of Spark Core sit a variety of tools, libraries, and languages like MLlib for performing machine learning, the GraphX module for performing graph processing, and SparkR for working with Spark clusters from the R langauge." Note from the Author or Editor: Fixed for first release.	Anonymous	Jan 18, 2018	Feb 08, 2018
	I "Who This Book is For" section	There is a typo of "efficienly". The correct word is "efficiently". Note from the Author or Editor: Fixed for first release.	Keiji Yoshida	Jan 12, 2018	Feb 08, 2018
PDF,	Page cover	The cover of the 1st edition still says it's an "Early Release".	Harald Gegenfurtner	Dec 31, 2017	Feb 08, 2018
	na Chapter 5, Section: Aggregating to complex types	repeated. A cube takes the rollup takes a rollup to a level deeper. Note from the Author or Editor: Fixed for first release.	Emmanuel Asimadi	Nov 22, 2017	Feb 08, 2018
	NA Chapter 3, Subsection "Creating Dataframes"	Probably should be "encounter" instead of "encourage". "With these three tools, you should be able to solve the vast majority of transformation challenges that you may encourage in DataFrames." Note from the Author or Editor: Fixed for first release.	Emmanuel Asimadi	Nov 18, 2017	Feb 08, 2018
	NA Chapter 3, Subsection "Creating Row"	The return type for below should be Int instead of string. myRow.getInt(2) // String Note from the Author or Editor: Fixed for first release.	Emmanuel Asimadi	Nov 18, 2017	Feb 08, 2018
	NA subsection Columns, 2nd Paragraph	There seems to be a typo in Chapter 3, subsection Columns, Paragraph 2: "this column may or may not exist in our of our DataFrames." probably should be "this column may or may not exist in our DataFrames." instead. Note from the Author or Editor: Fixed for first release.	Emmanuel Asimadi	Nov 18, 2017	Feb 08, 2018
PDF,	chapter 1, 8th paragraph	Chapter 1, Paragraph 8 (from the ebook): 1) "The last piece relevant piece for us is the cluster manager." Looks like a grammar mistake: "piece" is repeated. 2) Typo on "appliications". Note from the Author or Editor: Fixed for first release.	Saad Khawaja	Oct 17, 2017	Feb 08, 2018
	1 Chapter 16	chapter 15 and chapter 16 have the same content on Safari Books Online early release https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/ Here is chapter 15: https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/ch15.html Here is chapter 16: https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/ch16.html Note from the Author or Editor: Fixed for first release.	Anonymous	Jul 16, 2017	Feb 08, 2018
PDF,	Page 61 Last Section	Hi, There is a Typo error in the first line on Pg-61 under section "Creating Rows". Original: You can create rows by manually instantiating a Row object with the values that below in each column. Rectified: You can create rows by manually instantiating a Row object with the values that belong in each column. Thanks, Manish Bahrani Note from the Author or Editor: This typo was fixed for the first release.	Manish Bahrani	Jul 05, 2017	Feb 08, 2018
PDF,	Page 30 Last Paragraph (scala version of code)	On Page-30, below is the original scala version of code - %scala purchaseByCustomerPerHour.writeStream .format(“memory”) // memory = store in-memory table .queryName(“customer_purchases”) // counts = name of the in-memory table .outputMode(“complete”) // complete = all the counts should be in the table .start() On 4th line, the comments for ".queryName() method" - Original: // counts = name of the in-memory table Rectified: // customer_purchases = name of the in-memory table Thanks, Manish Bahrani Note from the Author or Editor: This was fixed for the first release.	Manish Bahrani	Jul 05, 2017	Feb 08, 2018
PDF,	Page 10 4th paragraph	PDF has "ight" instead of "might" in the paragraph describing Lazy Evaluation. 5th sentence has --- An example of this ight be “predicate pushdown” I suppose it should be --- An example of this might be “predicate pushdown” Note from the Author or Editor: Fixed for first release.	Pradeep Nalabalapu	Jun 07, 2017	Feb 08, 2018
	1 First chapter (Safari Books Online), in the "A Basic Transformation Data Flow" section, under Figure-9.	Hi, Comma-separated values misspelt as "comma seperated value". Paragraph: "Now hopefully you have grasped the basics but let’s just reinforce some of the core concepts with another data pipeline. We’re going to be using the same flight data used except that this time we’ll be using a copy of the data in comma seperated value (CSV) format." Note from the Author or Editor: Fixed for first release.	Simon Bensoussan	Mar 27, 2017	Feb 08, 2018
	1 Chapter 1, under the "Spark Applications" header, just before Figure 1-1	Hi, Thank you for such a great resource on Spark. There is just a little typo on the first chapter, under the "Spark Applications" header, just before Figure 1-1 (read on Safari Books Online), where applications is misspelt as "appliications" (see below, last sentence). "The last piece relevant piece for us is the cluster manager. The cluster manager controls physical machines and allocates resources to Spark applications. This can be one of several core cluster managers: Spark’s standalone cluster manager, YARN, or Mesos. This means that there can be multiple Spark appliications running on a cluster at the same time." Best, Simon Bensoussan Note from the Author or Editor: Fixed for first release.	Simon Bensoussan	Mar 27, 2017	Feb 08, 2018
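
Note: two of the SQL errata above (pages 122 and 194) turn on small syntactic differences that change what a query returns. The sketch below tries to make both differences visible. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL, purely so that it runs without a cluster. All rows are invented sample data, and SQLite is assumed to behave like Spark SQL only for these basic constructs.

```python
import sqlite3

# In-memory SQLite database standing in for Spark SQL (an assumption made
# so the sketch runs anywhere; every row below is invented sample data).
conn = sqlite3.connect(":memory:")

# --- Page 122: sum(Quantity) versus sum(DISTINCT Quantity) ---
conn.execute("CREATE TABLE dfTable (Quantity INTEGER)")
conn.executemany(
    "INSERT INTO dfTable VALUES (?)", [(1,), (1,), (2,), (3,), (3,)]
)
plain_sum = conn.execute("SELECT sum(Quantity) FROM dfTable").fetchone()[0]
distinct_sum = conn.execute(
    "SELECT sum(DISTINCT Quantity) FROM dfTable"
).fetchone()[0]
print(plain_sum, distinct_sum)  # 10 6

# --- Page 194: LIMIT 5 inside versus outside the subquery ---
conn.execute(
    "CREATE TABLE flights ("
    "dest_country_name TEXT, origin_country_name TEXT, count INTEGER)"
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [
        ("A", "B", 60),
        ("B", "A", 50),
        ("C", "A", 40),
        ("D", "C", 30),
        ("E", "D", 20),
        ("F", "E", 10),
        ("A", "F", 5),
        ("B", "F", 4),
    ],
)

# Destinations ranked by total count; shared by both query variants.
subquery = (
    "SELECT dest_country_name FROM flights "
    "GROUP BY dest_country_name ORDER BY sum(count) DESC"
)

# LIMIT inside: keep flights whose origin is one of the top 5 destinations.
limit_inside = conn.execute(
    f"SELECT * FROM flights WHERE origin_country_name IN ({subquery} LIMIT 5)"
).fetchall()

# LIMIT outside: match against every destination, then return at most 5 rows.
limit_outside = conn.execute(
    f"SELECT * FROM flights WHERE origin_country_name IN ({subquery}) LIMIT 5"
).fetchall()

print(len(limit_inside), len(limit_outside))  # 6 5
```

With this sample data, the LIMIT inside the subquery decides which countries qualify (country F, ranked sixth, is excluded), whereas the LIMIT outside only caps the number of rows returned. The two placements therefore answer different questions, so it may be worth checking which one a given query intends.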