Errata

Errata for Spark: The Definitive Guide

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version	Location	Description	Submitted by	Date submitted
	NA Chapter 3, Subsection "Getting unique rows"	The SQL sample code is not consistent with the %scala and %python versions. It includes "distinct", which is absent in the Scala/Python example. %sql SELECT COUNT(DISTINCT ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME) FROM dfTable	Emmanuel Asimadi	Nov 18, 2017
	NA Chapter 5, Section: Aggregating to complex types	In the SQL sample code, the second column should be collect_list, as seen in the Python and Scala versions. %sql SELECT collect_set(Country), collect_set(Country) FROM dfTable	Emmanuel Asimadi	Nov 21, 2017
	NA Section 8: Spark SQL, Sub-Section: Creating Views	the GLOBAL keyword is missing from the sql statement. %sql CREATE VIEW just_usa_global AS SELECT * FROM flights WHERE dest_country_name = 'United States'	Emmanuel Asimadi	Nov 24, 2017
		in Chapter 6 "Working with Strings" -> Regular Expressions, code example has a function color_locator that incorrectly uses variable "c" when it should have been "color_string".	venki	Feb 22, 2019
Mobi	Page NA Location 7854 in Kindle Edition (in "Garbage collection tuning" section)	Instead of "43128MB" [1] you have "43, 128". [1] https://spark.apache.org/docs/latest/tuning.html	Colin Jack	Apr 26, 2019
	Ch3 Paragraph immediately following heading "Machine Learning and Advanced Analytics"	Error: Typo of word "Structured" Sentence Instance: "You can even use models trained in MLlib to make predictions in Strucutred Streaming."	McCoy Doherty	Jul 09, 2019
PDF, ePub, Mobi,	Page n/a text	Current Copy optimized streaming API. In this book, we will spend a signficant amount of time explaining these next-generation APIs, most of which Suggested "a signficant amount" should be "a significant amount"	Anonymous	Jan 04, 2021
PDF, ePub, Mobi,	Page location 309 text	Current Copy If you want to run the code locally, you can download them from the official code repository in this book as desribed at https://github.com/databricks/Spark-The-Definitive-Guide. Suggested "as desribed at" should be "as described at"	Anonymous	Jan 04, 2021
Mobi	Page n/a text	Current Copy Figure 2-10 because of optimizations in the physical execution; however, the llustration is as good of a starting point as any. This Suggested "the llustration" should be "the illustration"	Anonymous	Jan 04, 2021
PDF, ePub, Mobi,	Page n/a text	Current Copy This function converts a type in another language to its correspnding Spark representation. Suggested "its correspnding Spark" should be "its corresponding Spark"	Anonymous	Jan 04, 2021
PDF, ePub, Mobi,	Page n/a text	Current Copy not have matching keys. Spark also allows for much more sophsticated join policies in addition to equi-joins. We can even use Suggested "more sophsticated join" should be "more sophisticated join"	Anonymous	Jan 04, 2021
PDF, ePub, Mobi,	Page n/a text	Current Copy will advertise to other machines. In addition to the variables ust listed, there are also options for setting up the Spark Suggested "variables ust listed," should be "variables just listed,"	Anonymous	Jan 04, 2021
PDF, ePub, Mobi,	Page n/a text	Current Copy 4.0\| ... +----+----------------------------------------+ Advanced bucketing techniques The techniques descriubed here are the most common ways of bucketing data, but Suggested "techniques descriubed here" should be "techniques described here"	Anonymous	Jan 04, 2021
	NA Chapter 7 : Aggregations Sub section : sum	Under the sub section "sum", it is written: Another simple task is to add all the values in a row using the sum function I think it should be "values in a column", and not row.	Priyank Gupta	Aug 09, 2021
ePub	Page n/a text	TYPOS: 1. Current Copy optimized streaming API. In this book, we will spend a signficant amount of time explaining these next-generation APIs, most of which Suggested "a signficant amount" should be "a significant amount" 2. Current Copy If you want to run the code locally, you can download them from the official code repository in this book as desribed at https://github.com/databricks/Spark-The-Definitive-Guide. Suggested "as desribed at" should be "as described at" 3. Current Copy Figure 2-10 because of optimizations in the physical execution; however, the llustration is as good of a starting point as any. This Suggested "however, the llustration is as good" should be "however, the illustration is as good" 4. Current Copy This function converts a type in another language to its correspnding Spark representation. Suggested "its correspnding Spark" should be "its corresponding Spark" 5. Current Copy not have matching keys. Spark also allows for much more sophsticated join policies in addition to equi-joins. We can even use Suggested "more sophsticated join" should be "more sophisticated join" 6. Current Copy will advertise to other machines. In addition to the variables ust listed, there are also options for setting up the Spark Suggested "variables ust listed," should be "variables just listed," 7. Current Copy 4.0\| ... +----+----------------------------------------+ Advanced bucketing techniques The techniques descriubed here are the most common ways of bucketing data, but Suggested "techniques descriubed here" should be "techniques described here"	Anonymous	May 04, 2022
PDF	Page p 175 Bottom 1/3rd, 'Reading from SQL Databases'	# in Python driver = "org.sqlite.JDBC" path = "/data/flight-data/jdbc/my-sqlite.db" url = "jdbc:sqlite:" + path tablename = "flight_info" -- Probably due to updates in Spark, the above code is the source of an error when running the following line: # in Python dbDataFrame = spark.read.format("jdbc").option("url", url)\ .option("dbtable", tablename).option("driver", driver).load()	Brian Clements	Aug 09, 2022
Printed	Page 421 (MLlib in Action) Training and Evaluation, the example // in scala	Multi-line chains of transformations need to be enclosed in parentheses, for example as follows scala> val params = (new ParamGridBuilder() \| .addGrid(rForm.formula, Array( \| "lab ~ . + color:value1", \| "lab ~ . + color:value1 + color:value2")) \| .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)) \| .addGrid(lr.regParam, Array(0.1, 2.0)) \| .build() \| ) to fix the following problem on page 422: scala> val tvs = new TrainValidationSplit() scala> .setTrainRatio(0.75) // also the default. scala> .setEstimatorParamMaps(params) <console>:32: error: type mismatch; found : org.apache.spark.ml.tuning.ParamGridBuilder required: Array[org.apache.spark.ml.param.ParamMap] res11.setEstimatorParamMaps(params)	Martti Laiho	Feb 06, 2023
	1 chapter 7 section "Basics of Writing Data"	The option("mode","overwrite") in this example doesn't have any effect when I tried it on Spark 2.2. dataframe.write.format("csv") .option("mode", "OVERWRITE") .option("dateFormat", "yyyy-MM-dd") .save("path/to/file(s)") No matter what I give the value of option("mode","anyvalue"), it behaves like mode is "errorIfExists". Instead, I need to use mode("overwrite") dataframe.write.format("csv") .mode("overwrite") .option("dateFormat", "yyyy-MM-dd") .save("path/to/file(s)") Similarly, mode("append") works as expected but not option("mode",...) Thanks, Eddie	Anonymous	Jul 12, 2017
	2 Chapter 8 in section Views	1. Probably wrong word Views can be either just a saved query plan to be executed against the source table or they can be materialized which means that the results are precomputed (at the risk of going stable if the underlying table changes). I think you mean "at the risk of going stale" rather than "going stable". 2. In section Creating Views CREATE VIEW just_usa_view AS SELECT * FROM flights WHERE dest_country_name = 'United States' We can make it global by leveraging the GLOBAL keyword. CREATE VIEW just_usa_global AS SELECT * FROM flights WHERE dest_country_name = 'United States' But there is no GLOBAL keyword in the 2nd example. It's confusing because the 2 view definitions are equivalent except for view name.	Anonymous	Jul 12, 2017
	4 Chapter 8 views	1. Global view after creating a global temp view CREATE GLOBAL TEMP VIEW just_usa_global_view_temp AS SELECT * FROM flights WHERE dest_country_name = 'United States' I suggest to add a sample query because it involves a built-in object global_temp not discussed in the chapter. A sample query is just select * from global_temp.just_usa_global_view_temp 2. CASE WHEN There is a mistake of wrong case for "UNITED STATES". The query should be: SELECT CASE WHEN DEST_COUNTRY_NAME = 'United States' THEN 1 WHEN DEST_COUNTRY_NAME = 'Egypt' THEN 0 ELSE -1 END FROM partitioned_flights; 3. "For example, if we want to see whether or not we have a flight that will take you back from your destination country we could do so by checking whether or not there was a flight that had the destination country as an origin and a flight that had the origin country as a destination." It seems that the condition could be fulfilled by 2 flights instead of one flight. An example to meet this requirement "a flight that will take you back from your destination country" is a flight from "United States" to "Japan" the return flight should be from "Japan" to "United States" The conditions and the sample query are saying these 2 flights can provide a return flight for "United States" to "Japan": i. from "Japan" to "Germany" ii. from "France" to "United States" I think the query should look like: SELECT * FROM flights f1 WHERE EXISTS ( SELECT 1 FROM flights f2 WHERE f2.origin_country_name = f1.dest_country_name AND f2.dest_country_name = f1.origin_country_name); Then only a flight from "Japan" to "United States" can be a return flight of "United States" to "Japan". 4. I think the description of when to call "refresh table" is not clear unless one has background of Hive. Would you consider to elaborate under what circumstances "refresh table" is needed?	Anonymous	Jul 13, 2017
	4 chapter 12	1) section glom sc.parallelize(Seq("Hello", "World"), 2).glom().collect() I suggest using words.glom().collect() because this call in the previous section, mapPartitions, gives very nice output words.mapPartitionsWithIndex(indexedFunc).collect() res6: Array[String] = Array(Partition: 0 => Spark, Partition: 0 => The, Partition: 0 => Definitive, Partition: 0 => Guide, Partition: 0 => :, Partition: 1 => Big, Partition: 1 => Data, Partition: 1 => Processing, Partition: 1 => Made, Partition: 1 => Simple) calling words.glom().collect() gives Array[Array[String]] = Array(Array(Spark, The, Definitive, Guide, :), Array(Big, Data, Processing, Made, Simple)) The output is an Array with 2 child Arrays, one from each partition. 2) section reduceByKey scala> KVcharaters.reduceByKey(addFunc).collect() <console>:30: error: not found: value KVcharaters KVcharaters.reduceByKey(addFunc).collect() but scala> KVcharacters.reduceByKey(addFunc(_,_)).collect() res26: Array[(Char, Int)] = Array((d,4), (p,3), (t,3), (h,1), (l,1), (e,7), (a,4), (i,7), (u,1), (m,2), (b,1), (n,2), (f,1), (v,1), (:,1), (r,2), (s,4), (k,1), (o,1), (g,3), (c,1)) 3) Is it possible to elaborate on treeAggregate? Even the Spark documentation doesn't explain it well. 4) section Joins scala> sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble()))) <console>:36: error: type mismatch; found : org.apache.spark.rdd.RDD[(Char, Double)] required: Seq[?] Error occurred in an application involving default arguments. sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble()))) It's due to re-using the variable distinctChars. The nearest definition is val distinctChars = words .flatMap(word => word.toLowerCase.toSeq) .distinct which gives the error, but the previous one in section sampleByKey val distinctChars = words .flatMap(word => word.toLowerCase.toSeq) .distinct .collect() works for sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble()))) Instead of having 2 definitions of distinctChars, I fixed my working examples by using one definition of distinctChars in sampleByKey val distinctChars = words. flatMap(word => word.toLowerCase.toSeq). distinct and then adding collect() before toMap val sampleMap = distinctChars.map(c => (c, new Random().nextDouble())).collect().toMap Remove the definition of distinctChars in section Joins and define keyedChars as val keyedChars = distinctChars.map(c => (c, new Random().nextDouble()))	Anonymous	Jul 14, 2017
PDF	Page 21 NOTE	The phrase in the note on page 21 has a typo: In local mode, the driver and executurs run (as threads) on your individual computer instead of a cluster. Word: executurs	Anonymous	Feb 27, 2021
ePub	Page 23 4th paragraph	In local mode, the driver and executurs run (as threads) on your individual computer instead of a cluster. While not being an expert at all in the technology, I would guess it was meant to be written: [...] the driver and executors run [...]	Philippe Bourrel	Dec 24, 2019
PDF	Page 42 Line 2	This is from page 42 of the abbreviated book provided directly by Databricks. Actual page number in final document is probably different. GraphFrames syntax for pageRank is incorrect. PDF has ranks = stationGraph.pageRank(maxIter=10).resetProbability(0.15).run() Correct syntax is ranks = stationGraph.pageRank(maxIter=10, resetProbability=0.15)	Dave Welden	Dec 06, 2017
Printed	Page 51 3rd paragraph	"is aslightly inaccurate" should be "is slightly inaccurate"	Kye Okabe	Mar 08, 2020
ePub	Page 61 4th paragraph	"The only difference will by syntax." should be "The only difference will be syntax."	Philippe Bourrel	Jan 03, 2020
Printed	Page 109 first code snippet, Python example	In the first code snippet, the Python example reads: df.select(map(col("Description")... It should read df.select(create_map(col("Description")...	Sergio SainzPalacios	Jun 07, 2020
Printed	Page 113 Figure 6-2	The figure caption for Figure 6-2 says "Figure caption" (should be something along the lines of "Overview of the internal process when using UDFs written in Python").	Kye Okabe	Mar 08, 2020
Printed	Page 114 3rd code block	It seems that the code block is for its previous paragraph, while I cannot see any value from the code block. When you want to optionally return a value from a UDF, you should return None in Python and an Option type in Scala: ## Hive UDFs	Acan Chen	Apr 08, 2021
Printed	Page 126 Top 1/3rd	The first grouping SQL command reads: SELECT COUNT(*) FROM DfTable GROUP BY InvoiceNo, CustomerID The result shows 3 columns: InvoiceNo, CustomerID and Count. To appear in the result, a column must be listed in the SELECT statement. The correct query should be: SELECT InvoiceNo, CustomerID, COUNT(*) FROM DfTable GROUP BY InvoiceNo, CustomerID	Brad Lee HInes	Apr 30, 2021
Printed	Page 166 First, top paragraph	It reads "Although SQLite makes for a good reference example, it's probablu not". It should read "probably"	Sergio Sainz Palacios	Jun 07, 2020
Printed	Page 255 First sentence of the completion paragraph	"driver processs exits" should be "driver process exits"	Kye Okabe	Mar 09, 2020
Printed	Page 279 7th	"In addition to the variables ust listed " should read "In addition to the variables just listed"	Emmanuel Mashandudze	Nov 11, 2019
Printed	Page 281 First sentence under the three bullets	"For the most, part Spark [...]" should be "For the most part, Spark [...]"	Kye Okabe	Mar 09, 2020
Printed	Page 321 2nd paragraph, last sentence	"megatbytes" should be "megabytes"	Kye Okabe	Mar 01, 2020
Printed	Page 337 3rd paragraph	Actual sentence: "What if a machine in a sytem fails, losing some state?" Should be: "What if a machine in a system fails, losing some state?"	Anonymous	Jan 24, 2022
Printed	Page 340 last paragraph, last sentence	The sentence seems a bit odd. "[...] the streaming applications that are large-scale enough to need to distribute their computation tend to prioritize throughput [...]" Perhaps something like "[...] large-scale streaming applications that need to distribute their computation tend to prioritize throughput [...]" would sound a bit smoother.	Kye Okabe	Mar 01, 2020
PDF	Page 411 3rd code block	In Python code, the call to print lr.explainParams() is missing parentheses.	Anonymous	Mar 12, 2021
Other Digital Version	12164 Bisecting k-means Summary	This is from the Kindle version which doesn't include page numbers, so I included the "Location" instead. In the section titled "Bisecting k-means Summary" in Chapter 29 I think there is a small typo. Instead of using the bisecting k-means model to find information you use the normal k-means model that was introduced in the previous section. So I believe that: kmModel.computeCost( sales) println(" Cluster Centers: ") kmModel.clusterCenters.foreach( println) Should probably be: bkmModel.computeCost( sales) println(" Cluster Centers: ") bkmModel.clusterCenters.foreach( println)	Matthew Dabbert	Nov 04, 2019
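
Worked check for the Chapter 8 return-flight erratum (Jul 13, 2017, item 3): the sketch below is not Spark code. It assumes SQLite from the Python standard library and a tiny, made-up flights table, only to illustrate the logic the submitter describes. The "loose" query is a reconstruction of the condition as worded in the erratum (two independent EXISTS checks), not necessarily the book's exact text; the "strict" query is the submitter's suggested correction.

```python
import sqlite3

# In-memory SQLite stands in for Spark SQL here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (origin_country_name TEXT, dest_country_name TEXT)"
)
# Hypothetical rows taken from the scenario in the erratum.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [
        ("United States", "Japan"),
        ("Japan", "Germany"),
        ("France", "United States"),
    ],
)

# Two independent checks: some flight leaves the destination country AND
# some (possibly different) flight arrives in the origin country.
loose = """
SELECT origin_country_name, dest_country_name FROM flights f1
WHERE EXISTS (SELECT 1 FROM flights f2
              WHERE f2.origin_country_name = f1.dest_country_name)
  AND EXISTS (SELECT 1 FROM flights f2
              WHERE f2.dest_country_name = f1.origin_country_name)
"""

# The submitter's correction: one flight must satisfy both conditions.
strict = """
SELECT origin_country_name, dest_country_name FROM flights f1
WHERE EXISTS (SELECT 1 FROM flights f2
              WHERE f2.origin_country_name = f1.dest_country_name
                AND f2.dest_country_name = f1.origin_country_name)
"""

loose_rows = conn.execute(loose).fetchall()
strict_rows = conn.execute(strict).fetchall()
print(loose_rows)   # [('United States', 'Japan')] although no return flight exists
print(strict_rows)  # []

# Once a genuine return flight is added, the strict query matches both legs.
conn.execute("INSERT INTO flights VALUES ('Japan', 'United States')")
strict_after = sorted(conn.execute(strict).fetchall())
print(strict_after)
```

With only the three original rows, the loose form reports United States to Japan as having a return flight because Japan to Germany and France to United States each satisfy one half of the condition, which is the behavior the submitter objects to.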