Errata

Spark: The Definitive Guide

Errata for Spark: The Definitive Guide


The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version Location Description Submitted by Date submitted
NA
Chapter 3, Subsection "Getting unique rows"

The SQL sample code is not consistent with %scala & %python versions. It includes "distinct" which is absent in the scala/python example.

%sql
SELECT COUNT(DISTINCT ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)FROM dfTable

Emmanuel Asimadi  Nov 18, 2017 
NA
Chapter 5, Section: Aggregating to complex types

In the SQL sample code, the second column should be collect_list, as seen in the Python and Scala versions.

%sql

SELECT
collect_set(Country),
collect_set(Country)
FROM
dfTable

Emmanuel Asimadi  Nov 21, 2017 
NA
Section 8: Spark SQL, Sub-Section: Creating Views

The GLOBAL keyword is missing from the SQL statement.

%sql
CREATE VIEW just_usa_global AS SELECT * FROM flights WHERE dest_country_name = 'United States'

Emmanuel Asimadi  Nov 24, 2017 

In Chapter 6, "Working with Strings" -> Regular Expressions, the code example has a function color_locator that incorrectly uses the variable "c" where it should use "color_string".

venki  Feb 22, 2019 
Mobi Page NA
Location 7854 in Kindle Edition (in "Garbage collection tuning" section)

Instead of "4*3*128MB" [1] you have "43, 128".


[1]
https://spark.apache.org/docs/latest/tuning.html

Colin Jack  Apr 26, 2019 
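
For reference, the arithmetic behind the "4*3*128MB" expression in the entry above can be spelled out as follows. This is only a sketch: the variable names are hypothetical, and the meaning given to each factor follows the HDFS example in the Spark tuning guide linked in the entry (working space for about 4 tasks, a decompressed block about 3 times the block size, and a 128 MB block).

```python
# Worked arithmetic for the Eden size estimate "4*3*128MB".
# Variable names are illustrative assumptions, not taken from the book.
tasks_of_working_space = 4   # how many tasks' worth of working space is wanted
decompression_factor = 3     # a decompressed block is often 2 to 3x its stored size
hdfs_block_mb = 128          # HDFS block size in MB

eden_mb = tasks_of_working_space * decompression_factor * hdfs_block_mb
print(f"Estimated Eden size: {eden_mb} MB")
```

Read as a product, the expression gives 1536 MB, which is why rendering it as "43, 128" loses the meaning.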
Ch3
Paragraph immediately following heading "Machine Learning and Advanced Analytics"

Error: Typo of word "Structured"

Sentence Instance: "You can even use models trained in MLlib to make predictions in Strucutred Streaming."

McCoy Doherty  Jul 09, 2019 
PDF, ePub, Mobi, Page n/a
text

Current Copy
optimized streaming API. In this book, we will spend a signficant amount of time explaining these next-generation APIs, most of which

Suggested
"a signficant amount" should be "a significant amount"

Anonymous  Jan 04, 2021 
PDF, ePub, Mobi, Page location 309
text

Current Copy
If you want to run the code locally, you can download them from the official code repository in this book as desribed at https://github.com/databricks/Spark-The-Definitive-Guide.

Suggested
"as desribed at" should be "as described at"

Anonymous  Jan 04, 2021 
Mobi Page n/a
text

Current Copy
Figure 2-10 because of optimizations in the physical execution; however, the llustration is as good of a starting point as any. This

Suggested
"the llustration" should be "the illustration"

Anonymous  Jan 04, 2021 
PDF, ePub, Mobi, Page n/a
text

Current Copy
This function converts a type in another language to its correspnding Spark representation.

Suggested
"its correspnding Spark" should be "its corresponding Spark"

Anonymous  Jan 04, 2021 
PDF, ePub, Mobi, Page n/a
text

Current Copy
not have matching keys. Spark also allows for much more sophsticated join policies in addition to equi-joins. We can even use

Suggested
"more sophsticated join" should be "more sophisticated join"

Anonymous  Jan 04, 2021 
PDF, ePub, Mobi, Page n/a
text

Current Copy
will advertise to other machines. In addition to the variables ust listed, there are also options for setting up the Spark

Suggested
"variables ust listed," should be "variables just listed,"

Anonymous  Jan 04, 2021 
PDF, ePub, Mobi, Page n/a
text

Current Copy
4.0| ... +----+----------------------------------------+ Advanced bucketing techniques The techniques descriubed here are the most common ways of bucketing data, but


Suggested
"techniques descriubed here" should be "techniques described here"

Anonymous  Jan 04, 2021 
NA
Chapter 7 : Aggregations Sub section : sum

Under the sub section "sum", it is written:
Another simple task is to add all the values in a row using the sum function

I think it should be "values in a column", and not row.

Priyank Gupta  Aug 09, 2021 
ePub Page n/a
text

TYPOS:

1.
Current Copy
optimized streaming API. In this book, we will spend a signficant amount of time explaining these next-generation APIs, most of which

Suggested
"a signficant amount" should be "a significant amount"


2.
Current Copy
If you want to run the code locally, you can download them from the official code repository in this book as desribed at https://github.com/databricks/Spark-The-Definitive-Guide.

Suggested
"as desribed at" should be "as described at"

3.
Current Copy
Figure 2-10 because of optimizations in the physical execution; however, the llustration is as good of a starting point as any. This

Suggested
"however, the llustration is as good" should be "however, the illustration is as good"


4.
Current Copy
This function converts a type in another language to its correspnding Spark representation.

Suggested
"its correspnding Spark" should be "its corresponding Spark"

5.
Current Copy
not have matching keys. Spark also allows for much more sophsticated join policies in addition to equi-joins. We can even use

Suggested
"more sophsticated join" should be "more sophisticated join"

6.
Current Copy
will advertise to other machines. In addition to the variables ust listed, there are also options for setting up the Spark

Suggested
"variables ust listed," should be "variables just listed,"

7.
Current Copy
4.0| ... +----+----------------------------------------+ Advanced bucketing techniques The techniques descriubed here are the most common ways of bucketing data, but

Suggested
"techniques descriubed here" should be "techniques described here"

Anonymous  May 04, 2022 
PDF Page p 175
Bottom 1/3rd, 'Reading from SQL Databases'

# in Python
driver = "org.sqlite.JDBC"
path = "/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
tablename = "flight_info"


-- Probably due to updates in Spark, the above code is the source of an error when running the following line:

# in Python
dbDataFrame = spark.read.format("jdbc").option("url", url)\
.option("dbtable", tablename).option("driver", driver).load()

Brian Clements  Aug 09, 2022 
Printed Page page 421 (MLlib in Action)
Training and Evaluation, the example // in scala

Multi-line chains of transformations need to be enclosed in parentheses, for example as follows:
scala> val params = (new ParamGridBuilder()
| .addGrid(rForm.formula, Array(
| "lab ~ . + color:value1",
| "lab ~ . + color:value1 + color:value2"))
| .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
| .addGrid(lr.regParam, Array(0.1, 2.0))
| .build()
| )

to fix the following problem on page 422:
scala> val tvs = new TrainValidationSplit()
scala> .setTrainRatio(0.75) // also the default.
scala> .setEstimatorParamMaps(params)
<console>:32: error: type mismatch;
found : org.apache.spark.ml.tuning.ParamGridBuilder
required: Array[org.apache.spark.ml.param.ParamMap]
res11.setEstimatorParamMaps(params)

Martti Laiho  Feb 06, 2023 
1
chapter 7 section "Basics of Writing Data"

The option("mode","overwrite") in this example had no effect when I tried it on Spark 2.2.

dataframe.write.format("csv")
.option("mode", "OVERWRITE")
.option("dateFormat", "yyyy-MM-dd")
.save("path/to/file(s)")

No matter what I give the value of option("mode","anyvalue"), it behaves like mode is "errorIfExists".

Instead, I need to use mode("overwrite"):
dataframe.write.format("csv")
.mode("overwrite")
.option("dateFormat", "yyyy-MM-dd")
.save("path/to/file(s)")

Similarly, mode("append") works as expected, but option("mode",...) does not.

Thanks,
Eddie

Anonymous  Jul 12, 2017 
2
Chapter 8 in section Views

1. Probably wrong word
Views can be either just a saved query plan to be executed against the source table or they can be materialized which means that the results are precomputed (at the risk of going stable if the underlying table changes).

I think you mean "at the risk of going stale" rather than "going stable".

2. In section Creating Views
CREATE VIEW just_usa_view AS
SELECT *
FROM flights
WHERE dest_country_name = 'United States'

We can make it global by leveraging the GLOBAL keyword.

CREATE VIEW just_usa_global AS
SELECT *
FROM flights
WHERE dest_country_name = 'United States'

But there is no GLOBAL keyword in the 2nd example. It's confusing because the 2 view definitions are equivalent except for view name.

Anonymous  Jul 12, 2017 
4
Chapter 8 views

1. Global view
After creating a global temp view:
CREATE GLOBAL TEMP VIEW just_usa_global_view_temp AS
SELECT *
FROM flights
WHERE dest_country_name = 'United States'

I suggest adding a sample query, because querying the view involves the built-in object global_temp, which is not discussed in the chapter. A sample query is just

select * from global_temp.just_usa_global_view_temp

2. CASE WHEN
The string "UNITED STATES" is in the wrong case. The query should be:
SELECT
CASE WHEN DEST_COUNTRY_NAME = 'United States' THEN 1
WHEN DEST_COUNTRY_NAME = 'Egypt' THEN 0
ELSE -1 END
FROM
partitioned_flights;

3. "For example, if we want to see whether or not we have a flight that will take you back from your destination country we could do so by checking whether or not there was a flight that had the destination country as an origin and a flight that had the origin country as a destination."

It seems that this condition could be fulfilled by two different flights instead of a single flight. An example that meets the requirement "a flight that will take you back from your destination country" is
a flight from "United States" to "Japan"
the return flight should be from "Japan" to "United States"

The conditions and the sample query are saying these 2 flights can provide a return flight for "United States" to "Japan":
i. from "Japan" to "Germany"
ii. from "France" to "United States"

I think the query should look like:
SELECT *
FROM flights f1
WHERE EXISTS (
SELECT 1
FROM flights f2
WHERE f2.origin_country_name = f1.dest_country_name AND
f2.dest_country_name = f1.origin_country_name);

Then only a flight from "Japan" to "United States" can be a return flight of "United States" to "Japan".

4. I think the description of when to call "refresh table" is not clear unless one has a background in Hive. Would you consider elaborating on the circumstances under which "refresh table" is needed?

Anonymous  Jul 13, 2017 
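
The return-flight point in item 3 of the entry above can be checked on a tiny table. The sketch below uses Python's built-in sqlite3 module rather than Spark SQL, and the three sample rows are the flights named in the entry. The "loose" query is an assumed reconstruction of the two independent conditions the entry describes, not a copy of the book's query; the "strict" query is the one suggested in the entry.

```python
import sqlite3

# In-memory stand-in for the flights table (SQLite, not Spark SQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (origin_country_name TEXT, dest_country_name TEXT)"
)
conn.executemany("INSERT INTO flights VALUES (?, ?)", [
    ("United States", "Japan"),
    ("Japan", "Germany"),
    ("France", "United States"),
])

# Two independent EXISTS conditions: each may be satisfied by a different row.
LOOSE = """
SELECT origin_country_name, dest_country_name FROM flights f1
WHERE EXISTS (SELECT 1 FROM flights f2
              WHERE f2.origin_country_name = f1.dest_country_name)
  AND EXISTS (SELECT 1 FROM flights f2
              WHERE f2.dest_country_name = f1.origin_country_name)
"""

# One EXISTS with both conditions: a single row must be the reverse flight.
STRICT = """
SELECT origin_country_name, dest_country_name FROM flights f1
WHERE EXISTS (SELECT 1 FROM flights f2
              WHERE f2.origin_country_name = f1.dest_country_name
                AND f2.dest_country_name = f1.origin_country_name)
"""

loose_rows = conn.execute(LOOSE).fetchall()
strict_rows = conn.execute(STRICT).fetchall()
print("loose :", loose_rows)
print("strict:", strict_rows)

# Adding the real return flight makes the strict query match both directions.
conn.execute("INSERT INTO flights VALUES ('Japan', 'United States')")
strict_after = sorted(conn.execute(STRICT).fetchall())
print("strict after adding Japan -> United States:", strict_after)
```

With only the three original rows, the loose form reports "United States" to "Japan" as having a return flight even though no flight from "Japan" to "United States" exists, while the strict form returns nothing until that flight is added.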
4
chapter 12

1) section glom
sc.parallelize(Seq("Hello", "World"), 2).glom().collect()

I suggest using
words.glom().collect()

because this call in the previous section, mapPartitions, gives very nice output:

words.mapPartitionsWithIndex(indexedFunc).collect()
res6: Array[String] = Array(Partition: 0 => Spark, Partition: 0 => The, Partition: 0 => Definitive, Partition: 0 => Guide, Partition: 0 => :, Partition: 1 => Big, Partition: 1 => Data, Partition: 1 => Processing, Partition: 1 => Made, Partition: 1 => Simple)

calling words.glom().collect() gives
Array[Array[String]] = Array(Array(Spark, The, Definitive, Guide, :), Array(Big, Data, Processing, Made, Simple))

The output is an Array with two child Arrays, one from each partition.

2) section reduceByKey

scala> KVcharaters.reduceByKey(addFunc).collect()
<console>:30: error: not found: value KVcharaters
KVcharaters.reduceByKey(addFunc).collect()

but
scala> KVcharacters.reduceByKey(addFunc(_,_)).collect()
res26: Array[(Char, Int)] = Array((d,4), (p,3), (t,3), (h,1), (l,1), (e,7), (a,4), (i,7), (u,1), (m,2), (b,1), (n,2), (f,1), (v,1), (:,1), (r,2), (s,4), (k,1), (o,1), (g,3), (c,1))

3) Is it possible to elaborate on treeAggregate? Even the Spark documentation doesn't explain it well.

4) section Joins

scala> sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble())))
<console>:36: error: type mismatch;
found : org.apache.spark.rdd.RDD[(Char, Double)]
required: Seq[?]
Error occurred in an application involving default arguments.
sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble())))

It's due to re-using the variable distinctChars. The nearest definition is
val distinctChars = words
.flatMap(word => word.toLowerCase.toSeq)
.distinct

which gives the error but the previous one in section sampleByKey
val distinctChars = words
.flatMap(word => word.toLowerCase.toSeq)
.distinct
.collect()

works for sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble())))

Instead of having 2 definitions of distinctChars, I fixed my working examples by using one definition of distinctChars in sampleByKey
val distinctChars = words.
flatMap(word => word.toLowerCase.toSeq).
distinct

and then adding collect() before toMap
val sampleMap = distinctChars.map(c => (c, new Random().nextDouble())).collect().toMap

Then remove the definition of distinctChars in section Joins and define keyedChars as
val keyedChars = distinctChars.map(c => (c, new Random().nextDouble()))

Anonymous  Jul 14, 2017 
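
Items 1 and 2 of the entry above can be followed without a Spark session using a plain-Python model of the two operations. This is a sketch under stated assumptions: split_into_partitions and reduce_by_key are hypothetical helpers that imitate what glom and reduceByKey return, the word list is taken from the mapPartitionsWithIndex output quoted in the entry, and KVcharacters is assumed to hold one (character, 1) pair per lowercased character.

```python
# Plain-Python model of glom and reduceByKey (no Spark involved).
words = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")


def split_into_partitions(items, num_partitions):
    """Cut a list into contiguous, roughly equal slices (hypothetical helper)."""
    n = len(items)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


# glom-like result: one inner list per partition.
glommed = split_into_partitions(words, 2)
print(glommed)


def add_func(left, right):
    """Combine two values that share a key."""
    return left + right


def reduce_by_key(pairs, func):
    """Fold the values of each key together with func (hypothetical helper)."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


# Assumed shape of KVcharacters: (character, 1) for every lowercased character.
kv_characters = [(ch, 1) for word in words for ch in word.lower()]
counts = reduce_by_key(kv_characters, add_func)
print(sorted(counts.items()))
```

The per-character totals agree with the res26 output quoted in the entry (for example 7 for "e" and "i", 4 for "d"), although a real RDD returns the pairs in no particular order.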
PDF Page 21
NOTE

The phrase in the note on page 21 has a typo:

In local mode,
the driver and executurs run (as threads) on your individual computer instead of a cluster.

Word: executurs

Anonymous  Feb 27, 2021 
ePub Page 23
4th paragraph

In local mode, the driver and executurs run (as threads) on your individual computer instead of a cluster.

While I am not at all an expert in the technology, I would guess it was meant to read:

[...] the driver and executors run [...]

Philippe Bourrel  Dec 24, 2019 
PDF Page 42
Line 2

This is from page 42 of the abbreviated book provided directly by Databricks. Actual page number in final document is probably different.

GraphFrames syntax for pageRank is incorrect.

PDF has
ranks = stationGraph.pageRank(maxIter=10).resetProbability(0.15).run()

Correct syntax is
ranks = stationGraph.pageRank(maxIter=10, resetProbability=0.15)

Dave Welden  Dec 06, 2017 
Printed Page 51
3rd paragraph

"is aslightly inaccurate" should be "is slightly inaccurate"

Kye Okabe  Mar 08, 2020 
ePub Page 61
4th paragraph

"The only difference will by syntax."

Philippe Bourrel  Jan 03, 2020 
Printed Page 109
first code snippet, for the python's example, it reads: df.select(map(col("Description")...

In the first code snippet, the Python example reads: df.select(map(col("Description")...

It should read
df.select(create_map(col("Description")...

Sergio SainzPalacios  Jun 07, 2020 
Printed Page 113
Figure 6-2

The figure caption for Figure 6-2 says "Figure caption" (should be something along the lines of "Overview of the internal process when using UDFs written in Python").

Kye Okabe  Mar 08, 2020 
Printed Page 114
3rd code block

It seems that the code block is for its previous paragraph, while I cannot see any value from the code block.

When you want to optionally return a value from a UDF, you should return None in Python and an Option type in Scala:
## Hive UDFs

Acan Chen  Apr 08, 2021 
Printed Page 126
Top 1/3rd

The first grouping SQL command reads:

SELECT
COUNT(*)
FROM
DfTable
GROUP BY
InvoiceNo, CustomerID

The result shows 3 columns: InvoiceNo, CustomerID and Count.
To display a result, a column must be listed in a select statement. The correct query should be:

SELECT
InvoiceNo, CustomerID, COUNT(*)
FROM
DfTable
GROUP BY
InvoiceNo, CustomerID

Brad Lee HInes  Apr 30, 2021 
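
The difference between the two GROUP BY queries in the entry above is easy to see on a small table. The sketch below runs both through Python's built-in sqlite3 module rather than Spark SQL; the rows are made up for illustration and the ORDER BY is added only to make the printed result stable.

```python
import sqlite3

# In-memory stand-in for dfTable with made-up rows (SQLite, not Spark SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dfTable (InvoiceNo TEXT, CustomerID INTEGER)")
conn.executemany("INSERT INTO dfTable VALUES (?, ?)", [
    ("536365", 17850),
    ("536365", 17850),
    ("536366", 17850),
    ("536367", 13047),
])

# Query as printed: only the count is selected, so only one column comes back.
counts_only = conn.execute(
    "SELECT COUNT(*) FROM dfTable GROUP BY InvoiceNo, CustomerID"
).fetchall()

# Suggested query: the grouping keys are selected alongside the count.
with_keys = conn.execute(
    "SELECT InvoiceNo, CustomerID, COUNT(*) FROM dfTable "
    "GROUP BY InvoiceNo, CustomerID ORDER BY InvoiceNo"
).fetchall()

print(sorted(counts_only))
print(with_keys)
```

The first result has one column per group and no way to tell which group a count belongs to; the second has the three columns shown in the book's output.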
Printed Page 166
First, top paragraph

It reads
"Although SQLite makes for a good reference example, it's probablu not".

It should read "probably"

Sergio Sainz Palacios  Jun 07, 2020 
Printed Page 255
First sentence of the completion paragraph

"driver processs exits" should be "driver process exits"

Kye Okabe  Mar 09, 2020 
Printed Page 279
7th

"In addition to the variables ust listed " should read "In addition to the variables just listed"

Emmanuel Mashandudze  Nov 11, 2019 
Printed Page 281
First sentence under the three bullets

"For the most, part Spark [...]" should be "For the most part, Spark [...]"

Kye Okabe  Mar 09, 2020 
Printed Page 321
2nd paragraph, last sentence

"megatbytes" should be "megabytes"

Kye Okabe  Mar 01, 2020 
Printed Page 337
3rd paragraph

Actual sentence: "What if a machine in a sytem fails, losing some state?"

Should be: "What if a machine in a system fails, losing some state?"

Anonymous  Jan 24, 2022 
Printed Page 340
last paragraph, last sentence

The sentence seems a bit odd.

"[...] the streaming applications that are large-scale enough to need to distribute their computation tend to prioritize throughput [...]"

Perhaps something like

"[...] large-scale streaming applications that need to distribute their computation tend to prioritize throughput [...]"

would sound a bit smoother.

Kye Okabe  Mar 01, 2020 
PDF Page 411
3rd code block

In the Python code, the call print lr.explainParams() is missing the parentheses around print's argument; it should read print(lr.explainParams()).

Anonymous  Mar 12, 2021 
Other Digital Version 12164
Bisecting k-means Summary

This is from the Kindle version which doesn't include page numbers, so I included the "Location" instead.

In the section titled "Bisecting k-means Summary" in Chapter 29 I think there is a small typo. Instead of using the bisecting k-means model to find information you use the normal k-means model that was introduced in the previous section.

So I believe that:

kmModel.computeCost( sales)
println(" Cluster Centers: ")
kmModel.clusterCenters.foreach( println)

Should probably be:

bkmModel.computeCost( sales)
println(" Cluster Centers: ")
bkmModel.clusterCenters.foreach( println)

Matthew Dabbert  Nov 04, 2019