This errata list collects errors, and their corrections, found after the product was released.
The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.
Version | Location | Description | Submitted by | Date submitted
NA
Chapter 3, Subsection "Getting unique rows" |
The SQL sample code is not consistent with the %scala and %python versions: it includes DISTINCT, which is absent from the Scala/Python examples.
%sql
SELECT COUNT(DISTINCT ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)FROM dfTable
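For reference, a plain-Python sketch (not code from the book) of what COUNT(DISTINCT a, b) computes, namely the number of unique value pairs across the two columns:

```python
# Plain-Python sketch (not from the book): COUNT(DISTINCT origin, dest)
# counts the unique (origin, dest) pairs, which is what the SQL above returns.
rows = [
    ("United States", "Romania"),
    ("United States", "Romania"),  # duplicate pair, counted once
    ("United States", "Ireland"),
]
distinct_pairs = len(set(rows))
print(distinct_pairs)  # → 2
```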
|
Emmanuel Asimadi |
Nov 18, 2017 |
|
NA
Chapter 5, Section: Aggregating to complex types |
SQL sample code: the second column should be collect_list, as seen in the Python and Scala versions.
%sql
SELECT
collect_set(Country),
collect_set(Country)
FROM
dfTable
|
Emmanuel Asimadi |
Nov 21, 2017 |
|
NA
Section 8: Spark SQL, Sub-Section: Creating Views |
The GLOBAL keyword is missing from the SQL statement.
%sql
CREATE VIEW just_usa_global AS SELECT * FROM flights WHERE dest_country_name = 'United States'
|
Emmanuel Asimadi |
Nov 24, 2017 |
|
|
In Chapter 6, "Working with Strings" -> Regular Expressions, the code example has a function color_locator that incorrectly uses the variable "c" where it should have used "color_string".
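A hedged plain-Python analog (not the book's Scala) of the reported fix: the column alias must be built from the color_string parameter, not from an unrelated "c".

```python
# Illustrative plain-Python analog of the book's Scala color_locator
# (not the original code): the alias uses the color_string parameter,
# which is the fix the erratum suggests.
def color_locator(text, color_string):
    # Returns (column_alias, found_flag), mirroring
    # locate(color_string.toUpperCase, column).alias("is_" + color_string)
    return ("is_" + color_string, color_string.upper() in text.upper())

print(color_locator("WHITE HANGING HEART T-LIGHT HOLDER", "white"))
# → ('is_white', True)
```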
|
venki |
Feb 22, 2019 |
Mobi |
Page NA
Location 7854 in Kindle Edition (in "Garbage collection tuning" section) |
Instead of "4*3*128MB" [1] you have "43, 128".
[1] https://spark.apache.org/docs/latest/tuning.html
|
Colin Jack |
Apr 26, 2019 |
|
Ch3
Paragraph immediately following heading "Machine Learning and Advanced Analytics" |
Error: Typo of word "Structured"
Sentence Instance: "You can even use models trained in MLlib to make predictions in Strucutred Streaming."
|
McCoy Doherty |
Jul 09, 2019 |
PDF, ePub, Mobi |
Page n/a
text |
Current Copy
optimized streaming API. In this book, we will spend a signficant amount of time explaining these next-generation APIs, most of which
Suggested
"a signficant amount" should be "a significant amount"
|
Anonymous |
Jan 04, 2021 |
PDF, ePub, Mobi |
Page location 309
text |
Current Copy
If you want to run the code locally, you can download them from the official code repository in this book as desribed at https://github.com/databricks/Spark-The-Definitive-Guide.
Suggested
"as desribed at" should be "as described at"
|
Anonymous |
Jan 04, 2021 |
Mobi |
Page n/a
text |
Current Copy
Figure 2-10 because of optimizations in the physical execution; however, the llustration is as good of a starting point as any. This
Suggested
"the llustration" should be "the illustration"
|
Anonymous |
Jan 04, 2021 |
PDF, ePub, Mobi |
Page n/a
text |
Current Copy
This function converts a type in another language to its correspnding Spark representation.
Suggested
"its correspnding Spark" should be "its corresponding Spark"
|
Anonymous |
Jan 04, 2021 |
PDF, ePub, Mobi |
Page n/a
text |
Current Copy
not have matching keys. Spark also allows for much more sophsticated join policies in addition to equi-joins. We can even use
Suggested
"more sophsticated join" should be "more sophisticated join"
|
Anonymous |
Jan 04, 2021 |
PDF, ePub, Mobi |
Page n/a
text |
Current Copy
will advertise to other machines. In addition to the variables ust listed, there are also options for setting up the Spark
Suggested
"variables ust listed," should be "variables just listed,"
|
Anonymous |
Jan 04, 2021 |
PDF, ePub, Mobi |
Page n/a
text |
Current Copy
4.0| ... +----+----------------------------------------+ Advanced bucketing techniques The techniques descriubed here are the most common ways of bucketing data, but
Suggested
"techniques descriubed here" should be "techniques described here"
|
Anonymous |
Jan 04, 2021 |
|
NA
Chapter 7: Aggregations, subsection: sum |
Under the subsection "sum", it is written:
Another simple task is to add all the values in a row using the sum function
I think it should be "values in a column", not "values in a row".
|
Priyank Gupta |
Aug 09, 2021 |
ePub |
Page n/a
text |
TYPOS:
1.
Current Copy
optimized streaming API. In this book, we will spend a signficant amount of time explaining these next-generation APIs, most of which
Suggested
"a signficant amount" should be "a significant amount"
2.
Current Copy
If you want to run the code locally, you can download them from the official code repository in this book as desribed at https://github.com/databricks/Spark-The-Definitive-Guide.
Suggested
"as desribed at" should be "as described at"
3.
Current Copy
Figure 2-10 because of optimizations in the physical execution; however, the llustration is as good of a starting point as any. This
Suggested
"however, the llustration is as good" should be "however, the illustration is as good"
4.
Current Copy
This function converts a type in another language to its correspnding Spark representation.
Suggested
"its correspnding Spark" should be "its corresponding Spark"
5.
Current Copy
not have matching keys. Spark also allows for much more sophsticated join policies in addition to equi-joins. We can even use
Suggested
"more sophsticated join" should be "more sophisticated join"
6.
Current Copy
will advertise to other machines. In addition to the variables ust listed, there are also options for setting up the Spark
Suggested
"variables ust listed," should be "variables just listed,"
7.
Current Copy
4.0| ... +----+----------------------------------------+ Advanced bucketing techniques The techniques descriubed here are the most common ways of bucketing data, but
Suggested
"techniques descriubed here" should be "techniques described here"
|
Anonymous |
May 04, 2022 |
PDF |
Page 175
Bottom 1/3rd, 'Reading from SQL Databases' |
# in Python
driver = "org.sqlite.JDBC"
path = "/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
tablename = "flight_info"
-- Probably due to updates in Spark, the above code is the source of an error when running the following line:
# in Python
dbDataFrame = spark.read.format("jdbc").option("url", url)\
.option("dbtable", tablename).option("driver", driver).load()
|
Brian Clements |
Aug 09, 2022 |
Printed |
Page 421 (MLlib in Action)
Training and Evaluation, the example // in scala |
Multi-line chains of transformations need to be enclosed in parentheses, for example as follows
scala> val params = (new ParamGridBuilder()
| .addGrid(rForm.formula, Array(
| "lab ~ . + color:value1",
| "lab ~ . + color:value1 + color:value2"))
| .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
| .addGrid(lr.regParam, Array(0.1, 2.0))
| .build()
| )
to fix the following problem on page 422:
scala> val tvs = new TrainValidationSplit()
scala> .setTrainRatio(0.75) // also the default.
scala> .setEstimatorParamMaps(params)
<console>:32: error: type mismatch;
found : org.apache.spark.ml.tuning.ParamGridBuilder
required: Array[org.apache.spark.ml.param.ParamMap]
res11.setEstimatorParamMaps(params)
|
Martti Laiho |
Feb 06, 2023 |
|
1
chapter 7 section "Basics of Writing Data" |
The option("mode","overwrite") in this example doesn't have any effect when I tried it on Spark 2.2.
dataframe.write.format("csv")
.option("mode", "OVERWRITE")
.option("dateFormat", "yyyy-MM-dd")
.save("path/to/file(s)")
No matter what I give the value of option("mode","anyvalue"), it behaves like mode is "errorIfExists".
Instead, I need to use mode("overwrite")
dataframe.write.format("csv")
.mode("overwrite")
.option("dateFormat", "yyyy-MM-dd")
.save("path/to/file(s)")
Similarly, mode("append") works as expected but not option("mode",...)
Thanks,
Eddie
|
Anonymous |
Jul 12, 2017 |
|
2
Chapter 8 in section Views |
1. Probably wrong word
Views can be either just a saved query plan to be executed against the source table or they can be materialized which means that the results are precomputed (at the risk of going stable if the underlying table changes).
I think you mean "at the risk of going stale" rather than "going stable".
2. In section Creating Views
CREATE VIEW just_usa_view AS
SELECT *
FROM flights
WHERE dest_country_name = 'United States'
We can make it global by leveraging the GLOBAL keyword.
CREATE VIEW just_usa_global AS
SELECT *
FROM flights
WHERE dest_country_name = 'United States'
But there is no GLOBAL keyword in the 2nd example. It's confusing because the 2 view definitions are equivalent except for view name.
|
Anonymous |
Jul 12, 2017 |
|
4
Chapter 8 views |
1. Global view
after creating a global temp view
CREATE GLOBAL TEMP VIEW just_usa_global_view_temp AS
SELECT *
FROM flights
WHERE dest_country_name = 'United States'
I suggest adding a sample query, because it involves the built-in object global_temp, which is not discussed in the chapter. A sample query is just:
select * from global_temp.just_usa_global_view_temp
2. CASE WHEN
The casing of "UNITED STATES" is wrong. The query should be:
SELECT
CASE WHEN DEST_COUNTRY_NAME = 'United States' THEN 1
WHEN DEST_COUNTRY_NAME = 'Egypt' THEN 0
ELSE -1 END
FROM
partitioned_flights;
3. "For example, if we want to see whether or not we have a flight that will take you back from your destination country we could do so by checking whether or not there was a flight that had the destination country as an origin and a flight that had the origin country as a destination."
It seems that the condition could be fulfilled by two flights instead of one. An example meeting the requirement "a flight that will take you back from your destination country" is:
a flight from "United States" to "Japan"
the return flight should be from "Japan" to "United States"
The conditions and the sample query are saying these 2 flights can provide a return flight for "United States" to "Japan":
i. from "Japan" to "Germany"
ii. from "France" to "United States"
I think the query should look like:
SELECT *
FROM flights f1
WHERE EXISTS (
SELECT 1
FROM flights f2
WHERE f2.origin_country_name = f1.dest_country_name AND
f2.dest_country_name = f1.origin_country_name);
Then only a flight from "Japan" to "United States" can be a return flight of "United States" to "Japan".
4. I think the description of when to call "refresh table" is not clear unless one has a background in Hive. Would you consider elaborating on the circumstances under which "refresh table" is needed?
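The correlated EXISTS suggested in point 3 can be sketched in plain Python (an illustrative analog, not code from the book): a flight has a return flight only when another row swaps its origin and destination.

```python
# Illustrative plain-Python analog of the suggested correlated EXISTS:
# keep a flight only if the reversed (origin, dest) pair also appears.
flights = [
    ("United States", "Japan"),
    ("Japan", "United States"),
    ("Japan", "Germany"),       # no return flight in the data, filtered out
]
pairs = set(flights)
with_return = [f for f in flights if (f[1], f[0]) in pairs]
print(with_return)  # → [('United States', 'Japan'), ('Japan', 'United States')]
```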
|
Anonymous |
Jul 13, 2017 |
|
4
chapter 12 |
1) section glom
sc.parallelize(Seq("Hello", "World"), 2).glom().collect()
I suggest using
words.glom().collect()
because the call in the previous section on mapPartitions gives very nice output:
words.mapPartitionsWithIndex(indexedFunc).collect()
res6: Array[String] = Array(Partition: 0 => Spark, Partition: 0 => The, Partition: 0 => Definitive, Partition: 0 => Guide, Partition: 0 => :, Partition: 1 => Big, Partition: 1 => Data, Partition: 1 => Processing, Partition: 1 => Made, Partition: 1 => Simple)
calling words.glom().collect() gives
Array[Array[String]] = Array(Array(Spark, The, Definitive, Guide, :), Array(Big, Data, Processing, Made, Simple))
The output is an Array with 2 child Array's from each partition.
2) section reduceByKey
scala> KVcharaters.reduceByKey(addFunc).collect()
<console>:30: error: not found: value KVcharaters
KVcharaters.reduceByKey(addFunc).collect()
but
scala> KVcharacters.reduceByKey(addFunc(_,_)).collect()
res26: Array[(Char, Int)] = Array((d,4), (p,3), (t,3), (h,1), (l,1), (e,7), (a,4), (i,7), (u,1), (m,2), (b,1), (n,2), (f,1), (v,1), (:,1), (r,2), (s,4), (k,1), (o,1), (g,3), (c,1))
3) Is it possible to elaborate on treeAggregate? Even the Spark documentation doesn't explain it well.
4) section Joins
scala> sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble())))
<console>:36: error: type mismatch;
found : org.apache.spark.rdd.RDD[(Char, Double)]
required: Seq[?]
Error occurred in an application involving default arguments.
sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble())))
It's due to re-using the variable distinctChars. The nearest definition is
val distinctChars = words
.flatMap(word => word.toLowerCase.toSeq)
.distinct
which gives the error but the previous one in section sampleByKey
val distinctChars = words
.flatMap(word => word.toLowerCase.toSeq)
.distinct
.collect()
works for sc.parallelize(distinctChars.map(c => (c, new Random().nextDouble())))
Instead of having two definitions of distinctChars, I fixed my working examples by using one definition of distinctChars in sampleByKey:
val distinctChars = words.
flatMap(word => word.toLowerCase.toSeq).
distinct
and then add collect() before toMap
val sampleMap = distinctChars.map(c => (c, new Random().nextDouble())).collect().toMap
Remove the definition of distinctChars in section Joins and define keyedChars as
val keyedChars = distinctChars.map(c => (c, new Random().nextDouble()))
|
Anonymous |
Jul 14, 2017 |
PDF |
Page 21
NOTE |
The phrase in the note on page 21 has a typo:
In local mode,
the driver and executurs run (as threads) on your individual computer instead of a cluster.
Word: executurs
|
Anonymous |
Feb 27, 2021 |
ePub |
Page 23
4th paragraph |
In local mode, the driver and executurs run (as threads) on your individual computer instead of a cluster.
While not being an expert at all in the technology, I would guess it was meant to be written:
[...] the driver and executors run [...]
|
Philippe Bourrel |
Dec 24, 2019 |
PDF |
Page 42
Line 2 |
This is from page 42 of the abbreviated book provided directly by Databricks. Actual page number in final document is probably different.
GraphFrames syntax for pageRank is incorrect.
PDF has
ranks = stationGraph.pageRank(maxIter=10).resetProbability(0.15).run()
Correct syntax is
ranks = stationGraph.pageRank(maxIter=10, resetProbability=0.15)
|
Dave Welden |
Dec 06, 2017 |
Printed |
Page 51
3rd paragraph |
"is aslightly inaccurate" should be "is slightly inaccurate"
|
Kye Okabe |
Mar 08, 2020 |
ePub |
Page 61
4th paragraph |
"The only difference will by syntax."
|
Philippe Bourrel |
Jan 03, 2020 |
Printed |
Page 109
first code snippet, Python example |
It reads: df.select(map(col("Description")...
It should read:
df.select(create_map(col("Description")...
|
Sergio SainzPalacios |
Jun 07, 2020 |
Printed |
Page 113
Figure 6-2 |
The figure caption for Figure 6-2 says "Figure caption" (should be something along the lines of "Overview of the internal process when using UDFs written in Python").
|
Kye Okabe |
Mar 08, 2020 |
Printed |
Page 114
3rd code block |
It seems that the code block belongs with the previous paragraph; I cannot see what value the code block provides where it stands.
When you want to optionally return a value from a UDF, you should return None in Python and an Option type in Scala:
## Hive UDFs
|
Acan Chen |
Apr 08, 2021 |
Printed |
Page 126
Top 1/3rd |
the first grouping SQL command reads:
SELECT
COUNT(*)
FROM
DfTable
GROUP BY
InvoiceNo, CustomerID
The result shows 3 columns: InvoiceNo, CustomerID and Count.
To display a result, a column must be listed in a select statement. The correct query should be:
SELECT
InvoiceNo, CustomerID, COUNT(*)
FROM
DfTable
GROUP BY
InvoiceNo, CustomerID
|
Brad Lee HInes |
Apr 30, 2021 |
Printed |
Page 166
First, top paragraph |
It reads
"Although SQLite makes for a good reference example, it's probablu not".
It should read "probably"
|
Sergio Sainz Palacios |
Jun 07, 2020 |
Printed |
Page 255
First sentence of the completion paragraph |
"driver processs exits" should be "driver process exits"
|
Kye Okabe |
Mar 09, 2020 |
Printed |
Page 279
7th |
"In addition to the variables ust listed " should read "In addition to the variables just listed"
|
Emmanuel Mashandudze |
Nov 11, 2019 |
Printed |
Page 281
First sentence under the three bullets |
"For the most, part Spark [...]" should be "For the most part, Spark [...]"
|
Kye Okabe |
Mar 09, 2020 |
Printed |
Page 321
2nd paragraph, last sentence |
"megatbytes" should be "megabytes"
|
Kye Okabe |
Mar 01, 2020 |
Printed |
Page 337
3rd paragraph |
Actual sentence: "What if a machine in a sytem fails, losing some state?"
Should be: "What if a machine in a system fails, losing some state?"
|
Anonymous |
Jan 24, 2022 |
Printed |
Page 340
last paragraph, last sentence |
The sentence seems a bit odd.
"[...] the streaming applications that are large-scale enough to need to distribute their computation tend to prioritize throughput [...]"
Perhaps something like
"[...] large-scale streaming applications that need to distribute their computation tend to prioritize throughput [...]"
would sound a bit smoother.
|
Kye Okabe |
Mar 01, 2020 |
PDF |
Page 411
3rd code block |
In the Python code, the call print lr.explainParams() is missing parentheses; it should be print(lr.explainParams()).
|
Anonymous |
Mar 12, 2021 |
Other Digital Version |
12164
Bisecting k-means Summary |
This is from the Kindle version which doesn't include page numbers, so I included the "Location" instead.
In the section titled "Bisecting k-means Summary" in Chapter 29, I think there is a small typo: instead of using the bisecting k-means model to find information, the example uses the normal k-means model that was introduced in the previous section.
So I believe that:
kmModel.computeCost(sales)
println("Cluster Centers: ")
kmModel.clusterCenters.foreach(println)
Should probably be:
bkmModel.computeCost(sales)
println("Cluster Centers: ")
bkmModel.clusterCenters.foreach(println)
|
Matthew Dabbert |
Nov 04, 2019 |