Errata for Spark: The Definitive Guide

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
PDF,
chapter 1, 8th paragraph

Chapter 1, paragraph 8 (from the ebook):
* "The last piece relevant piece for us is the cluster manager." Looks like a grammar mistake: "piece" is repeated twice.
* Typo on "appliications"

Note from the Author or Editor:
Fixed for first release

Saad Khawaja  Oct 17, 2017  Feb 08, 2018
NA
subsection Columns, 2nd Paragraph

There seems to be a typo in Chapter 3, subsection Columns, paragraph 2: "this column may or may not exist in our of our DataFrames." It should probably be
"this column may or may not exist in our DataFrames." instead.

Note from the Author or Editor:
Fixed for first release.

Emmanuel Asimadi  Nov 18, 2017  Feb 08, 2018
NA
Chapter 3, Subsection "Creating Row"

The return type noted in the comment below should be Int instead of String.
myRow.getInt(2) // String

Note from the Author or Editor:
Fixed for first release.

Emmanuel Asimadi  Nov 18, 2017  Feb 08, 2018
NA
Chapter 3, Subsection "Creating Dataframes"

Probably should be "encounter" instead of "encourage".
"With these three tools, you should be able to solve the vast majority of transformation challenges that you may encourage in DataFrames."

Note from the Author or Editor:
Fixed for first release.

Emmanuel Asimadi  Nov 18, 2017  Feb 08, 2018
na
Chapter 5, Section: Aggregating to complex types

A phrase is repeated:
A cube takes the rollup *takes a rollup* to a level deeper.

Note from the Author or Editor:
Fixed for first release.

Emmanuel Asimadi  Nov 22, 2017  Feb 08, 2018
PDF,
Page cover

The cover of the 1st edition still says it's an "Early Release".

Harald Gegenfurtner  Dec 31, 2017  Feb 08, 2018
I
"Who This Book is For" section

There is a typo of "efficienly". The correct word is "efficiently".

Note from the Author or Editor:
Fixed for first release.

Keiji Yoshida  Jan 12, 2018  Feb 08, 2018
1
Chapter 1, under the "Spark Applications" header, just before Figure 1-1

Hi,

Thank you for such a great resource on Spark.

There is just a little typo on the first chapter, under the "Spark Applications" header, just before Figure 1-1 (read on Safari Books Online), where applications is misspelt as "appliications" (see below, last sentence).

"The last piece relevant piece for us is the cluster manager. The cluster manager controls physical machines and allocates resources to Spark applications. This can be one of several core cluster managers: Spark’s standalone cluster manager, YARN, or Mesos. This means that there can be multiple Spark appliications running on a cluster at the same time."

Best,

Simon Bensoussan

Note from the Author or Editor:
Fixed for first release.

Simon Bensoussan  Mar 27, 2017  Feb 08, 2018
1
First chapter (Safari Books Online), in the "A Basic Transformation Data Flow" section, under Figure-9.

Hi,

Comma-separated values misspelt as "comma seperated value".

Paragraph:
"Now hopefully you have grasped the basics but let’s just reinforce some of the core concepts with another data pipeline. We’re going to be using the same flight data used except that this time we’ll be using a copy of the data in comma seperated value (CSV) format."

Note from the Author or Editor:
Fixed for first release.

Simon Bensoussan  Mar 27, 2017  Feb 08, 2018
1
Chapter 16

chapter 15 and chapter 16 have the same content on Safari Books Online early release https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/

Here is chapter 15: https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/ch15.html
Here is chapter 16: https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/ch16.html

Note from the Author or Editor:
Fixed for first release.

Anonymous  Jul 16, 2017  Feb 08, 2018
PDF,
Page 10
4th paragraph

PDF has "ight" instead of "might" in the paragraph describing Lazy Evaluation.

5th sentence has ---
An example of this ight be “predicate pushdown”

I suppose it should be ---
An example of this might be “predicate pushdown”

Note from the Author or Editor:
Fixed for first release.

Pradeep Nalabalapu  Jun 07, 2017  Feb 08, 2018
PDF,
Page 19
Chapter 1, paragraph 3

Last word of the paragraph contains a typo: "langauge". It should be "language":

"Spark Core consists of two APIs. The Unstructured and Structured APIs. The Unstructured API is Spark’s lower level set of APIs including Resilient Distributed Datasets (RDDs), Accumulators, and Broadcast variables. The Structured API consists of DataFrames, Datasets, Spark SQL and is the interface that most users should use. The difference between the two is that one is optimized to work with structured data in a spreadsheet-like interface while the other is meant for manipulation of raw java objects. Outside of Spark Core sit a variety of tools, libraries, and languages like MLlib for performing machine learning, the GraphX module for performing graph processing, and SparkR for working with Spark clusters from the R langauge."

Note from the Author or Editor:
Fixed for first release.

Anonymous  Jan 18, 2018  Feb 08, 2018
Printed
Page 20
3rd paragraph (Lazy Evaluation section)

In the start of the paragraph "Lazy evaulation" the word "evaluation" has a typo.

Note from the Author or Editor:
Yup, it does! Please correct the spelling.

Sertan Şentürk  Apr 13, 2018 
PDF,
Page 30
Last Paragraph (scala version of code)

On Page-30, below is the original scala version of code -

%scala
purchaseByCustomerPerHour.writeStream
  .format("memory") // memory = store in-memory table
  .queryName("customer_purchases") // counts = name of the in-memory table
  .outputMode("complete") // complete = all the counts should be in the table
  .start()

On 4th line, the comments for ".queryName() method" -
Original:
// counts = name of the in-memory table

Rectified:
// customer_purchases = name of the in-memory table

Thanks,
Manish Bahrani

Note from the Author or Editor:
This was fixed for the first release.

Manish Bahrani  Jul 05, 2017  Feb 08, 2018
Printed
Page 35
Scala code at the bottom

The code is missing a "sort descending". It is implied this was present at some point, both from the import and from the results on the next page (which you only get if you apply a sort), but it is no longer in either the Scala or the Python code.

The code should be this :
staticDataFrame
.selectExpr(
"CustomerID",
"(UnitPrice * Quantity) as total_cost",
"InvoiceDate")
.groupBy(
col("CustomerID"), window(col("InvoiceDate"), "1 day"))
.sum("total_cost")
.withColumnRenamed("sum(total_cost)","daily_total")
.sort(desc("daily_total"))
.show(5)

Note from the Author or Editor:
Re-reading, I'm not sure exactly where the sort should be. I see your point but don't think it's 100% necessary for the point that we're getting across. I think we should just remove the sort from the import statements.

import org.apache.spark.sql.functions.{window, column, desc, col}
should become
import org.apache.spark.sql.functions.{window, col}

Tom Geudens  Apr 10, 2018 
Printed
Page 37
middle code block

The code blocks for both Scala and Python define purchaseByCustomerPerHour, which is a very specific name, but the window function used is window(col("InvoiceDate"), "1 day"). I'm not a specialist on the Spark function set yet, but based on what I read there, I would say it should be PerDay and not PerHour?

Also, using col("InvoiceDate") in one example and $"InvoiceDate" in the next without explanation is confusing (sure, they both probably mean the same, but this is page 37 ... we're not specialists yet).

Note from the Author or Editor:
InvoiceDate is a timestamp column so per hour is correct (but completely understand where you're coming from).

As for the dollar signs, you're right - we talk about those in a later chapter but should probably properly introduce them. Sorry about that. We'll change them to col("InvoiceDate") to help with a bit more clarity at this point.

Tom Geudens  Apr 10, 2018 
Printed
Page 44
Last two lines

'The only difference will by syntax' should read 'The only difference will be syntax'

Note from the Author or Editor:
Yes, this is correct. We should change this.

Elias Strehle  Mar 28, 2018 
PDF,
Page 61
Last Section

Hi,

There is a Typo error in the first line on Pg-61 under section "Creating Rows".

Original:
You can create rows by manually instantiating a Row object with the values that below in each column.

Rectified:
You can create rows by manually instantiating a Row object with the values that belong in each column.

Thanks,
Manish Bahrani

Note from the Author or Editor:
This typo was fixed for the first release.

Manish Bahrani  Jul 05, 2017  Feb 08, 2018
Printed
Page 74
Changing a Column's Type (cast)

The count-column is actually already of the LongType (which you show on page 60). So it may make more sense to cast("integer").

Note from the Author or Editor:
Nice catch.

I think to make this even more clear, we should change the code block and that paragraph.

Let's change the text and code to:
For instance, let’s convert our count column from an `Integer` to a `String`:

df.withColumn("count2", col("count").cast("string"))

-- in SQL
SELECT *, cast(count as string) AS count2 FROM dfTable

Tom Geudens  Apr 15, 2018 
Printed
Page 90
second code block

The describe method will actually compute statistics on almost any column, not just numeric ones. df.describe.show() also shows results for Country and Description (string), but not for InvoiceDate (timestamp). This is also reflected if you select these columns:
scala> df.select("Description").describe().show()
+-------+--------------------+
|summary| Description|
+-------+--------------------+
| count| 3098|
| mean| null|
| stddev| null|
| min| 4 PURPLE FLOCK D...|
| max|ZINC WILLIE WINKI...|
+-------+--------------------+

scala> df.select("InvoiceDate").describe().show()
+-------+
|summary|
+-------+
| count|
| mean|
| stddev|
| min|
| max|
+-------+

Note from the Author or Editor:
This may have been a more recent change, because what was displayed in the book was what was shown for me when I ran the code.

In the paragraph before, let's change "all numeric columns" to just say "relevant columns".

Also, after the following sentence "This will take all numeric columns and calculate the count, mean, standard deviation, min, and max."

Let's add:

"This schema may change over time as new types are supported, don't depend too heavily on this schema (or behavior)."

Tom Geudens  Apr 17, 2018 
Printed
Page 97
7th line in code block

The 7th line in the '# in python' code block at the top of the page contains an undefined variable 'c'. This should be 'color_string' instead:

'.alias("is_" + color_string)'

Note from the Author or Editor:
Yes, you are correct! We should make this change.

Elias Strehle  Mar 28, 2018 
Printed
Page 98
1st sentence after code block

'Although Spark will do read dates or times on a best-effort basis' should read 'Spark will read dates or times on a best-effort basis'

Note from the Author or Editor:
Read/do should be "parse" in the future. This is good feedback.

Elias Strehle  Mar 28, 2018 
Printed
Page 102
Last paragraph, 5th sentence

'When we declare [...] not having a null time [...]' should read 'When we declare [...] not having a null type [...]'

Note from the Author or Editor:
We should make this change.

Elias Strehle  Mar 28, 2018 
Printed
Page 122
sumDistinct code block

The SQL statement for sumDistinct is not correct because the DISTINCT keyword is missing. It should be:
scala> spark.sql("""SELECT sum(DISTINCT Quantity) FROM dfTable""").show()
+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
| 29310|
+----------------------+

Note from the Author or Editor:
You're correct, we need to add the DISTINCT keyword to that SQL statement under the sumDistinct heading.
It should state
SELECT SUM(DISTINCT Quantity) FROM dfTable -- 29310

instead of
SELECT SUM(Quantity) FROM dfTable -- 29310

Tom Geudens  Apr 24, 2018 
129
SQL query of sumDistinct

The sumDistinct example in SQL format requires correction.

SELECT sum( distinct Quantity) FROM dfTable

Note from the Author or Editor:
This is correct, however it's on page 122. It should be fixed there. Searching for "SELECT sum(Quantity) FROM dfTable" will show you the right location.

Amit Kumar  Nov 15, 2018 
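
To see why the missing DISTINCT keyword matters, here is a minimal sketch that runs both forms of the statement. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than Spark SQL, and the dfTable contents are made-up values, not the book's retail data.

```python
import sqlite3

# Hypothetical stand-in for the book's dfTable; the quantities are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dfTable (Quantity INTEGER)")
conn.executemany(
    "INSERT INTO dfTable (Quantity) VALUES (?)",
    [(6,), (6,), (8,), (2,), (2,)],
)

# Plain SUM adds every row, duplicates included.
plain_sum = conn.execute("SELECT SUM(Quantity) FROM dfTable").fetchone()[0]

# SUM(DISTINCT ...) adds each distinct value only once.
distinct_sum = conn.execute(
    "SELECT SUM(DISTINCT Quantity) FROM dfTable"
).fetchone()[0]

print(plain_sum, distinct_sum)  # prints: 24 16
```

With duplicate quantities in the table, the two statements return different totals, so the SQL shown under the sumDistinct heading needs the keyword to match the DataFrame call.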
Printed
Page 131
first line of python code

The piece of code should clear nulls, but the .na has not been included.

the line:
dfNoNull = dfWithDate.drop()

should be:
dfNoNull = dfWithDate.na.drop()

Note from the Author or Editor:
Think this might have been fixed already but if not, please fix it.

Jonathan Wharton  Jan 12, 2019 
Printed
Page 155
Last paragraph, 3rd sentence

"format is optional because by default, Spark will use the arquet format." should read "format is optional because by default, Spark will use the parquet format.".

Note from the Author or Editor:
This fix is correct!

Anonymous  Jan 19, 2019 
Printed
Page 194
last set of SQL code on page

SELECT * FROM flights
WHERE origin_country_name IN (SELECT dest_country_name FROM flights
GROUP BY dest_country_name ORDER BY sum(count) DESC LIMIT 5)

should actually be:

SELECT * FROM flights
WHERE origin_country_name IN (SELECT dest_country_name FROM flights
GROUP BY dest_country_name ORDER BY sum(count) DESC) LIMIT 5

i.e. right parenthesis is in wrong place.

Note from the Author or Editor:
Please fix this, as described!

Jonathan Wharton  Jan 15, 2019 
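
The two statements in this entry are not equivalent, and a small sketch can show how the position of the parenthesis changes what LIMIT applies to. This is an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than Spark SQL, the flights rows are made up, and it uses LIMIT 2 instead of LIMIT 5 so that three countries are enough to show the difference.

```python
import sqlite3

# Hypothetical stand-in for the flights table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE flights ('
    'dest_country_name TEXT, origin_country_name TEXT, "count" INTEGER)'
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [
        ("United States", "Canada", 10),
        ("Canada", "United States", 9),
        ("Mexico", "United States", 1),
        ("United States", "Mexico", 5),
        ("Canada", "Mexico", 4),
        ("Mexico", "Canada", 2),
    ],
)

# LIMIT inside the parentheses: the subquery keeps only the top 2
# destinations, and the outer query returns every flight whose origin
# is one of them.
limit_inside = conn.execute(
    """
    SELECT * FROM flights
    WHERE origin_country_name IN (SELECT dest_country_name FROM flights
        GROUP BY dest_country_name ORDER BY sum("count") DESC LIMIT 2)
    """
).fetchall()

# LIMIT outside the parentheses: the subquery returns every destination,
# and the outer query is cut to 2 rows (which 2 is not specified, since
# the outer query has no ORDER BY).
limit_outside = conn.execute(
    """
    SELECT * FROM flights
    WHERE origin_country_name IN (SELECT dest_country_name FROM flights
        GROUP BY dest_country_name ORDER BY sum("count") DESC) LIMIT 2
    """
).fetchall()

print(len(limit_inside), len(limit_outside))  # prints: 4 2
```

In this sketch the first form filters flights by the top destinations, while the second form simply truncates the overall result, so which one belongs on the page depends on what the surrounding text intends the query to return.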
Printed
Page 212
Last sentence

'You get the both of best worlds.' I think the incorrect is order.

Note from the Author or Editor:
Absolutely. :)

Elias Strehle  Mar 28, 2018 
Printed
Page 229
First paragraph in section 'Understanding Aggregation Implementations'

'We'll do these in the context of a key, but the same basic principles apply to the groupBy and reduce methods' should read 'We'll do these in the context of a key, but the same basic principles apply to the groupByValue and reduceValue methods'

Note from the Author or Editor:
This is probably a fair criticism if we're referring explicitly to the "method calls" instead of just the "method of implementation". We should clean these up to make sure they're consistent and your rewrite is probably a great start.

Elias Strehle  Mar 28, 2018 
Printed
Page 245
1st paragraph of subsection 'Custom Accumulators'

'In this example, you we will add [...]' should contain either 'you' or 'we', not both

Note from the Author or Editor:
Thanks for this feedback, we'll make the change!

Elias Strehle  Mar 28, 2018 
Printed
Page 256
2nd paragraph, last word

'Appication' should read 'Application'

Note from the Author or Editor:
Yes, it should!

Elias Strehle  Mar 29, 2018 
Printed
Page 257
Info box, 6th sentence

'communtiy' should read 'community'

Note from the Author or Editor:
Yes, please fix.

Elias Strehle  Mar 29, 2018 
Printed
Page 272
3rd paragraph, 1st sentence

'When submitting applciations, [...]' should read 'When submitting applications, [...]'

Note from the Author or Editor:
Yes, please fix.

Elias Strehle  Mar 29, 2018 
Printed
Page 276
1st and 2nd paragraph

The code block should be below the 2nd paragraph, not above, so the last sentence 'The example that follows [...]' becomes correct

Note from the Author or Editor:
Please change this to "The previous example configures..."

Elias Strehle  Mar 29, 2018 
Printed
Page 336
Subsection 'Real-time decision making', 2nd sentence

The last word 'fradulent' should read 'fraudulent'

Note from the Author or Editor:
Yes, it should!

Elias Strehle  Apr 03, 2018 
Printed
Page 339
1st paragraph, 2nd sentence

'[...] require deep expertise to be develop and maintain.' should read '[...] require deep expertise to be developed and maintained.'

Note from the Author or Editor:
Yes, let's make this change.

Elias Strehle  Apr 03, 2018 
Printed
Page 342
1st paragraph, 4th sentence

'[...] (all of its the windowing operators [...]' should read '[...] (all of its windowing operators [...]' or '[...] (all of the windowing operators [...]'

Note from the Author or Editor:
Let's change to "all of its windowing operators"

Elias Strehle  Apr 03, 2018 
Printed
Page 372
2nd paragraph, code block

The code block

'
spark.sql("SELECT * FROM events_per_window").printSchema()
SELECT * FROM events_per_window
'

contains two minor errors:
1) It should be '.show()' instead of '.printSchema()' to be consistent with the 3rd paragraph.
2) For Python, the code should reference 'pyevents_per_window' instead of 'events_per_window'.

Note from the Author or Editor:
Yes it should be ".show" I agree with 1).

However, for 2), we had to reduce the number of code blocks. It is fine as is and we hope readers will change it accordingly.

Elias Strehle  Apr 03, 2018 
Printed
Page 378
Section 'Arbitrary Stateful Processing', 1st sentence

'The first section if this chapter [...]' should read 'The first section of this chapter [...]'

Note from the Author or Editor:
Indeed!

Elias Strehle  Apr 03, 2018 
Printed
Page 381
General note

'[...] output of the dream [...]' is a lovely metaphor, but should probably read '[...] output of the stream [...]'

Note from the Author or Editor:
I almost want to leave it because it makes me smile. But yes, we should change this.

Elias Strehle  Apr 03, 2018 
Printed
Page 402
3rd Paragraph

The sentence, "O'Reilly should we link to or mention any specific ones?" is left in the text.

Note from the Author or Editor:
Yes, we should remove this sentence.

Anonymous  Mar 27, 2018 
Printed
Page 437
Subsection 'Advanced bucketing techniques', 1st sentence

'descriubed' should read 'described'

Note from the Author or Editor:
It should!

Elias Strehle  Apr 03, 2018 
Printed
Page 462
Subsection 'Multilabel Classification', 4th sentence

'Another example of multilabel classification is identifying the number of objects that appear in an image.'

This is not true: Predicting the number of objects is neither a multilabel problem (since only one number is predicted for an image) nor a classification problem (since there are infinitely many possible values).

The sentence could be replaced by the following: 'Another example of multilabel classification is identifying the objects that appear in an image.'

Note from the Author or Editor:
The wording is a bit imprecise and I agree with your proposed correction.

Elias Strehle  Apr 03, 2018 
Printed
Page 518
1st paragraph, 3rd sentence

'[...] combine motif finding with DataFarme queries [...]' should read '[...] combine motif finding with DataFrame queries [...]'

Note from the Author or Editor:
You are correct! We will make this change!

Elias Strehle  Apr 04, 2018