Errata for Advanced Analytics with Spark

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted by Date submitted
PDF Page ch 9
overall

The Google Finance API is no longer available, so none of the examples in this chapter can be tested: the data cannot be downloaded.

Anonymous  Jan 16, 2018 
PDF Page 8
Last paragraph, second last line

"preserve the book as an useful resource"

should instead be

"preserve the book as a useful resource"

Anonymous  May 12, 2017 
Printed Page 17
last paragraph

Typo: "linakge" instead of "linkage".

val rawblocks = sc.textFile("linakge")

should instead be

val rawblocks = sc.textFile("linkage")

Anonymous  Feb 20, 2019 
PDF Page 29
second code bundle

parsed.
groupBy("is_match").
count().
orderBy($"count".desc)
show()

-->

parsed.
groupBy("is_match").
count().
orderBy($"count".desc).
show()

The '.' after orderBy($"count".desc) is omitted.

Anonymous  Oct 09, 2017 
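
A note on why the missing '.' in the page 29 erratum matters: in Scala, a line that ends with ')' can end a statement, so without the trailing dot the following show() is read as a separate statement rather than as a call on the sorted result. The sketch below illustrates the same trailing-dot chaining style with plain Scala collections instead of a DataFrame. The object name TrailingDotSketch and the sample labels are hypothetical and do not come from the book.

```scala
// Plain Scala collections stand in for the DataFrame; no Spark is required.
object TrailingDotSketch {
  // Hypothetical sample data, not taken from the book.
  val labels: List[String] = List("true", "false", "false", "false")

  // Every line except the last ends with '.', so the parser keeps
  // reading the chain as a single expression.
  val counts: List[(String, Int)] = labels.
    groupBy(identity).
    map { case (label, group) => (label, group.size) }.
    toList.
    sortBy { case (_, n) => -n }

  def main(args: Array[String]): Unit = {
    counts.foreach(println)
  }
}
```

Dropping the '.' at the end of any intermediate line would end the expression there, which is the failure the erratum describes.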
PDF Page 31
2nd last paragraph

"option of treating the any DataFrame that we create"

the word "the" should be removed.

Anonymous  May 15, 2017 
PDF Page 34
2nd paragraph

in "that would be valid inside of a WHERE clase in Spark SQL"

"clase" should be "clause"

Anonymous  May 15, 2017 
PDF Page 35
first tip

"isn't comprised of"

should instead be

"doesn't comprise"

Anonymous  May 13, 2017 
PDF Page 82
right above the code sample

"Here, MulticlassMetrics is perfectly usage with a DataFrame containing predictions."

The phrase "perfectly usage" needs to be fixed (presumably "perfectly usable").

Anonymous  May 16, 2017 
PDF Page 86
first equation

One of the p variables is missing a subscript i

Anonymous  May 17, 2017 
PDF Page 137
Overall CH 7

I have read both the 1st and 2nd editions of this book.

In the 2nd edition, the sample data changed (from the 2014 version to the 2016 version), but some of the quoted results still come from the older data.

For example, page 144 says "there are more than 13,000 different major topics in our data set", but 13,000 is the result from the 1st edition. Of course, 14548 is also more than 13,000, but there are more such mistakes:

3rd paragraph of page 148: "which only has 13,000 vertices in the graph" -> "which only has 14,500 vertices in the graph"

Last paragraph of page 153: 13034 and 12065 -> 14548 and 13721

I may not have found every such mistake.

Anonymous  Oct 08, 2017 
PDF Page 143
3rd code piece

def majorTopics(record: String)={...}
majorTopics(elem)

elem is not a String, so I think the argument should be elem.toString() or rawXml.

Anonymous  Oct 10, 2017 
PDF Page 151
1st paragraph, 2nd line

contains only 4 vertices -> contains only 5 vertices

Anonymous  Oct 11, 2017 
PDF Page 151
code pieces in this page

val topicComponentDF = topicGraph.vertices.innerJoin(
connectedComponentGraph.vertices) {
(topicId, name, componentId) => (name, componentId.toLong)
}.toDF("topic", "cid")

This code does not work properly. In the resulting DataFrame, the values of cid end up in the topic column.

Anonymous  Oct 11, 2017 
PDF Page 151
2nd paragraph

"Let’s take a look at the topic names for the largest connected component that wasn’t a part of the giant component:"

But the example shown is the third largest connected component, not the second largest.

Anonymous  Oct 11, 2017 
PDF Page 151
code pieces in this page

The code

val topicComponentDF = topicGraph.vertices.innerJoin(
connectedComponentGraph.vertices) {
(topicId, name, componentId) => (name, componentId.toLong)
}.toDF("topic", "cid")

generates a DataFrame with the schema

topic: long
cid: struct
  _1: string
  _2: long

so the following query must be changed to

topicComponentDF.where("cid._2 = -2062883918534425492").show(false)

The result is then:
+--------------------+-----------------------------------------------+
|topic               |cid                                            |
+--------------------+-----------------------------------------------+
|-1870678893086276394|[Serotyping,-2062883918534425492]              |
|-1233269114313988317|[Campylobacter coli,-2062883918534425492]      |
|-2062883918534425492|[Campylobacter jejuni,-2062883918534425492]    |
|4763791955467795057 |[Campylobacter Infections,-2062883918534425492]|
+--------------------+-----------------------------------------------+

Anonymous  Nov 05, 2017 
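
The schema reported in the page 151 errata above is consistent with the joined vertices being pairs of a vertex id and a (name, component id) tuple: naming the two outer fields "topic" and "cid" would put the id in "topic" and the whole tuple in "cid". The following is a minimal sketch of that shape using plain Scala collections rather than GraphX. The object name VertexPairSketch and the small ids are made up for illustration; only the alias of VertexId to Long mirrors GraphX.

```scala
// Plain Scala stand-in for the joined vertex collection; no Spark is required.
object VertexPairSketch {
  // GraphX defines VertexId as an alias for Long.
  type VertexId = Long

  // Hypothetical rows shaped like (vertexId, (topicName, componentId)).
  // The ids here are invented and are not the book's values.
  val joined: Seq[(VertexId, (String, Long))] = Seq(
    (1L, ("Serotyping", 10L)),
    (2L, ("Campylobacter coli", 10L))
  )

  // Dropping the vertex id leaves two-field rows of (topicName, componentId),
  // which is the shape the column names "topic" and "cid" appear to expect.
  val rows: Seq[(String, Long)] = joined.map { case (_, nameAndCid) => nameAndCid }

  def main(args: Array[String]): Unit = {
    rows.foreach(println)
  }
}
```

In Spark, the analogous step would presumably be to drop the key (for example with the pair-RDD method values) before calling toDF("topic", "cid"). That is an assumption on the editor's part and has not been confirmed by the author.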
PDF Page 158
last paragraph (continues onto the next page)

The mean degree for the original graph was about 43, and the mean degree for the filtered graph has fallen a bit, to about 28. More interesting, however, is the precipitous drop in the size of the largest degree vertex, which has fallen from 3,753 in the original graph to 1,603 in the filtered graph. If we look at the association between concept and degree in the filtered graph, we see this:

These numbers are from the 1st edition's example.

For the 2nd edition's example they should be:
43 -> 31
28 -> 20
3753 -> 2596
1603 -> 863

Anonymous  Oct 11, 2017 
PDF Page 170
bottom code block & link in paragraph above

The link to the taxi trips data set should be https://storage.googleapis.com/aas-data-sets/trip_data_1.csv.zip

Marie Beaugureau  Oct 16, 2017 
PDF Page 197
3rd paragraph, last line

"a estimate" should instead be "an estimate"

Anonymous  May 19, 2017 
PDF Page 200
middle

"We can represent out dates as LocalDate objects"

"out" should be "our"

Anonymous  May 21, 2017