# Errata

## Errata for Advanced Analytics with Spark

The errata list records errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is shown in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.


Version | Location | Description | Submitted By | Date Submitted | Date Corrected
Safari Books Online
Chapter 9 Section Running the Trials

val trials = seedRdd.flatMap(trialReturns(_, numTrials / parallelism, bFactorWeights.value, factorMean

fails with:

Note from the Author or Editor:
More discussion is in https://github.com/sryza/aas/issues/64 including a potential workaround. Let's move there.

Ranko Mosic  Mar 08, 2016
PDF
Page 13
2nd Paragraph, curl command

The curl command used is curl -o donation.zip http://bit.ly/1Aoywaq. bit.ly responds with a 301 redirect, which curl on my system (curl 7.37.1, x86_64-apple-darwin14.0) does not follow by default. To fix this, the command should be curl -L -o donation.zip http://bit.ly/1Aoywaq

Note from the Author or Editor:

whaley  Apr 10, 2015  Aug 07, 2015
PDF
Page 28
5th line from bottom

In the text: "... decompressing and then serializing the results, and finally, performing computations on the aggregated data", the word "serializing" should be "deserializing".

Sean Owen

May 11, 2015  Aug 07, 2015
PDF
Page 33
The code snippet on top of the page

The code snippet reads: val misses = parsed.filter($"is_match" === false), the false value isn't wrapped in the lit() function. In the paragraph below where this code is talked through, the author claims that "we need to wrap the boolean literal false with the lit function". This needs clarification, is the code snippet missing a lit( ) around false or the paragraph below is incorrect and we don't in fact need a lit function in there?

Note from the Author or Editor:
Hm, are you sure? that seems to compile fine. The Column class defines an === method that takes "Any" and wraps its arg in lit() anyway. However I agree that the text says it's necessary, but I'm not sure it is. I'm not sure I'm able to update the text of this chapter anymore, but I would indeed make these consistent one way or the other.

Jacek Jankowiak  Jan 23, 2021
ePub
Page 38%
1

Hi,

The Wikipedia files appear to be corrupt:

$ curl -s -L http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -cd | ~/hadoop-2.7.0/bin/hadoop fs -put - wikidump.xml

bzip2: Data integrity error when decompressing.
Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

Same thing when I try to download and extract them from Firefox:

ubuntu@ip-10-0-1-186:/data$ bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2

bzip2: Data integrity error when decompressing.
Input file = enwiki-latest-pages-articles-multistream.xml.bz2, output file = enwiki-latest-pages-articles-multistream.xml

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

bzip2: Deleting output file enwiki-latest-pages-articles-multistream.xml, if it exists.
ubuntu@ip-10-0-1-186:/data$ ls
enwiki-latest-pages-articles-multistream.xml.bz2 lost+found
ubuntu@ip-10-0-1-186:/data$ bzip2 -dtvv enwiki-latest-pages-articles-multistream.xml.bz2
enwiki-latest-pages-articles-multistream.xml.bz2:
[1: huff+mtf rt+rld]
[1: huff+mtf rt+rld]
[2: huff+mtf rt+rld]
[1: huff+mtf rt+rld]
[2: huff+mtf rt+rld]
[3: huff+mtf rt+rld]
[4: huff+mtf rt+rld]
[5: huff+mtf rt+rld]
[1: huff+mtf data integrity (CRC) error in data

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

ubuntu@ip-10-0-1-186:/data$ bzip2recover enwiki-latest-pages-articles-multistream.xml.bz2
bzip2recover 1.0.6: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
block 1 runs from 80 to 4640
block 2 runs from 4808 to 1948251
block 3 runs from 1948300 to 3752034
block 4 runs from 3752200 to 5832866
block 5 runs from 5832915 to 7818462
block 6 runs from 7818511 to 9886990

...

Any ideas?

Note from the Author or Editor:
Hm, you're right. It seems like the dumps starting with April 3 have this problem. March 4 seems OK. We should change the text to refer to that specific version.

On page 102, the URL should change latest -> 20150304 in two places:

$ curl -s -L http://dumps.wikimedia.org/enwiki/20150304/\
$ enwiki-20150304-pages-articles-multistream.xml.bz2 \
...

David Laxer  May 26, 2015  Aug 07, 2015
Printed, PDF
Page 43
lines 4 and 5

1. “Collaborative Filtering for Implicit Feedback Datasets”
shortener: http://bit.ly/1ALoX4q which goes to: https://research.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf which is now 404.

The paper can now be found here:
http://yifanhu.net/PUB/cf.pdf

2. “Large-scale Parallel Collaborative Filtering for the Netflix Prize”
shortener http://bit.ly/16im1AT which now goes to: https://www.labs.hpe.com/about

The paper can now be found here:
https://endymecy.gitbooks.io/spark-ml-source-analysis/content/%E6%8E%A8%E8%8D%90/papers/Large-scale%20Parallel%20Collaborative%20Filtering%20the%20Netflix%20Prize.pdf

Note from the Author or Editor:
The links should be updated for the 2nd edition. Both links are correct in the draft PDF I am looking at now. The second link for "Large-scale Parallel Collaborative Filtering for the Netflix Prize" goes to http://dl.acm.org/citation.cfm?id=1424269 instead now.

Clem Wang  Jun 09, 2017
PDF
Page 62
2nd paragraph from the bottom

The description of the logic of the middle decision tree node is incorrect, and the text should be updated to match the diagram.

Replace this sentence:

If the date has passed by more than three days, I predict yes, it’s spoiled.

with:

If the date has passed, but that was three or fewer days ago, I take my chances and predict it's not spoiled.

Sean Owen

May 11, 2015  Aug 07, 2015
PDF
Page 71
Start of final paragraph

The paragraph should start with "The decision tree algorithm" but starts with "he decision tree algorithm".

Sean Owen

May 11, 2015  Aug 07, 2015
Printed, PDF
Page 72
2nd equation (6th from bottom)

the term log(1/p)

is missing the subscript i for p

In LaTeX, it should be:

$$I_{E}(p) = \sum_{i=1}^{N} p_i \log\left(\frac{1}{p_i}\right) = -\sum_{i=1}^{N} p_i \log(p_i)$$

Note from the Author or Editor:
Yes, you're right. I'll fix that for future printing.
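
As an illustration of the corrected formula, here is a minimal Scala sketch that is not taken from the book. The object and method names are hypothetical, the natural logarithm is an assumption (the erratum does not state a base), and terms with p_i = 0 are treated as contributing 0 by convention. It computes both forms of the equation so they can be compared.

```scala
object EntropySketch {
  // Entropy written as sum of p_i * log(1 / p_i), with the subscript on every p.
  // Zero-probability terms are skipped (assumed convention: 0 * log(1/0) = 0).
  def entropy(probs: Seq[Double]): Double =
    probs.filter(_ > 0.0).map(p => p * math.log(1.0 / p)).sum

  // The same quantity in the equivalent form -sum of p_i * log(p_i).
  def entropyNegated(probs: Seq[Double]): Double =
    -probs.filter(_ > 0.0).map(p => p * math.log(p)).sum
}
```

For a fair two-way split, `EntropySketch.entropy(Seq(0.5, 0.5))` equals `math.log(2.0)`, and both methods agree on any distribution up to floating-point rounding.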

Clem Wang  Jun 10, 2017
PDF
Page 92
Very end, continuing into page 93

From a reader report at https://github.com/sryza/aas/issues/33 :

On page 92 in calculating sumSquares, the code

val sumSquares = dataAsArray.fold(
  new Array[Double](numCols)
)(
  (a,b) => a.zip(b).map(t => t._1 + t._2 * t._2)
)

RDD.fold requires the operator to be commutative, but the function passed here is asymmetric (it squares only its second argument), so the result may differ depending on the number of partitions in the RDD.

Yes, this code should be replaced with a call to aggregate:

val sumSquares = dataAsArray.aggregate(
  new Array[Double](numCols)
)(
  (a, b) => a.zip(b).map(t => t._1 + t._2 * t._2),
  (a, b) => a.zip(b).map(t => t._1 + t._2)
)
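
To see why the partition count matters, the following sketch imitates the two RDD operations on plain local collections. It does not use Spark: `aggregateLike` and `foldLike` are hypothetical helpers that only assume fold reuses one function both within and across partitions, while aggregate takes a separate merge function.

```scala
object SumSquaresSketch {
  type Row = Array[Double]

  // Adds the squares of one row to a running per-column total.
  def seqOp(acc: Row, row: Row): Row =
    acc.zip(row).map { case (a, x) => a + x * x }

  // Merges two per-partition totals: plain addition, no squaring.
  def combOp(a: Row, b: Row): Row =
    a.zip(b).map { case (x, y) => x + y }

  // Imitates aggregate: seqOp within each partition, combOp across partitions.
  def aggregateLike(partitions: Seq[Seq[Row]], numCols: Int): Row =
    partitions
      .map(_.foldLeft(new Array[Double](numCols))(seqOp))
      .foldLeft(new Array[Double](numCols))(combOp)

  // Imitates fold: the same squaring function is reused to merge the
  // partition totals, so those totals get squared a second time.
  def foldLike(partitions: Seq[Seq[Row]], numCols: Int): Row =
    partitions
      .map(_.foldLeft(new Array[Double](numCols))(seqOp))
      .foldLeft(new Array[Double](numCols))(seqOp)
}
```

With the rows (1, 2), (3, 4), (5, 6), `aggregateLike` returns (35, 56) however the rows are split, whereas `foldLike` returns different wrong answers for one partition and for two.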

Sean Owen

Jul 17, 2015  Aug 07, 2015
Printed
Page 100
Middle of 3rd paragraph

There appears to be a confusing inconsistency in the assignment of rows and columns to terms and documents. In the middle of page 100 and the line passing from the bottom of page 100 to the top of page 101, rows represent terms and columns represent documents. But in the 5th paragraph of page 101 and through to the end of the chapter, rows are documents and columns are terms. See in particular the passage from the bottom of page 107 to the top of page 108.

Note from the Author or Editor:
Agree. I will forward to Sandy for a look. I think it may be best to change all references to refer to a "document-term" matrix where docs are rows.

John Boersma  Nov 04, 2015
PDF
Page 104
Last line of code on page

The code snippet refers to the file "stopwords.txt", but doesn't say where this file comes from. It is available at https://github.com/sryza/aas/blob/master/ch06-lsa/src/main/resources/stopwords.txt and this should be explicit in the text.

To address this, in the text that precedes the listing, after the sentence "The following snippet takes the RDD of plain-text documents and both lemmatizes it and filters out stop words:", instead end that sentence with a period and add the sentence:

Note that this code relies on a file of stopwords called stopwords.txt, which is available in the accompanying source code repo at https://github.com/sryza/aas/blob/master/ch06-lsa/src/main/resources/stopwords.txt and should be downloaded into the current working directory first:

Sean Owen

Jul 09, 2015  Aug 07, 2015
PDF
Page 104, 107
Code listings in each page

See https://github.com/sryza/aas/issues/34

in RunLSA.scala

error: value containsKey is not a member of scala.collection.immutable.Map[String,Int]
case (term, freq) => bTermToId.containsKey(term)

http://www.scala-lang.org/api/2.11.5/index.html#scala.collection.immutable.Map

looks like it should be "contains" instead of "containsKey"

On page 104, the following import needs to be added at the start of the code listing:

import scala.collection.JavaConversions._

On page 107, the same import can be removed from the listing. In addition, in that listing termFreqs.values().sum should become termFreqs.values.sum, and bTermIds.containsKey(term) should become bTermIds.contains(term)
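
The difference between the two APIs can be sketched with a plain Scala map. This is only an illustration: the map contents and the helper names are hypothetical, and no broadcast variable is involved.

```scala
object TermFilterSketch {
  // Hypothetical term-to-ID map standing in for the value held by the broadcast.
  val termToId: Map[String, Int] = Map("spark" -> 0, "hadoop" -> 1)

  // A Scala immutable Map has `contains`; `containsKey` belongs to java.util.Map.
  def keepKnown(termFreqs: Map[String, Int]): Map[String, Int] =
    termFreqs.filter { case (term, _) => termToId.contains(term) }

  // On a Scala Map, `values` is called without parentheses.
  def totalFreq(termFreqs: Map[String, Int]): Int =
    termFreqs.values.sum
}
```

Calling `TermFilterSketch.keepKnown(Map("spark" -> 3, "pig" -> 1))` keeps only the "spark" entry, because "pig" has no ID in the hypothetical map.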

Sean Owen

Jul 17, 2015  Aug 07, 2015
Printed
Page 107
United States

Just a suggestion.
For completeness, might be worth adding line:

though this is easily extrapolated by a reader who follows the code and/or can be looked up in the accompanying repo online.

Note from the Author or Editor:
Since the text is explaining the computation and broadcast of one data structure, bTermIds, I'd rather not inject a second one there.

However I don't think it would hurt to add a little text here as it does sort of feel like the next chunk of code should be executable as-is, but this necessary second broadcast is not mentioned. It is in the accompanying source.

Before "Finally, we tie it all together ...", add "Similarly, broadcast idfs as bIdfs." Code font for idfs and bIdfs.

Renat Bekbolatov  May 11, 2015  Aug 07, 2015
PDF
Page 107
Code listing at top of page

The listing at the top of the page does not define numDocs. See https://github.com/sryza/aas/issues/31

The suggested fix is to insert this line of code before the first line of this listing (beginning "val idfs =..."):

val numDocs = docTermFreqs.count()

Also, this listing needs a different small fix. The "toMap" at the end needs to be "collectAsMap()"

Sean Owen

Jul 15, 2015  Aug 07, 2015
PDF, ePub, Mobi
Page 107
SVD definition

The original text says:
V is a k x n matrix ...

It should be:
V is an n x k matrix ...

Explanation
If V is a k x n matrix, its transpose is an n x k matrix,
and the matrix multiplication U S Vt is not possible (unless n = k).
U S Vt
(m x k) x (k x k) x (n x k)

But if V is an n x k matrix, its transpose is a k x n matrix, and everything is ok!
U S Vt
(m x k) x (k x k) x (k x n)

Note from the Author or Editor:
That's correct. In the third bullet point following the equation M = U S VT, it should start with "VT is a k x n matrix ..." (that's V with superscript T)
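
The dimension argument can be checked mechanically. The sketch below is illustrative only: the object and its helpers are hypothetical and track matrix shapes, not matrix values.

```scala
object SvdShapeSketch {
  type Shape = (Int, Int) // (rows, columns)

  // Shape of the product A * B, or None when the inner dimensions differ.
  def multiply(a: Shape, b: Shape): Option[Shape] =
    if (a._2 == b._1) Some((a._1, b._2)) else None

  // Shape of U * S * VT, evaluated left to right.
  def productShape(u: Shape, s: Shape, vt: Shape): Option[Shape] =
    multiply(u, s).flatMap(multiply(_, vt))
}
```

With m = 5, k = 2, n = 3, `productShape((5, 2), (2, 2), (2, 3))` yields `Some((5, 3))`, while passing an n x k shape as the third factor, `productShape((5, 2), (2, 2), (3, 2))`, yields `None`.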

Carlos Pavia  Feb 07, 2015  Aug 07, 2015
PDF
Page 108
Code listing in middle of page

See https://github.com/sryza/aas/issues/36

termDocMatrix is not defined. It should actually be "vecs", defined on the previous page.

Sean Owen

Jul 20, 2015  Aug 07, 2015
Printed
Page 109
code block in the middle, 4th line from bottom

If reader only uses the book, variable "termId" will be misleading there, because it was earlier defined to be of type Map String->Int, whereas now it is the reverse of that.

Again, this could be found in the online repo and guessed, but readers will find it useful to have consistent variable names if they are only reading the book.

Note from the Author or Editor:
Yeah, they are different code blocks, and it's maybe clearer in the full source code, which works, but this does result in a conflict in listings.

I think the simplest clarification involves changing the previous code on page 107. Replace "termIds" with "termToId", and "bTermIds" with "bTermToId" on page 107. Then page 109 has no conflict.

I can make a parallel change in the source code in the repo.

Renat Bekbolatov  May 11, 2015  Aug 07, 2015
Printed
Page 109
last paragraph, 2nd~3rd line

We can find the terms relevant to each of the top concepts in a similar manner using Y, ...
->
We can find the documents relevant to each of the top concepts in a similar manner using Y, ...

Is it right?

Note from the Author or Editor:
Agreed, the text clearly shows using V for terms, and then U for documents. Your change is correct.

Edberg  Dec 29, 2015
Printed
Page 111
def wikiXml...

missing "None" line

Note from the Author or Editor:
Yes, after the line in the listing:

page.getTitle.contains("(disambiguation)")) {

and before

} else {

should appear a new line containing just:

None

It should be indented like "Some" below.

Renat Bekbolatov  May 11, 2015  Aug 07, 2015
PDF
Page 114, 117, 118
Various code listings

It seems like the variables idTerms and termIds may be used inconsistently in the code listings and in the accompanying source code. The idTerms variable is not shown in the listings in the book. Corresponding listings in the source code use termIds.

It seems like idTerms should be a map from ids to terms, and termIds should be a map from terms to ids, but the convention is largely reversed. Is that intentional? That's fine if so. But either way it needs to be consistent in the code and listings.

Sean Owen

Jul 09, 2015  Aug 07, 2015
Printed
Page 134
First line of second code block on the page

Two typos - or - One typo, if another line added. Explanation below.

A:

1) we probably want to use variable name "componentCounts" instead of "topComponentCounts" which was not introduced.

2) we should be looking up "componentCounts(1)._1" instead of "componentCounts(1)._2"

B:

It is possible that there was a missing line - but it is also not in the official repo:

val topComponentCounts = componentCounts.take(10).map(_.swap)

Note from the Author or Editor:
Yes, the best fix is to change

topComponentCounts(1)._2

to

componentCounts(1)._1

on page 134. I will fix the source code repo too.

Could we also add Renat Bekbolatov to the acknowledgements section as part of this erratum? Lots of good catches and deserves recognition.

Renat Bekbolatov  May 15, 2015  Aug 07, 2015
Printed
Page 134
second line above

while the second largest contains only 4% -> contains only 4

Is it right?

Note from the Author or Editor:

Edberg  Jan 02, 2016
PDF
Page 136
.

There is an error on pg. 136 of the latest release of Advanced Analytics with Spark.

In the marginal totals of the contingency table, the bottom row total should be "A total" rather than "B total".

Note from the Author or Editor:
Agree, though I believe it's the third column label of this table (on p. 138 in the final PDF) that should be "A Total" rather than third row label, given the following text.

Anonymous  Feb 19, 2015  Aug 07, 2015
Printed
Page 139
3rd line from bottom

"val inner = (YY * NN - YN * NY) - T / 2.0"

->

"val inner = YY * NN - YN * NY"

Note from the Author or Editor:
Yes, this code is inconsistent with the formula on the preceding page, page 138 at the bottom. However I think we should change the formula. The extra term here is the Yates continuity correction and is probably the right version of the chi-squared test to show people.

So, on page 138, the numerator of the formula should add two things: absolute value around the current inner product, and then a "- T / 2" term, to read:

(|YY * NN − YN * NY| - T / 2)^2

Immediately following, before the sentence "If our samples are...", we should insert a clarifying remark:

Note that this formulation of the chi-squared statistic includes a term "- T / 2", which is Yates's continuity correction (http://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity) and not included in some formulations of the chi-squared statistic.

Then this line of code on 139:

val inner = (YY*NN-YN*NY) - T / 2.0

needs to be

val inner = math.abs(YY*NN-YN*NY) - T / 2.0
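
For illustration, here is a self-contained sketch of the corrected statistic. It is not the book's listing: the function takes the four cells of the 2x2 table directly, which is an assumed parameter layout, and the names are hypothetical. It shows why the absolute value matters once the "- T / 2" term is present.

```scala
object ChiSquaredSketch {
  // Chi-squared statistic for a 2x2 table with Yates's continuity correction.
  // yy, yn, ny, nn are the four cell counts (assumed layout, not the book's signature).
  def chiSqYates(yy: Double, yn: Double, ny: Double, nn: Double): Double = {
    val t = yy + yn + ny + nn
    // Absolute value first, then the correction, matching the corrected formula.
    val inner = math.abs(yy * nn - yn * ny) - t / 2.0
    val rowY = yy + yn
    val rowN = ny + nn
    val colY = yy + ny
    val colN = yn + nn
    t * inner * inner / (rowY * rowN * colY * colN)
  }
}
```

With cells 10, 20, 30, 40 the difference YY * NN - YN * NY is negative (400 - 600). Taking the absolute value makes the result identical to that of the column-swapped table 20, 10, 40, 30, which it would not be if the correction were subtracted from the signed difference.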

Renat Bekbolatov  May 15, 2015  Aug 07, 2015
PDF
Page 156
Code segment at end of page

The code has some typos in it:
the distance() method is missing the beginning curly brace and the call inside it to GeometryEngine.distance() is missing the ending paren

Note from the Author or Editor:
Yes, the distance function should read:

def distance(other: Geometry): Double = {
  GeometryEngine.distance(geometry, other, spatialReference)
}

Chaz Chandler  Apr 19, 2015  Aug 07, 2015
PDF
Page 159
Code block at beginning of page

The code blocks described in this section are only a portion of the code necessary. If the intent of the text is for the reader to follow along in the console while reading, then the reader will be stuck without referencing the full code on GitHub (https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/GeoJson.scala). It may be useful to point out that this is just an illustrative excerpt, since adding the full code listing may unnecessarily add to the length of the text. Also, the implicit declarations in the text aren't nested within the GeoJsonProtocol object like they are in the GitHub code.

Note from the Author or Editor:
I agree. At "Esri Geometry API to represent the longitude and latitude of the pickup and dropoff locations:", let's remove the colon and finish the sentence with a period, then add, "Note that the code listings below are only illustrative extracts from the complete code that you will need to execute to follow along with this chapter. Please refer to the accompanying Chapter 8 source code repository, in particular GeoJson.scala."

Chaz Chandler  Apr 19, 2015  Aug 07, 2015
PDF
Page 165
first line of top code block

The first import statement should read "import com.cloudera.datascience.geotime._" (ie, datascience instead of science and geotime instead of geojson). See https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/GeoJson.scala#L6

Note from the Author or Editor:

Chaz Chandler  Apr 19, 2015  Aug 07, 2015
Printed
Page 165
(1) First sentence, (2) 5th line from below

A couple of minor suggestions:

(1) First sentence:
"...we need to use ... tools ... in[to] the Spark shell ..."

(2) 5th line from below:

It is not exactly clear what "frs sequence" stands for, but of course, from context we can guess it is about areaSortedFeatures.

Note from the Author or Editor:
On page 165, opening sentence, change "Now we need to use" to "Now we need to import"

Page 165, "in the frs sequence" should be "in the areaSortedFeatures sequence"

Renat Bekbolatov  May 19, 2015  Aug 07, 2015
Printed
Page 170
First line of code block on the bottom of the page

Small typo:

previously unseen "bdrdd" -> "boroughDurations"

Note from the Author or Editor:
Correct, bdrdd should read boroughDurations in the final snippet on 170.

Renat Bekbolatov  May 19, 2015  Aug 07, 2015
Printed
Page 177
code segment at the bottom

Just a tiny thing, might be worth noting.
Using the current version of codebase and the way it is described in the book, the script actually gets different instruments - not ones that track S&P 500 or Nasdaq index values. (That might also explain unexpected lower correlation numbers between these two on page 185.)

Also a couple of other very minor typos to fix in future prints:
p. 175, "Variance-Covariance" section, last words: "...deriving a[n] estimate ..."
p. 176, "Our Model" section, end of first paragraph: "... of possibl[e|y] different ..."

Note from the Author or Editor:

Renat Bekbolatov  May 21, 2015  Aug 07, 2015
Printed, PDF
Page 180
Bottom two code snippets

Errors:
val stocks: Seq[Array[Double]] =
Should be:
val stocks: Seq[Array[(DateTime, Double)]] =

Error:
val factors: Seq[Array[Double]] =
Should be:
val factors: Seq[Array[(DateTime, Double)]] =

Note from the Author or Editor:
Yes, the type is incorrect and should be Seq[Array[(DateTime, Double)]] in both cases. In the accompanying source code, there's no problem since the type is simply left off. I think it's best to just match that.

The type declaration ": Seq[Array[Double]]" can be deleted in both occurrences, leaving just "val stocks = ..." and "val factors = ..."

Dr Zach Izham  May 15, 2015  Aug 07, 2015
Printed
Page 195
Quote, Ch. 10

What is SCHPON[...]?

Note from the Author or Editor:
It's shorthand for Sulfur, Carbon, Hydrogen, Phosphorus, Oxygen, Nitrogen. I'll try to stick in a footnote.

Edberg  Feb 15, 2016
PDF
Page 199
Last command on page

https://github.com/sryza/aas/issues/38

The URL below doesn't work anymore:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

It appears to now be at:

ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

There's a UK mirror at:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

Sean Owen

Jul 24, 2015  Aug 07, 2015
PDF
Page 199
First line of code

The "adamLoad" method used in the text and source was removed after adam 0.16.0, which the text uses. The source repo depends on 0.16.0, but here in the "git clone" command it's important to also checkout 0.16.0:

Sean Owen

Aug 03, 2015
PDF
Page 199
2nd block of code

should be

or

This works in bash, sh and ksh

In tcsh or csh it should be

Note from the Author or Editor:
Yes, the "export ADAM_HOME=..." is correct.

David G Pisano  Sep 22, 2015
PDF
Page 222

The location of the resource hyperlinked as python/thunder/utils/data/fish/tif-stack has changed. It should be:

https://github.com/thunder-project/thunder/tree/v0.4.1/python/thunder/utils/data/fish/tif-stack

Sean Owen

May 01, 2015  Aug 07, 2015
PDF
Page 240
2nd & 3rd paragraph

Thanks for fixing the Accumulator docs.

2nd paragraph (the new one): "but its output are lots" -> lost

3rd paragraph: This paragraph is very hard to understand and parse for me. Especially now that the previous paragraph basically says the opposite.

Note from the Author or Editor:
Indeed lots -> lost needs to be fixed.
Sandy, it's up to you whether you want to revise the 3rd paragraph. The link may be: "For cases where these behaviors are acceptable, accumulators can be a big win, because ..."

Lars Francke  Aug 18, 2015
PDF
Page 240
2nd paragraph

"A task only contributes to the accumulator the first time it runs. For example, if a task completes successfully, but its outputs are lost and it needs to be rerun, it will not increment the accumulator again."

That's wrong. It is only true for actions. When using Accumulators in transformations they may be incremented multiple times. This can happen for various reasons (reuse of RDD, task failure, task rerun etc.)

Note from the Author or Editor:
Yes I think this at least has to be converted to a note explaining that this could over-count as implemented here. Better would be to convert the example to make the accumulator updates inside aggregate() instead somehow, as that's correct. Sandy WDYT?

Lars Francke  Jul 30, 2015  Aug 07, 2015
Printed
Page 242
last line before section beginning

To make it work by default (without implicit conversions):

val (train, test) = ...

to

val Array(train, test) = ...

Note from the Author or Editor:
Yes, this needs to be

val Array(train, test) = ...

to work "out of the box". This line of code wasn't actually in the repo as it was just an example, so didn't catch it as a compiler error.
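
A small sketch of the difference, using plain Scala collections. It assumes the elided right-hand side returns an Array (as, for instance, RDD.randomSplit does); the `split` helper and its 80/20 cut are hypothetical stand-ins.

```scala
object SplitSketch {
  // Hypothetical stand-in for a method that returns its pieces as an Array.
  def split(data: Seq[Int]): Array[Seq[Int]] = {
    val (head, tail) = data.splitAt(data.length * 8 / 10)
    Array(head, tail)
  }

  // An Array is not a tuple, so it is destructured with an Array(...) pattern.
  // Writing `val (train, test) = split(...)` would not compile without a conversion.
  val Array(train, test) = split(1 to 10)
}
```

Here `SplitSketch.train` holds the first eight elements and `SplitSketch.test` the remaining two.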

Renat Bekbolatov  May 11, 2015  Aug 07, 2015
Printed
Page 242
first line

from "Swap the order of the tuples to sort on the numbers instead of the counts"

I think that example code in previous page sorts data on the counts instead of the numbers. Am I incorrect?

Note from the Author or Editor:
You're right that this is what the example on the previous page does, and it's also what this example does. Really, the bullet text should be different, to explain that it's doing this to still sort on count. I will adjust it.

Edberg  Feb 12, 2016
ePub
Page 314
1st paragraph in "Document-Document Relevance"

"where u sub-i is the row in U corresponding to term i". I think that "term" should be "document", since U is the document space.

Note from the Author or Editor:
Yes, it should read "document i".

Brent Schneeman  Apr 14, 2015  Aug 07, 2015