Errata for Advanced Analytics with Spark

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Color Key: Serious Technical Mistake | Minor Technical Mistake | Language or formatting error | Typo | Question | Note | Update

Version Location Description Submitted By Date Submitted Date Corrected
Safari Books Online
Chapter 9 Section Running the Trials

The line

val trials = seedRdd.flatMap(trialReturns(_, numTrials / parallelism, bFactorWeights.value, factorMean

fails with: org.apache.spark.SparkException: Task not serializable

Note from the Author or Editor:
More discussion, including a potential workaround, is at https://github.com/sryza/aas/issues/64. Let's move the discussion there.

Ranko Mosic  Mar 08, 2016
PDF
Page 13
2nd Paragraph, curl command

The curl command used is curl -o donation.zip http://bit.ly/1Aoywaq. bit.ly responds with a 301 redirect, which curl on my system (curl 7.37.1, x86_64-apple-darwin14.0) does not follow by default. To alleviate this, the command used should be curl -L -o donation.zip http://bit.ly/1Aoywaq

Note from the Author or Editor:

whaley  Apr 10, 2015  Aug 07, 2015
PDF
Page 28
5th line from bottom

In the text: "... decompressing and then serializing the results, and finally, performing computations on the aggregated data", the word "serializing" should be "deserializing".

Sean Owen

May 11, 2015  Aug 07, 2015
ePub
Page 38%
1

Hi, the Wikipedia dump files appear to be corrupt:

$ curl -s -L http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2 \
    | bzip2 -cd \
    | ~/hadoop-2.7.0/bin/hadoop fs -put - wikidump.xml

bzip2: Data integrity error when decompressing.
    Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.

Same thing when I try to download and extract them from Firefox:

ubuntu@ip-10-0-1-186:/data$ bzip2 -d enwiki-latest-pages-articles-multistream.xml.bz2
bzip2: Data integrity error when decompressing.
    Input file = enwiki-latest-pages-articles-multistream.xml.bz2, output file = enwiki-latest-pages-articles-multistream.xml
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
bzip2: Deleting output file enwiki-latest-pages-articles-multistream.xml, if it exists.

ubuntu@ip-10-0-1-186:/data$ ls
enwiki-latest-pages-articles-multistream.xml.bz2  lost+found

ubuntu@ip-10-0-1-186:/data$ bzip2 -dtvv enwiki-latest-pages-articles-multistream.xml.bz2
  enwiki-latest-pages-articles-multistream.xml.bz2:
    [1: huff+mtf rt+rld] [1: huff+mtf rt+rld] [2: huff+mtf rt+rld] [1: huff+mtf rt+rld]
    [2: huff+mtf rt+rld] [3: huff+mtf rt+rld] [4: huff+mtf rt+rld] [5: huff+mtf rt+rld]
    [1: huff+mtf data integrity (CRC) error in data
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.

ubuntu@ip-10-0-1-186:/data$ bzip2recover enwiki-latest-pages-articles-multistream.xml.bz2
bzip2recover 1.0.6: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
   block 1 runs from 80 to 4640
   block 2 runs from 4808 to 1948251
   block 3 runs from 1948300 to 3752034
   block 4 runs from 3752200 to 5832866
   block 5 runs from 5832915 to 7818462
   block 6 runs from 7818511 to 9886990
   ...

Any ideas?

Note from the Author or Editor:
Hm, you're right. It seems like the dumps starting with April 3 have this problem; March 4 seems OK. We should change the text to refer to that specific version. On page 102, the URL should change latest -> 20150304 in two places:

$ curl -s -L http://dumps.wikimedia.org/enwiki/20150304/\
enwiki-20150304-pages-articles-multistream.xml.bz2 \
...

David Laxer  May 26, 2015  Aug 07, 2015
Printed, PDF
Page 43
lines 4 and 5

There are two dead links:
1. "Collaborative Filtering for Implicit Feedback Datasets", shortener http://bit.ly/1ALoX4q, which goes to https://research.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf and is now 404. The paper can now be found here: http://yifanhu.net/PUB/cf.pdf
2. "Large-scale Parallel Collaborative Filtering for the Netflix Prize", shortener http://bit.ly/16im1AT, which now goes to https://www.labs.hpe.com/about. The paper can now be found here: https://endymecy.gitbooks.io/spark-ml-source-analysis/content/%E6%8E%A8%E8%8D%90/papers/Large-scale%20Parallel%20Collaborative%20Filtering%20the%20Netflix%20Prize.pdf

Note from the Author or Editor:
The links should be updated for the 2nd edition. Both links are correct in the draft PDF I am looking at now. The second link, for "Large-scale Parallel Collaborative Filtering for the Netflix Prize", goes to http://dl.acm.org/citation.cfm?id=1424269 instead now.

Clem Wang  Jun 09, 2017
PDF
Page 62
2nd paragraph from the bottom

The description of the logic of the middle decision tree node is incorrect, and the text should be updated to match the diagram. Replace this sentence:
If the date has passed by more than three days, I predict yes, it's spoiled.
with:
If the date has passed, but that was three or fewer days ago, I take my chances and predict it's not spoiled.

Sean Owen  May 11, 2015  Aug 07, 2015
PDF
Page 71
Start of final paragraph

The paragraph should start with "The decision tree algorithm" but starts with "he decision tree algorithm".

Sean Owen  May 11, 2015  Aug 07, 2015
Printed, PDF
Page 72
2nd equation (6th from bottom)

The term log(1/p) is missing the subscript i for p. In LaTeX, it should be:

$$I_{E}(p) = \sum_{i=1}^{N} p_i \log\left(\frac{1}{p_i}\right) = -\sum_{i=1}^{N} p_i \log(p_i)$$

Note from the Author or Editor:
Yes, you're right. I'll fix that for a future printing.

Clem Wang  Jun 10, 2017
PDF
Page 92
Very end, continuing into page 93

From a reader report at https://github.com/sryza/aas/issues/33: On page 92, in calculating sumSquares, the code is

val sumSquares = dataAsArray.fold(
    new Array[Double](numCols)
  )(
    (a, b) => a.zip(b).map(t => t._1 + t._2 * t._2)
  )

RDD.fold requires the operator to be commutative, which is violated by the asymmetry in the map() function, so the result may differ depending on the number of partitions in the RDD.

Yes, this code should be replaced with a call to aggregate:

val sumSquares = dataAsArray.aggregate(
    new Array[Double](numCols)
  )(
    (a, b) => a.zip(b).map(t => t._1 + t._2 * t._2),
    (a, b) => a.zip(b).map(t => t._1 + t._2)
  )

Sean Owen  Jul 17, 2015  Aug 07, 2015
Printed
Page 100
Middle of 3rd paragraph

There appears to be a confusing inconsistency in the assignment of rows and columns to terms and documents. In the middle of page 100, and in the line passing from the bottom of page 100 to the top of page 101, rows represent terms and columns represent documents. But in the 5th paragraph of page 101, and through to the end of the chapter, rows are documents and columns are terms. See in particular the passage from the bottom of page 107 to the top of page 108.

Note from the Author or Editor:
Agree. I will forward to Sandy for a look. I think it may be best to change all references to refer to a "document-term" matrix where documents are rows.

John Boersma  Nov 04, 2015
PDF
Page 104
Last line of code on page

The code snippet refers to the file "stopwords.txt" but doesn't say where this file comes from. It is available at https://github.com/sryza/aas/blob/master/ch06-lsa/src/main/resources/stopwords.txt, and this should be explicit in the text. To address this, in the text that precedes the listing, after the sentence "The following snippet takes the RDD of plain-text documents and both lemmatizes it and filters out stop words:", instead end that sentence with a period and add the sentence: Note that this code relies on a file of stopwords called stopwords.txt, which is available in the accompanying source code repo at https://github.com/sryza/aas/blob/master/ch06-lsa/src/main/resources/stopwords.txt and should be downloaded into the current working directory first:

Sean Owen  Jul 09, 2015  Aug 07, 2015
PDF
Page 104, 107
Code listings on each page

See https://github.com/sryza/aas/issues/34. In RunLSA.scala:

error: value containsKey is not a member of scala.collection.immutable.Map[String,Int]
  case (term, freq) => bTermToId.containsKey(term)

Per http://www.scala-lang.org/api/2.11.5/index.html#scala.collection.immutable.Map, it looks like it should be "contains" instead of "containsKey". On page 104, the following import needs to be added at the start of the code listing: import scala.collection.JavaConversions._ On page 107, the same import can be removed from the listing. In addition, in that listing, termFreqs.values().sum should become termFreqs.values.sum, and bTermIds.containsKey(term) should become bTermIds.contains(term).

Sean Owen  Jul 17, 2015  Aug 07, 2015
Printed
Page 107
United States

Just a suggestion. For completeness, it might be worth adding the line:
val bIdfs = sc.broadcast(idfs).value
though this is easily extrapolated by a reader who follows the code and/or can be looked up in the accompanying repo online.

Note from the Author or Editor:
Since the text is explaining the computation and broadcast of one data structure, bTermIds, I'd rather not inject a second one there. However, I don't think it would hurt to add a little text here, as it does sort of feel like the next chunk of code should be executable as-is, but this necessary second broadcast is not mentioned. It is in the accompanying source. Before "Finally, we tie it all together ...", add "Similarly, broadcast idfs as bIdfs." Code font for idfs and bIdfs.

Renat Bekbolatov  May 11, 2015  Aug 07, 2015
PDF
Page 107
Code listing at top of page

The listing at the top of the page does not define numDocs. See https://github.com/sryza/aas/issues/31. The suggested fix is to insert this line of code before the first line of this listing (beginning "val idfs = ..."):
val numDocs = docTermFreqs.count()
Also, this listing needs a different small fix: the "toMap" at the end needs to be "collectAsMap()".

Sean Owen  Jul 15, 2015  Aug 07, 2015
PDF, ePub, Mobi
Page 107
SVD definition

The original text says: V is a k x n matrix ... It should be: V is an n x k matrix ...
Explanation: If V is a k x n matrix, its transpose is an n x k matrix, and the matrix multiplication U S Vt is not possible (unless n = k):
U S Vt : (m x k) x (k x k) x (n x k)
But if V is an n x k matrix, its transpose is a k x n matrix, and everything is OK:
U S Vt : (m x k) x (k x k) x (k x n)

Note from the Author or Editor:
That's correct. In the third bullet point following the equation M = U S VT, it should start with "VT is a k x n matrix ..." (that's V with superscript T).

Carlos Pavia  Feb 07, 2015  Aug 07, 2015
PDF
Page 108
Code listing in middle of page

See https://github.com/sryza/aas/issues/36. termDocMatrix is not defined. It should actually be "vecs", defined on the previous page.

Sean Owen  Jul 20, 2015  Aug 07, 2015
Printed
Page 109
code block in the middle, 4th line from bottom

If the reader only uses the book, the variable "termId" will be misleading there, because it was earlier defined to be of type Map[String, Int], whereas now it is the reverse of that. Again, this could be found in the online repo and guessed, but readers will find it useful to have consistent variable names if they are only reading the book.

Note from the Author or Editor:
Yeah, they are different code blocks, and it's maybe clearer in the full source code, which works, but this does result in a conflict in listings. I think the simplest clarification involves changing the previous code on page 107. Replace "termIds" with "termToId", and "bTermIds" with "bTermToId" on page 107. Then page 109 has no conflict. I can make a parallel change in the source code in the repo.

Renat Bekbolatov  May 11, 2015  Aug 07, 2015
Printed
Page 109
last paragraph, 2nd~3rd line

"We can find the terms relevant to each of the top concepts in a similar manner using Y, ..." should be "We can find the documents relevant to each of the top concepts in a similar manner using Y, ...". Is that right?

Note from the Author or Editor:
Agreed; the text clearly shows using V for terms, and then U for documents. Your change is correct.

Edberg  Dec 29, 2015
Printed
Page 111
def wikiXml...

Missing "None" line.

Note from the Author or Editor:
Yes, after the line in the listing:
page.getTitle.contains("(disambiguation)")) {
and before
} else {
should appear a new line containing just:
None
It should be indented like "Some" below.

Renat Bekbolatov  May 11, 2015  Aug 07, 2015
PDF
Page 114, 117, 118
Various code listings

From a reader report: It seems like the variables idTerms and termIds may be used inconsistently in the code listings and in the accompanying source code. The idTerms variable is not shown in the listings in the book; corresponding listings in the source code use termIds. It seems like idTerms should be a map from ids to terms, and termIds from terms to ids, but the convention is largely reversed. Is that intentional? That's fine if so, but either way it needs to be consistent between the code and the listings.

Sean Owen  Jul 09, 2015  Aug 07, 2015
Printed
Page 134
First line of second code block on the page

Two typos, or one typo if another line is added. Explanation below.
A: 1) We probably want to use the variable name "componentCounts" instead of "topComponentCounts", which was not introduced. 2) We should be looking up "componentCounts(1)._1" instead of "componentCounts(1)._2".
B: It is possible that a line is missing, but it is also not in the official repo:
val topComponentCounts = componentCounts.take(10).map(_.swap)

Note from the Author or Editor:
Yes, the best fix is to change topComponentCounts(1)._2 to componentCounts(1)._1 on page 134. I will fix the source code repo too. Could we also add Renat Bekbolatov to the acknowledgements section as part of this erratum? Lots of good catches that deserve recognition.

Renat Bekbolatov  May 15, 2015  Aug 07, 2015
Printed
Page 134
second line above

"while the second largest contains only 4%" -> "contains only 4". Is that right?

Note from the Author or Editor:
Yes, it should read "4 vertices" instead of "4%" for clarity.

Edberg  Jan 02, 2016
PDF
Page 136

There is an error on pg. 136 of the latest release of Advanced Analytics with Spark. In the marginal totals on the contingency table, the bottom row total should be "A total" rather than "B total".

Note from the Author or Editor:
Agree, though I believe it's the third column label of this table (on p. 138 in the final PDF) that should be "A Total" rather than the third row label, given the following text.

Anonymous  Feb 19, 2015  Aug 07, 2015
Printed
Page 139
3rd line from bottom

"val inner = (YY * NN - YN * NY) - T / 2.0" -> "val inner = YY * NN - YN * NY"

Note from the Author or Editor:
Yes, this code is inconsistent with the formula on the preceding page, page 138, at the bottom. However, I think we should change the formula. The extra term here is the Yates continuity correction and is probably the right version of the chi-squared test to show people. So, on page 138, the numerator of the formula should add two things: absolute value around the current inner product, and then a "- T / 2" term, to read:
(|YY * NN − YN * NY| - T / 2)^2
Immediately following, before the sentence "If our samples are...", we should insert a clarifying remark: Note that this formulation of the chi-squared statistic includes a term "- T / 2", which is Yates's continuity correction (http://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity) and is not included in some formulations of the chi-squared statistic. Then this line of code on page 139:
val inner = (YY*NN-YN*NY) - T / 2.0
needs to be
val inner = math.abs(YY*NN-YN*NY) - T / 2.0

Renat Bekbolatov  May 15, 2015  Aug 07, 2015
PDF
Page 156
Code segment at end of page

The code has some typos in it: the distance() method is missing the beginning curly brace, and the call inside it to GeometryEngine.distance() is missing the ending paren.

Note from the Author or Editor:
Yes, the distance function should read:
def distance(other: Geometry): Double = {
  GeometryEngine.distance(geometry, other, spatialReference)
}

Chaz Chandler  Apr 19, 2015  Aug 07, 2015
PDF
Page 159
Code block at beginning of page

The code blocks described in this section are only a portion of the code necessary. If the intent of the text is for the reader to follow along in the console while reading, then the reader will be stuck without referencing the full code on GitHub (https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/GeoJson.scala). It may be useful to point out that this is just an illustrative excerpt, since adding the full code listing may unnecessarily add to the length of the text. Also, the implicit declarations in the text aren't nested within the GeoJsonProtocol object like they are in the GitHub code.

Note from the Author or Editor:
I agree. At "Esri Geometry API to represent the longitude and latitude of the pickup and dropoff locations:", let's remove the colon and finish the sentence with a period, then add, "Note that the code listings below are only illustrative extracts from the complete code that you will need to execute to follow along with this chapter. Please refer to the accompanying Chapter 8 source code repository, in particular GeoJson.scala."

Chaz Chandler  Apr 19, 2015  Aug 07, 2015
PDF
Page 165
first line of top code block

The first import statement should read "import com.cloudera.datascience.geotime._" (i.e., datascience instead of science, and geotime instead of geojson). See https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/GeoJson.scala#L6

Note from the Author or Editor:
Yes, it should read "import com.cloudera.datascience.geotime._" instead of "import com.cloudera.science.geojson._"

Chaz Chandler  Apr 19, 2015  Aug 07, 2015
Printed
Page 165
(1) First sentence, (2) 5th line from below

A couple of minor suggestions: (1) First sentence: "...we need to use ... tools ... in[to] the Spark shell ..." (2) 5th line from below: It is not exactly clear what "frs sequence" stands for, but of course, from context we can guess it is about areaSortedFeatures.

Note from the Author or Editor:
On page 165, opening sentence, change "Now we need to use" to "Now we need to import". Also on page 165, "in the frs sequence" should be "in the areaSortedFeatures sequence".

Renat Bekbolatov  May 19, 2015  Aug 07, 2015
Printed
Page 170
First line of code block at the bottom of the page

Small typo: the previously unseen "bdrdd" should be "boroughDurations".

Note from the Author or Editor:
Correct, bdrdd should read boroughDurations in the final snippet on page 170.

Renat Bekbolatov  May 19, 2015  Aug 07, 2015
Printed
Page 177
code segment at the bottom

Just a tiny thing that might be worth noting. Using the current version of the codebase, the way it is described in the book, the script actually gets different instruments, not ones that track S&P 500 or Nasdaq index values. (That might also explain the unexpectedly low correlation numbers between these two on page 185.) Also, a couple of other very minor typos to fix in future printings: p. 175, "Variance-Covariance" section, last words: "...deriving a[n] estimate ..."; p. 176, "Our Model" section, end of first paragraph: "... of possibl[e|y] different ...".

Note from the Author or Editor:
The typo fixes are confirmed, yes. I'll ask Sandy to look at the download.

Renat Bekbolatov  May 21, 2015  Aug 07, 2015
Printed, PDF
Page 180
Bottom two code snippets

Errors:
val stocks: Seq[Array[Double]] =
should be:
val stocks: Seq[Array[(DateTime, Double)]] =
and:
val factors: Seq[Array[Double]] =
should be:
val factors: Seq[Array[(DateTime, Double)]] =

Note from the Author or Editor:
Yes, the type is incorrect and should be Seq[Array[(DateTime, Double)]] in both cases. In the accompanying source code there's no problem, since the type is simply left off. I think it's best to just match that: the type declaration ": Seq[Array[Double]]" can be deleted in both occurrences, leaving just "val stocks = ..." and "val factors = ..."

Dr Zach Izham  May 15, 2015  Aug 07, 2015
Printed
Page 195
Quote, Ch. 10

What is SCHPON[...]?

Note from the Author or Editor:
It's shorthand for Sulfur, Carbon, Hydrogen, Phosphorus, Oxygen, Nitrogen. I'll try to stick in a footnote.

Edberg  Feb 15, 2016
PDF
Page 199
Last command on page

https://github.com/sryza/aas/issues/38 The URL below doesn't work anymore:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
It appears to now be at:
ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
There's a UK mirror at:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

Sean Owen  Jul 24, 2015  Aug 07, 2015
PDF
Page 199
First line of code

The "adamLoad" method used in the text and source was removed after ADAM 0.16.0, which the text uses. The source repo depends on 0.16.0, but here, in the "git clone" command, it's important to also check out 0.16.0:
git clone -b adam-parent-0.16.0 https://github.com/bigdatagenomics/adam.git

Sean Owen  Aug 03, 2015
PDF
Page 199
2nd block of code

exportADAM_HOME=path/to/adam
should be
export ADAM_HOME=path/to/adam
or
ADAM_HOME=path/to/adam; export ADAM_HOME
(no dollar sign in any case). This works in bash, sh, and ksh. In tcsh or csh, it should be:
setenv ADAM_HOME path/to/adam

Note from the Author or Editor:
Yes, the "export ADAM_HOME=..." is correct.

David G Pisano  Sep 22, 2015
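To make the correction concrete, the Bourne-shell form from the erratum is runnable as-is (path/to/adam is the placeholder path from the book, not a real install location):

```shell
# bash/sh/ksh: assign and export in one statement; note there is no
# dollar sign on the left-hand side of the assignment
export ADAM_HOME=path/to/adam
echo "$ADAM_HOME"
```

The two-statement form, ADAM_HOME=path/to/adam; export ADAM_HOME, is equivalent in those shells; tcsh and csh use setenv instead.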
PDF
Page 222

The location of the resource hyperlinked as python/thunder/utils/data/fish/tif-stack has changed. It should be: https://github.com/thunder-project/thunder/tree/v0.4.1/python/thunder/utils/data/fish/tif-stack

Sean Owen

May 01, 2015  Aug 07, 2015
PDF
Page 240
2nd paragraph

"A task only contributes to the accumulator the first time it runs. For example, if a task completes successfully, but its outputs are lost and it needs to be rerun, it will not increment the accumulator again." That's wrong. It is only true for actions. When accumulators are used in transformations, they may be incremented multiple times. This can happen for various reasons (reuse of an RDD, task failure, task rerun, etc.). https://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka

Note from the Author or Editor:
Yes, I think this at least has to be converted to a note explaining that it could over-count as implemented here. Better still would be to move the accumulator updates inside aggregate() somehow, as that is correct. Sandy, WDYT?

Lars Francke  Jul 30, 2015  Aug 07, 2015
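The over-counting described above is easy to reproduce with a minimal sketch (hypothetical RDD name rawLines; assumes a live SparkContext sc and the Spark 1.x accumulator API that the book targets):

```scala
// Accumulator updated inside a transformation (map), not an action:
val blankLines = sc.accumulator(0)

val lengths = rawLines.map { line =>
  if (line.isEmpty) blankLines += 1  // runs every time the task executes
  line.length
}

lengths.count()  // first action: tasks run, blankLines incremented
lengths.count()  // RDD not cached, so tasks re-run and increment again
```

Because lengths is recomputed for the second action, blankLines.value ends up twice what a naive reading of the paragraph would suggest; only updates performed inside actions are guaranteed to be applied exactly once.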
PDF
Page 240
2nd & 3rd paragraph

Thanks for fixing the Accumulator docs. 2nd paragraph (the new one): "but its output are lots" -> "lost". 3rd paragraph: this paragraph is very hard for me to understand and parse, especially now that the previous paragraph basically says the opposite.

Note from the Author or Editor:
Indeed, "lots" -> "lost" needs to be fixed. Sandy, it's up to you whether you want to revise the 3rd paragraph. The linking sentence may be: "For cases where these behaviors are acceptable, accumulators can be a big win, because ..."

Lars Francke  Aug 18, 2015
Printed
Page 242
last line before section beginning

To make it work by default (without implicit conversions), change val (train, test) = ... to val Array(train, test) = ...

Note from the Author or Editor:
Yes, this needs to be val Array(train, test) = ... to work "out of the box". This line of code wasn't actually in the repo, as it was just an example, so we didn't catch it as a compiler error.

Renat Bekbolatov  May 11, 2015  Aug 07, 2015
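The reason for the Array pattern: randomSplit returns an Array[RDD[T]], not a tuple. A minimal sketch (data is a hypothetical RDD; weights are illustrative):

```scala
// randomSplit returns Array[RDD[T]], so destructure with an Array pattern:
val Array(train, test) = data.randomSplit(Array(0.9, 0.1))

// A tuple pattern, val (train, test) = ..., expects a Tuple2 and does not
// match an Array without an implicit conversion, hence the original error.
```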
Printed
Page 242
first line

Code explanation (1) of the first line on this page says "Swap the order of the tuples to sort on the numbers instead of the counts". I think the example code on the previous page sorts the data on the counts instead of the numbers. Am I incorrect?

Note from the Author or Editor:
You're right that this is what the example on the previous page does, and it's also what this example does. Really, the bullet text should be different: it should explain that the swap is done in order to still sort on count. I will adjust it.

Edberg  Feb 12, 2016
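The swap trick the bullet refers to can be seen with plain Scala collections (illustrative data; sorting an RDD of swapped pairs with sortByKey behaves analogously):

```scala
val wordCounts = Seq(("spark", 3), ("rdd", 7), ("scala", 3))

// Swapping each (word, count) to (count, word) makes the count the
// primary sort key; ties then fall back to the word:
val byCount = wordCounts.map(_.swap).sorted
// byCount: List((3,"scala"), (3,"spark"), (7,"rdd"))
```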
ePub
Page 314
1st paragraph in "Document-Document Relevance"

"where u sub-i is the row in U corresponding to term i". I think that "term" should be "document", since U is the document space.

Note from the Author or Editor:
Yes, it should read "document i".

Brent Schneeman  Apr 14, 2015  Aug 07, 2015