Learning Spark

Errata for Learning Spark



The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.




Version Location Description Submitted By Date Submitted Date Corrected
Safari Books Online
http://techbus.safaribooksonline.com/9781449359034/subsec_passing_functions_html#example3-21
Example 3-21

In Example 3-21, return types for the getMatches* methods are incorrect. getMatchesFunctionReference() should return RDD[Boolean], and getMatchesFieldReference() and getMatchesNoReference() should either return RDD[Array[String]] or change the implementation to use flatMap instead of map.

Note from the Author or Editor:
I've updated the return types, thanks for catching this.

Anonymous  Feb 19, 2015  Mar 27, 2015
PDF
Page vii
2nd paragraph

duplicated wording READ: "You’ll learn how to learn how to download..." SHOULD READ: "You’ll learn how to download..."

Note from the Author or Editor:
Fixed in fe6dc3e1dd493a83464e115a4309ab806cf240cb

Ricardo Almeida  Oct 08, 2014  Jan 26, 2015
PDF
Page 9, 10
P9 - Downloading Spark: P1; P10 - 1st paragraph after the notes

Page 9 has the following text: "This will download a compressed tar file, or “tarball,” called spark-1.1.0-bin-hadoop1.tgz." On page 10, a different tarball is referenced:

cd ~
tar -xf spark-1.1.0-bin-hadoop2.tgz
cd spark-1.1.0-bin-hadoop2

Kevin D'Elia  Oct 20, 2014  Jan 26, 2015
PDF
Page 20
Fifth line from the bottom.

"Example 2-13. Maven build file" is invalid because there is an extra </plugin> closing tag. Bad: </configuration> </plugin> </plugin> Better: </configuration> </plugin>

Note from the Author or Editor:
Removed extra plugin tag (done by author).

Michah Lerner  Feb 02, 2015  Mar 27, 2015
PDF
Page 20
Example 2-13. maven build example

Is there any reason why Akka repo is needed to build the mini project? It seems like all dependencies of spark-core_2.10:1.1.0 are already available in the maven central.

Note from the Author or Editor:
I have removed the Akka repo from our mini example.

Uladzimir Makaranka  Sep 21, 2014  Jan 26, 2015
PDF
Page 20
Example 2-13

<artifactId>learning-spark-mini-example/artifactId> — the closing tag is missing its opening <; it should read </artifactId>.

Note from the Author or Editor:
Fixed in b99b12fcd3022c298d30f3fcd2b1d88fd7eab57c

Kevin D'Elia  Oct 19, 2014  Jan 26, 2015
PDF
Page 21
Example 2-15

Maven command line executable is called 'mvn'. Please replace "maven clean && maven compile && maven package" with "mvn clean && mvn compile && mvn package". Also the maven build script (Example 2-13) doesn't compile scala code (i.e. c.o.l.mini.scala), please replace "$SPARK_HOME/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.scala.WordCount \ ./target/learning-spark-mini-example-0.0.1.jar ./README.md ./wordcounts" with "$SPARK_HOME/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.java.WordCount \ ./target/learning-spark-mini-example-0.0.1.jar ./README.md ./wordcounts"

Note from the Author or Editor:
Fixed

Uladzimir Makaranka  Sep 21, 2014  Jan 26, 2015
PDF, ePub
Page 21
Example 2-14 and Example 2-15

In order to match with the code in Github: com.oreilly.learningsparkexamples.mini.Scala.WordCount should be: com.oreilly.learningsparkexamples.mini.scala.WordCount and com.oreilly.learningsparkexamples.mini.Java.WordCount should be: com.oreilly.learningsparkexamples.mini.java.WordCount Lower case scala and java as paths. Compilation fails otherwise.

Note from the Author or Editor:
I've fixed this in the copy edit version we got back.

Murali Raju  Dec 13, 2014  Jan 26, 2015
PDF
Page 33
Figure 3-3

READ: RDD2.subtract(RDD2) {panda,tea} SHOULD READ: RDD1.subtract(RDD2) {panda, tea}

Note from the Author or Editor:
I've fixed this in the latest build for author-provided images, but if O'Reilly has already started remaking the images you may need to redo the Figure 3-3 bottom right as the submitter has suggested.

Tatsuo Kawasaki  Aug 18, 2014  Jan 26, 2015
Printed
Page 36
figure 3-4

Page 36, figure 3-4, RDD2: list should be: coffee, monkey, kitty. Currently money is there instead of monkey.

Note from the Author or Editor:
Figure 3-4 should be updated to say monkey instead of money.

Anonymous  Mar 02, 2015  Mar 27, 2015
PDF
Page 37
Figure 3-2. Map and filter on an RDD

FilteredRDD {1,4,9,16} should be FilteredRDD {2,3,4}

Note from the Author or Editor:
Thanks for pointing this out, I've gone ahead and fixed this and it should be in our next build.

Tang Yong  Aug 18, 2014  Jan 26, 2015
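The corrected figure can be sanity-checked without Spark. A minimal pure-Python sketch of the same map/filter pipeline, assuming (as the figure shows) an input RDD of {1, 2, 3, 4}:

```python
# Emulate Figure 3-2 with plain Python lists (no Spark needed).
input_rdd = [1, 2, 3, 4]

# map(x => x * x) produces the MappedRDD from the figure.
mapped = [x * x for x in input_rdd]           # [1, 4, 9, 16]

# filter(x => x != 1) on the original RDD produces the FilteredRDD,
# which is {2, 3, 4} as the erratum says -- not the squared values.
filtered = [x for x in input_rdd if x != 1]   # [2, 3, 4]

print(mapped, filtered)
```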
PDF
Page 37
Example 3-24. Scala squaring the values in an RDD

println(result.collect()) should be result.collect().foreach{x=>println(x)}

Note from the Author or Editor:
Fixed in the latest build.

Tang Yong  Aug 18, 2014  Jan 26, 2015
PDF
Page 40
Example 3-35. Python

Python aggregate example sumCount = nums.aggregate((0, 0), (lambda acc, value: (acc[0] + value, acc[1] + 1)), >>add missing paren<< (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))) Last line remove extra parenthesis

Note from the Author or Editor:
Fixed by author in 54759cf2cf0e41b81bdd56eaa5adb308ac911845

Anonymous  Jan 25, 2015  Mar 27, 2015
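With the extra parenthesis removed, the aggregate call's logic can be checked in pure Python (no Spark required). This sketch emulates RDD.aggregate over two pretend partitions; the input numbers are illustrative, not from the book:

```python
from functools import reduce

# Pure-Python emulation of RDD.aggregate, mirroring the corrected
# Example 3-35: seqOp folds values within a partition, combOp merges
# the per-partition (sum, count) pairs.
nums_part1, nums_part2 = [1, 2], [3, 4]  # pretend partitions

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])

part1 = reduce(seq_op, nums_part1, (0, 0))   # (3, 2)
part2 = reduce(seq_op, nums_part2, (0, 0))   # (7, 2)
sum_count = comb_op(part1, part2)            # (10, 4)
print(sum_count[0] / float(sum_count[1]))    # 2.5
```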
PDF
Page 50
Table 4-2

Right outer join and left outer join "Purpose" descriptions are reversed; in the right outer join, the key must be present in the "other" RDD, not "this" RDD. Reverse mistake is made in the left outer join purpose description. It's clear from looking at the "Result" columns, which are correct, that in the right-join case the only key in the result is from "other", while in left-join the keys in the results are from "this". From scaladoc for right outer join: For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.

Note from the Author or Editor:
Great catch, I've swapped the two.

Wayne M Adams  Feb 24, 2015  Mar 27, 2015
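The corrected descriptions can be made concrete with a pure-Python sketch of the two join semantics (the key/value data here is made up for illustration):

```python
# Pure-Python sketch of leftOuterJoin / rightOuterJoin semantics.
this_rdd = [(1, "a"), (2, "b")]
other_rdd = [(2, "x"), (3, "y")]

def left_outer_join(this, other):
    # Every key in `this` appears; the value from `other` may be None.
    other_map = {}
    for k, w in other:
        other_map.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in this for w in other_map.get(k, [None])]

def right_outer_join(this, other):
    # Every key in `other` appears; the value from `this` may be None.
    this_map = {}
    for k, v in this:
        this_map.setdefault(k, []).append(v)
    return [(k, (v, w)) for k, w in other for v in this_map.get(k, [None])]

print(left_outer_join(this_rdd, other_rdd))   # keys from `this`: 1 and 2
print(right_outer_join(this_rdd, other_rdd))  # keys from `other`: 2 and 3
```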
Printed
Page 53
Example 4-11. Second line.

Shouldn't "rdd.flatMap(...)" be "input.flatMap(...)"?

Note from the Author or Editor:
Fixed in Atlas.

Jim Williams  Apr 06, 2015 
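For context, assuming Example 4-11 is the word-count pipeline built from `input`, its flatMap/reduceByKey logic can be sketched in pure Python (the sample lines are made up):

```python
from collections import Counter

lines = ["hello world", "hello spark"]  # stand-in for the `input` RDD

# flatMap(x => x.split(" ")) flattens all lines into one word list...
words = [w for line in lines for w in line.split(" ")]

# ...and reduceByKey((x, y) => x + y) sums the per-word counts.
counts = Counter(words)
print(counts)
```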
PDF
Page 54
Example 4-12

Example 4-12 does not print out its results as the others do. Also, 4-13 should arguably use a foreach to print as it uses side effects.

Note from the Author or Editor:
Fixed print and swapped to foreach in 6f5d7e5d065f88e4df46e03a61fb5b70d8982649

Justin Pihony  Jan 25, 2015  Mar 27, 2015
PDF
Page 57
Example 4-16

Apparent cut-and-paste mistake: the "Custom parallelism" example is the same as the default one, in that no parallelism Int was specified in the example call.

Note from the Author or Editor:
Fixed, thanks :)

Wayne M Adams  Feb 24, 2015  Mar 27, 2015
ePub
Page 58
Example 4-12

Example 4-12 (Python) is not equivalent to the others: the sum of numbers must be divided by the count to yield the average. Having the Python example implement the same behavior as the Scala and Java examples will aid the reader. My version of the example is: nums = sc.parallelize([(1,2),(1,4),(3,6),(4,6),(4,8),(4,13)]) sumCount = nums.combineByKey((lambda x : (x , 1)), (lambda x, y : (x[0] + y, x[1] + 1)), (lambda x ,y : (x [0] + y[0], x[1] + y[1]))) print sumCount.map(lambda (k,v): (k, v[0]/float(v[1]))).collect()

Note from the Author or Editor:
No action needed; already fixed in the book.

Andres Moreno  Dec 02, 2014  Jan 26, 2015
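The submitter's corrected version can be verified without Spark. This pure-Python sketch emulates what combineByKey does with those three lambdas (createCombiner, mergeValue, mergeCombiners), using the same input pairs:

```python
# Pure-Python check of the corrected combineByKey average logic.
nums = [(1, 2), (1, 4), (3, 6), (4, 6), (4, 8), (4, 13)]

create = lambda x: (x, 1)                                  # createCombiner
merge_value = lambda acc, y: (acc[0] + y, acc[1] + 1)      # mergeValue
merge_combiners = lambda a, b: (a[0] + b[0], a[1] + b[1])  # mergeCombiners (across partitions)

sum_count = {}
for k, v in nums:
    sum_count[k] = merge_value(sum_count[k], v) if k in sum_count else create(v)

# Dividing sum by count yields the per-key average, as the erratum requires.
averages = {k: s / float(c) for k, (s, c) in sum_count.items()}
print(averages)  # {1: 3.0, 3: 6.0, 4: 9.0}
```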
Printed
Page 65
Example 4-24

In example 4-24, "val partitioned = pairs.partitionBy(new spark.HashPartitioner(2))" will not work, as it is not imported in the example. Either an import or a change into "new org.apache.spark.HashPartitioner(2)" would work.

Note from the Author or Editor:
Thanks for pointing this out, I'll update our example to include the import. Fixed in cd090206381a9bbf0466468bf7128a808085522f.

Tom Hubregtsen  Mar 10, 2015  Mar 27, 2015
PDF
Page 66
JSON

It is mentioned that liftweb-json is used for JSON-parsing, however Play JSON is used for parsing and then liftweb-json for JSON output. This is a bit confusing.

Note from the Author or Editor:
I've fixed this in the latest push.

Anonymous  Aug 05, 2014  Jan 26, 2015
PDF
Page 70

Typo: "feildnames" should be "fieldnames".

Note from the Author or Editor:
Fixed in the latest build (typo)

Anonymous  Aug 17, 2014  Jan 26, 2015
PDF
Page 70
first paragraph

"In Python if an value isn’t present None is used and if the value is present the regular value" should be "In Python if a value isn’t present None is used and if the value is present the regular value"

Note from the Author or Editor:
Fixed in Atlas.

Mark Needham  Nov 30, 2014  Jan 26, 2015
PDF
Page 73
Example 5-4

(p. 91 of the PDF doc; p. 73 of the book). This is a total nitpick, but the file url is file://home/holden/salesFiles and instead should be file:///home/holden/salesFiles

Note from the Author or Editor:
Thanks, fixed :)

Wayne M Adams  Feb 26, 2015  Mar 27, 2015
Printed
Page 82
example 5-20

Example 5-20, "Loading a SequenceFile in Python", should drop the "val" in "val data = ..." (val is Scala syntax, not Python). It works otherwise.

Note from the Author or Editor:
Thanks for catching this, I went ahead and fixed this in Atlas.

jonathan greenleaf  Apr 09, 2015 
PDF
Page 85
Example 5-13/5-14

Minor issue; there should be an import java.io.StringReader statement in your CSV loading examples in Scala (and presumably Java).

Note from the Author or Editor:
Fixed in commit a9f9f34a3b8513885325f47c1101e657cb5faa89.

Timothy Elser  Oct 07, 2014  Jan 26, 2015
ePub
Page 87

"We have looked at the fold, combine, and reduce actions on basic RDDs". There is no RDD.combine(), did you mean aggregate()?

Note from the Author or Editor:
Replace combine with aggregate (fixed in f7df06b0c1d730a3a20f173dea8d4ce5c137aa0d).

Thomas Oldervoll  Jan 25, 2015  Mar 27, 2015
PDF
Page 91
Example 5-31

(p. 109 PDF document; page 91 of book). Minor -- with the import of the HiveContext class, there's no need to fully qualify the class name when invoking the HiveContext constructor.

Note from the Author or Editor:
Thanks for catching this, I've simplified the code as suggested in b9d7e376aae27e2f8d4de6d431691a62852d92ba.

Wayne M Adams  Feb 26, 2015  Mar 27, 2015
PDF
Page 102
Third paragraph

Don't need a comma before the word "or" in: "... when there are multiple values to keep track of, or when the same value needs..." "... percentage of our data to be corrupted, or allow for the backend to fail..."

Note from the Author or Editor:
Fixed.

Anonymous  Feb 04, 2015  Mar 27, 2015
ePub
Page 112
3rd

Text reads: “Spark has many levels of persistence to chose from based on what our goals are. ” should read: “Spark has many levels of persistence to choose from based on what our goals are. ”

Note from the Author or Editor:
Fixed in the latest version in Atlas.

Bruce Sanderson  Nov 15, 2014  Jan 26, 2015
ePub
Page 126
1st paragraph

The text "...to how we used fold and map compute the entire RDD average” should read: “ ...to how we used fold and map to compute the entire RDD average”

Note from the Author or Editor:
Fixed in Atlas.

Bruce Sanderson  Nov 18, 2014  Jan 26, 2015
PDF, ePub
Page 130
5th step

It says "(...) run bin/stop-all.sh (...)" It should be "(...) run sbin/stop-all.sh (...)"

Note from the Author or Editor:
Thanks for catching that, I've updated it in adbdb12def7218b0f54cb67f96cb688775e05ec5

Alejandro Ramon Lopez del Huerto  Mar 10, 2015  Mar 27, 2015
PDF
Page 146
Example 8-7

The code example:

scala> val tokenized = input.
     | map(line => line.split(" ")).
     | filter(words => words.size > 0)

on my machine (Spark 1.2.1, Scala 2.10.4, Ubuntu 14.04) gives the following:

scala> tokenized.collect()
res1: Array[Array[String]] = Array(Array(INFO, This, is, a, message, with, content), Array(INFO, This, is, some, other, content), Array(""), Array(INFO, Here, are, more, messages), Array(WARN, This, is, a, warning), Array(""), Array(ERROR, Something, bad, happened), Array(WARN, More, details, on, the, bad, thing), Array(INFO, back, to, normal, messages))

Note there are two one-element arrays, each holding a single empty string -- these are the two empty lines. The filter on "words.size > 0" does not give the expected result, because an empty line, split on " ", gives an array of length 1 containing one empty element, rather than an array of length 0. So nothing is filtered out. The result of collect()ing on counts is:

scala> counts.collect()
res0: Array[(String, Int)] = Array((ERROR,1), (INFO,4), ("",2), (WARN,2))

In my file, each empty line is just a newline character.

Note from the Author or Editor:
This is a good catch, since split on an empty string returns an array with a single element the result isn't what we want. Swapping the order of the map/filter does what we want. Fixed in 0374336d16ebb32ca3452b37c7bb1642ca0755a3.

Wayne M Adams  Mar 10, 2015  Mar 27, 2015
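The root cause is easy to demonstrate in plain Python, since str.split behaves the same way as Scala's split here: splitting an empty string on " " yields a one-element list containing the empty string, not an empty list. The sample lines below are made up, and the "fixed" version mirrors the author's fix of filtering before mapping:

```python
# Splitting an empty line on " " does NOT produce an empty list.
print("".split(" "))  # [''] -- length 1, so a size > 0 filter passes it

lines = ["INFO ok", "", "WARN bad"]

# Buggy order from Example 8-7: split first, then filter on the result.
buggy = [ws for ws in (line.split(" ") for line in lines) if len(ws) > 0]

# Fixed order: filter out empty lines *before* splitting.
fixed = [line.split(" ") for line in lines if len(line) > 0]

print(len(buggy), len(fixed))  # the empty line survives in `buggy` only
```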
Printed
Page 157
6th line down of text

Extra "this".

Note from the Author or Editor:
Thanks for catching this, fixed in Atlas.

Jim Williams  Apr 07, 2015 
PDF
Page 162
2nd Paragraph of section called "Linking with Spark SQL"

Text page 162, PDF page 180, of the 1 April edition contains the following fragment, with duplicated reference to Hive query language: "...and the Hive query language (HiveQL). Hive query language (HQL) It is important..."

Note from the Author or Editor:
Thanks for catching this, I think this was from an indexing tag that accidentally got included in the text. I've changed this in Atlas and it should be removed in the next update.

Wayne M Adams  Apr 02, 2015 
PDF
Page 163
Table

Table 9.1 lists the Scala and Java types/imports for Timestamp. java.sql.TimeStamp should be java.sql.Timestamp

Note from the Author or Editor:
Fixed.

Anirudh Koul  Feb 03, 2015  Mar 27, 2015