Learning Spark

Errata for Learning Spark



The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.




Version Location Description Submitted By Date Submitted Date Corrected
Safari Books Online
http://techbus.safaribooksonline.com/9781449359034/subsec_passing_functions_html#example3-21
Example 3-21

In Example 3-21, return types for the getMatches* methods are incorrect. getMatchesFunctionReference() should return RDD[Boolean], and getMatchesFieldReference() and getMatchesNoReference() should either return RDD[Array[String]] or change the implementation to use flatMap instead of map.

Note from the Author or Editor:
I've updated the return types, thanks for catching this.

Anonymous  Feb 19, 2015  Mar 27, 2015
PDF
Page vii
2nd paragraph

duplicated wording READ: "You’ll learn how to learn how to download..." SHOULD READ: "You’ll learn how to download..."

Note from the Author or Editor:
Fixed in fe6dc3e1dd493a83464e115a4309ab806cf240cb

Ricardo Almeida  Oct 08, 2014  Jan 26, 2015
PDF
Page 9, 10
P9 - Downloading Spark: P1; P10 - 1st paragraph after the notes

Page 9 has the following text: "This will download a compressed tar file, or “tarball,” called spark-1.1.0-bin-hadoop1.tgz." On page 10, a different tarball is referenced:

cd ~
tar -xf spark-1.1.0-bin-hadoop2.tgz
cd spark-1.1.0-bin-hadoop2

Kevin D'Elia  Oct 20, 2014  Jan 26, 2015
PDF
Page 20
Fifth line from the bottom.

"Example 2-13. Maven build file" is invalid because there is an extra </plugin> closing tag. Bad: </configuration> </plugin> </plugin> Better: </configuration> </plugin>

Note from the Author or Editor:
Removed extra plugin tag (done by author).

Michah Lerner  Feb 02, 2015  Mar 27, 2015
PDF
Page 20
Example 2-13. maven build example

Is there any reason why Akka repo is needed to build the mini project? It seems like all dependencies of spark-core_2.10:1.1.0 are already available in the maven central.

Note from the Author or Editor:
I have removed the Akka repo from our mini example.

Uladzimir Makaranka  Sep 21, 2014  Jan 26, 2015
PDF
Page 20
Example 2-13

<artifactId>learning-spark-mini-example/artifactId> — the closing tag is missing its opening <; it should read </artifactId>.

Note from the Author or Editor:
Fixed in b99b12fcd3022c298d30f3fcd2b1d88fd7eab57c

Kevin D'Elia  Oct 19, 2014  Jan 26, 2015
PDF
Page 21
Example 2-15

Maven command line executable is called 'mvn'. Please replace "maven clean && maven compile && maven package" with "mvn clean && mvn compile && mvn package". Also the maven build script (Example 2-13) doesn't compile scala code (i.e. c.o.l.mini.scala), please replace "$SPARK_HOME/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.scala.WordCount \ ./target/learning-spark-mini-example-0.0.1.jar ./README.md ./wordcounts" with "$SPARK_HOME/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.java.WordCount \ ./target/learning-spark-mini-example-0.0.1.jar ./README.md ./wordcounts"

Note from the Author or Editor:
Fixed

Uladzimir Makaranka  Sep 21, 2014  Jan 26, 2015
PDF, ePub
Page 21
Example 2-14 and Example 2-15

In order to match with the code in Github: com.oreilly.learningsparkexamples.mini.Scala.WordCount should be: com.oreilly.learningsparkexamples.mini.scala.WordCount and com.oreilly.learningsparkexamples.mini.Java.WordCount should be: com.oreilly.learningsparkexamples.mini.java.WordCount Lower case scala and java as paths. Compilation fails otherwise.

Note from the Author or Editor:
I've fixed this in the copy edit version we got back.

Murali Raju  Dec 13, 2014  Jan 26, 2015
PDF
Page 33
Figure 3-3

READ: RDD2.subtract(RDD2) {panda,tea} SHOULD READ: RDD1.subtract(RDD2) {panda, tea}

Note from the Author or Editor:
I've fixed this in the latest build for author-provided images, but if O'Reilly has already started remaking the images you may need to redo the Figure 3-3 bottom right as the submitter has suggested.

Tatsuo Kawasaki  Aug 18, 2014  Jan 26, 2015
Printed
Page 36
figure 3-4

Page 36, figure 3-4, RDD2: list should be: coffee, monkey, kitty. Currently money is there instead of monkey.

Note from the Author or Editor:
Figure 3-4 should be updated to say monkey instead of money.

Anonymous  Mar 02, 2015  Mar 27, 2015
PDF
Page 37
Figure 3-2. Map and filter on an RDD

FilteredRDD {1,4,9,16} should be FilteredRDD {2,3,4}

Note from the Author or Editor:
Thanks for pointing this out, I've gone ahead and fixed this and it should be in our next build.

Tang Yong  Aug 18, 2014  Jan 26, 2015
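The corrected figure can be sanity-checked without Spark. A minimal pure-Python sketch of the same map/filter pipeline, assuming (as the figure shows) an input RDD of {1, 2, 3, 4}:

```python
# Emulate Figure 3-2 with plain Python lists (no Spark needed).
input_rdd = [1, 2, 3, 4]

# map(x => x * x) produces the MappedRDD from the figure.
mapped = [x * x for x in input_rdd]           # [1, 4, 9, 16]

# filter(x => x != 1) on the original RDD produces the FilteredRDD,
# which is {2, 3, 4} as the erratum says -- not the squared values.
filtered = [x for x in input_rdd if x != 1]   # [2, 3, 4]

print(mapped, filtered)
```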
PDF
Page 37
Example 3-24. Scala squaring the values in an RDD

println(result.collect()) should be result.collect().foreach{x=>println(x)}

Note from the Author or Editor:
Fixed in the latest build.

Tang Yong  Aug 18, 2014  Jan 26, 2015
PDF
Page 40
Example 3-35. Python

Python aggregate example sumCount = nums.aggregate((0, 0), (lambda acc, value: (acc[0] + value, acc[1] + 1)), >>add missing paren<< (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))) Last line remove extra parenthesis

Note from the Author or Editor:
Fixed by author in 54759cf2cf0e41b81bdd56eaa5adb308ac911845

Anonymous  Jan 25, 2015  Mar 27, 2015
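With the extra parenthesis removed, the aggregate call's logic can be checked in pure Python (no Spark required). This sketch emulates RDD.aggregate over two pretend partitions; the input numbers are illustrative, not from the book:

```python
from functools import reduce

# Pure-Python emulation of RDD.aggregate, mirroring the corrected
# Example 3-35: seqOp folds values within a partition, combOp merges
# the per-partition (sum, count) pairs.
nums_part1, nums_part2 = [1, 2], [3, 4]  # pretend partitions

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])

part1 = reduce(seq_op, nums_part1, (0, 0))   # (3, 2)
part2 = reduce(seq_op, nums_part2, (0, 0))   # (7, 2)
sum_count = comb_op(part1, part2)            # (10, 4)
print(sum_count[0] / float(sum_count[1]))    # 2.5
```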
PDF
Page 50
Table 4-2

Right outer join and left outer join "Purpose" descriptions are reversed; in the right outer join, the key must be present in the "other" RDD, not "this" RDD. Reverse mistake is made in the left outer join purpose description. It's clear from looking at the "Result" columns, which are correct, that in the right-join case the only key in the result is from "other", while in left-join the keys in the results are from "this". From scaladoc for right outer join: For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.

Note from the Author or Editor:
Great catch, I've swapped the two.

Wayne M Adams  Feb 24, 2015  Mar 27, 2015
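The corrected descriptions can be made concrete with a pure-Python sketch of the two join semantics (the key/value data here is made up for illustration):

```python
# Pure-Python sketch of leftOuterJoin / rightOuterJoin semantics.
this_rdd = [(1, "a"), (2, "b")]
other_rdd = [(2, "x"), (3, "y")]

def left_outer_join(this, other):
    # Every key in `this` appears; the value from `other` may be None.
    other_map = {}
    for k, w in other:
        other_map.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in this for w in other_map.get(k, [None])]

def right_outer_join(this, other):
    # Every key in `other` appears; the value from `this` may be None.
    this_map = {}
    for k, v in this:
        this_map.setdefault(k, []).append(v)
    return [(k, (v, w)) for k, w in other for v in this_map.get(k, [None])]

print(left_outer_join(this_rdd, other_rdd))   # keys from `this`: 1 and 2
print(right_outer_join(this_rdd, other_rdd))  # keys from `other`: 2 and 3
```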
Printed
Page 53
Example 4-11. Second line.

Shouldn't "rdd.flatMap(...)" be "input.flatMap(...)"?

Note from the Author or Editor:
Fixed in Atlas.

Jim Williams  Apr 06, 2015 
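For context, assuming Example 4-11 is the word-count pipeline built from `input`, its flatMap/reduceByKey logic can be sketched in pure Python (the sample lines are made up):

```python
from collections import Counter

lines = ["hello world", "hello spark"]  # stand-in for the `input` RDD

# flatMap(x => x.split(" ")) flattens all lines into one word list...
words = [w for line in lines for w in line.split(" ")]

# ...and reduceByKey((x, y) => x + y) sums the per-word counts.
counts = Counter(words)
print(counts)
```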
PDF
Page 54
Example 4-12

Example 4-12 does not print out its results as the others do. Also, 4-13 should arguably use a foreach to print as it uses side effects.

Note from the Author or Editor:
Fixed print and swapped to foreach in 6f5d7e5d065f88e4df46e03a61fb5b70d8982649

Justin Pihony  Jan 25, 2015  Mar 27, 2015
PDF
Page 57
Example 4-16

Apparent cut-and-paste mistake: the "Custom parallelism" example is the same as the default one, in that no parallelism Int was specified in the example call.

Note from the Author or Editor:
Fixed, thanks :)

Wayne M Adams  Feb 24, 2015  Mar 27, 2015
ePub
Page 58
Example 4-12

Example 4-12 (Python) is not equivalent to the others: the sum of numbers must be divided by the count to yield the average. Having the Python example implement the same behavior as the Scala and Java examples will aid the reader. My version of the example is: nums = sc.parallelize([(1,2),(1,4),(3,6),(4,6),(4,8),(4,13)]) sumCount = nums.combineByKey((lambda x : (x , 1)), (lambda x, y : (x[0] + y, x[1] + 1)), (lambda x ,y : (x [0] + y[0], x[1] + y[1]))) print sumCount.map(lambda (k,v): (k, v[0]/float(v[1]))).collect()

Note from the Author or Editor:
No action needed; already fixed in the book.

Andres Moreno  Dec 02, 2014  Jan 26, 2015
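The submitter's corrected version can be verified without Spark. This pure-Python sketch emulates what combineByKey does with those three lambdas (createCombiner, mergeValue, mergeCombiners), using the same input pairs:

```python
# Pure-Python check of the corrected combineByKey average logic.
nums = [(1, 2), (1, 4), (3, 6), (4, 6), (4, 8), (4, 13)]

create = lambda x: (x, 1)                                  # createCombiner
merge_value = lambda acc, y: (acc[0] + y, acc[1] + 1)      # mergeValue
merge_combiners = lambda a, b: (a[0] + b[0], a[1] + b[1])  # mergeCombiners (across partitions)

sum_count = {}
for k, v in nums:
    sum_count[k] = merge_value(sum_count[k], v) if k in sum_count else create(v)

# Dividing sum by count yields the per-key average, as the erratum requires.
averages = {k: s / float(c) for k, (s, c) in sum_count.items()}
print(averages)  # {1: 3.0, 3: 6.0, 4: 9.0}
```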
Printed
Page 65
Example 4-24

In example 4-24, "val partitioned = pairs.partitionBy(new spark.HashPartitioner(2))" will not work, as it is not imported in the example. Either an import or a change into "new org.apache.spark.HashPartitioner(2)" would work.

Note from the Author or Editor:
Thanks for pointing this out, I'll update our example to include the import. Fixed in cd090206381a9bbf0466468bf7128a808085522f.

Tom Hubregtsen  Mar 10, 2015  Mar 27, 2015
PDF
Page 66
JSON

It is mentioned that liftweb-json is used for JSON-parsing, however Play JSON is used for parsing and then liftweb-json for JSON output. This is a bit confusing.

Note from the Author or Editor:
I've fixed this in the latest push.

Anonymous  Aug 05, 2014  Jan 26, 2015
PDF
Page 70

Typo: "feildnames" should be "fieldnames".

Note from the Author or Editor:
Fixed in the latest build (typo)

Anonymous  Aug 17, 2014  Jan 26, 2015
PDF
Page 70
first paragraph

"In Python if an value isn’t present None is used and if the value is present the regular value" should be "In Python if a value isn’t present None is used and if the value is present the regular value"

Note from the Author or Editor:
Fixed in Atlas.

Mark Needham  Nov 30, 2014  Jan 26, 2015
PDF
Page 73
Example 5-4

(p. 91 of the PDF doc; p. 73 of the book). This is a total nitpick, but the file url is file://home/holden/salesFiles and instead should be file:///home/holden/salesFiles

Note from the Author or Editor:
Thanks, fixed :)

Wayne M Adams  Feb 26, 2015  Mar 27, 2015
Printed
Page 82
example 5-20

Example 5-20, "Loading a SequenceFile in Python", should drop the "val" in "val data = ..." (val is Scala syntax, not Python). It works otherwise.

Note from the Author or Editor:
Thanks for catching this, I went ahead and fixed this in Atlas.

jonathan greenleaf  Apr 09, 2015 
PDF
Page 85
Example 5-13/5-14

Minor issue; there should be an import java.io.StringReader statement in your CSV loading examples in Scala (and presumably Java).

Note from the Author or Editor:
Fixed in commit a9f9f34a3b8513885325f47c1101e657cb5faa89.

Timothy Elser  Oct 07, 2014  Jan 26, 2015
ePub
Page 87

"We have looked at the fold, combine, and reduce actions on basic RDDs". There is no RDD.combine(), did you mean aggregate()?

Note from the Author or Editor:
Replace combine with aggregate (fixed in f7df06b0c1d730a3a20f173dea8d4ce5c137aa0d).

Thomas Oldervoll  Jan 25, 2015  Mar 27, 2015
PDF
Page 91
Example 5-31

(p. 109 PDF document; page 91 of book). Minor -- with the import of the HiveContext class, there's no need to fully qualify the class name when invoking the HiveContext constructor.

Note from the Author or Editor:
Thanks for catching this, I've simplified the code as suggested in b9d7e376aae27e2f8d4de6d431691a62852d92ba.

Wayne M Adams  Feb 26, 2015  Mar 27, 2015
PDF
Page 102
Third paragraph

Don't need a comma before the word "or" in: "... when there are multiple values to keep track of, or when the same value needs..." "... percentage of our data to be corrupted, or allow for the backend to fail..."

Note from the Author or Editor:
Fixed.

Anonymous  Feb 04, 2015  Mar 27, 2015
ePub
Page 112
3rd

Text reads: “Spark has many levels of persistence to chose from based on what our goals are. ” should read: “Spark has many levels of persistence to choose from based on what our goals are. ”

Note from the Author or Editor:
Fixed in the latest version in Atlas.

Bruce Sanderson  Nov 15, 2014  Jan 26, 2015
ePub
Page 126
1st paragraph

The text "...to how we used fold and map compute the entire RDD average” should read: “ ...to how we used fold and map to compute the entire RDD average”

Note from the Author or Editor:
Fixed in Atlas.

Bruce Sanderson  Nov 18, 2014  Jan 26, 2015
PDF, ePub
Page 130
5th step

It says "(...) run bin/stop-all.sh (...)" It should be "(...) run sbin/stop-all.sh (...)"

Note from the Author or Editor:
Thanks for catching that, I've updated it in adbdb12def7218b0f54cb67f96cb688775e05ec5

Alejandro Ramon Lopez del Huerto  Mar 10, 2015  Mar 27, 2015
PDF
Page 146
Example 8-7

The code example:

scala> val tokenized = input.
     | map(line => line.split(" ")).
     | filter(words => words.size > 0)

on my machine (Spark 1.2.1, Scala 2.10.4, Ubuntu 14.04) gives the following:

scala> tokenized.collect()
res1: Array[Array[String]] = Array(Array(INFO, This, is, a, message, with, content), Array(INFO, This, is, some, other, content), Array(""), Array(INFO, Here, are, more, messages), Array(WARN, This, is, a, warning), Array(""), Array(ERROR, Something, bad, happened), Array(WARN, More, details, on, the, bad, thing), Array(INFO, back, to, normal, messages))

Note there are two one-element arrays, each holding a single empty string -- these are the two empty lines. The filter on "words.size > 0" does not give the expected result, because an empty line, split on " ", gives an array of length 1 containing one empty element, rather than an array of length 0. So nothing is filtered out. The result of collect()ing on counts is:

scala> counts.collect()
res0: Array[(String, Int)] = Array((ERROR,1), (INFO,4), ("",2), (WARN,2))

In my file, each empty line is just a newline character.

Note from the Author or Editor:
This is a good catch, since split on an empty string returns an array with a single element the result isn't what we want. Swapping the order of the map/filter does what we want. Fixed in 0374336d16ebb32ca3452b37c7bb1642ca0755a3.

Wayne M Adams  Mar 10, 2015  Mar 27, 2015
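The root cause is easy to demonstrate in plain Python, since str.split behaves the same way as Scala's split here: splitting an empty string on " " yields a one-element list containing the empty string, not an empty list. The sample lines below are made up, and the "fixed" version mirrors the author's fix of filtering before mapping:

```python
# Splitting an empty line on " " does NOT produce an empty list.
print("".split(" "))  # [''] -- length 1, so a size > 0 filter passes it

lines = ["INFO ok", "", "WARN bad"]

# Buggy order from Example 8-7: split first, then filter on the result.
buggy = [ws for ws in (line.split(" ") for line in lines) if len(ws) > 0]

# Fixed order: filter out empty lines *before* splitting.
fixed = [line.split(" ") for line in lines if len(line) > 0]

print(len(buggy), len(fixed))  # the empty line survives in `buggy` only
```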
Printed
Page 157
6th line down of text

Extra "this".

Note from the Author or Editor:
Thanks for catching this, fixed in Atlas.

Jim Williams  Apr 07, 2015 
PDF
Page 162
2nd Paragraph of section called "Linking with Spark SQL"

Text page 162, PDF page 180, of the 1 April edition contains the following fragment, with duplicated reference to Hive query language: "...and the Hive query language (HiveQL). Hive query language (HQL) It is important..."

Note from the Author or Editor:
Thanks for catching this, I think this was from an indexing tag that accidentally got included in the text. I've changed this in Atlas and it should be removed in the next update.

Wayne M Adams  Apr 02, 2015 
PDF
Page 163
Table

Table 9.1 lists the Scala and Java types/imports for Timestamp. java.sql.TimeStamp should be java.sql.Timestamp

Note from the Author or Editor:
Fixed.

Anirudh Koul  Feb 03, 2015  Mar 27, 2015