Hadoop: The Definitive Guide

Errata for Hadoop: The Definitive Guide

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted By Date Submitted
Safari Books Online ?
"Anatomy of a File Write" when describing HDFS dataflow

Sorry that I can't give a page number; I am reading the book on Safari, which doesn't show one. This is a minor issue in chapter 3, "The Hadoop Distributed Filesystem", in the "Data Flow" subsection, under "Anatomy of a File Write". The paragraph starting with "As the client writes data (step 3)" says: "The list of datanodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline." I think it would be technically more correct to say "min replication level" instead of "replication level", because during a file write only the "min replication" number of nodes form a pipeline and are written synchronously (the figure, number 4, shows a synchronous write pipeline); the remaining replicas (replication level minus min replication level) are updated asynchronously after the write succeeds. In fact, this is already mentioned in the following paragraphs of the section. So this is just a minor change to make the sentence less confusing: when it says "replication level", the reader can easily take it as the value of the "dfs.replication" parameter, while the sentence really means the value of the "dfs.namenode.replication.min" parameter.

Nezih Yigitbasi  Jul 18, 2015 
Printed Page 40
1st command

The streaming command that uses the Ruby scripts gives the full paths of the mapper, combiner, and reducer. The command seems to work only when the base names are used:

% hadoop jar /usr/hdp/ -files ch02-mr-intro/src/main/ruby/max_temperature_map.rb,ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb -input input/ncdc/all -output output -mapper max_temperature_map.rb -combiner max_temperature_reduce.rb -reducer max_temperature_reduce.rb

Jonathan Giddy  May 16, 2016 
Printed Page 74
1st paragraph

In the "Replica Placement" section, the author states: "Hadoop’s default strategy is to place the first replica on the same node as the client [...]. The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random."

According to the official documentation, this was true for Hadoop version r1.2.1: "For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack".

However, since version 2.4.1, the HDFS Architecture documentation reads as follows: "For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack".

Considering that the fourth edition covers "Hadoop 2 exclusively" (2.5.1?), it seems that the replica placement strategy described in the book is no longer accurate, unless the cited documentation is wrong.

References:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Replica+Placement%3A+The+First+Baby+Steps
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Replica_Placement:_The_First_Baby_Steps

Juan Sebastian Cadena  Sep 02, 2015 
PDF Page 185
1st paragraph

The last sentence of this paragraph, "If a job fails, JobControl won't run its dependencies.", may be incorrect. I suspect it should be: "If a job fails, JobControl won't run the jobs depending on it."
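The distinction between a job's dependencies and the jobs that depend on it can be illustrated with a small simulation. This is a minimal sketch in plain Java, not the Hadoop JobControl API: the `DependencySketch` class, its `Job` type, and the `run` method are hypothetical names introduced only for illustration, and the jobs are assumed to be listed with prerequisites first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DependencySketch {
    // Hypothetical stand-in for a job in a dependency graph (not a Hadoop class).
    public static class Job {
        public final String name;
        public final boolean succeeds;      // whether this job succeeds when run
        public final List<Job> dependsOn;   // jobs that must succeed before this one runs

        public Job(String name, boolean succeeds, Job... dependsOn) {
            this.name = name;
            this.succeeds = succeeds;
            this.dependsOn = Arrays.asList(dependsOn);
        }
    }

    // Runs jobs in the given order (assumed: prerequisites appear before dependents)
    // and returns the names of the jobs that were actually run.
    public static List<String> run(List<Job> jobsInOrder) {
        Set<Job> succeeded = new HashSet<>();
        List<String> ran = new ArrayList<>();
        for (Job job : jobsInOrder) {
            // A job runs only if every job it depends on has succeeded.
            if (succeeded.containsAll(job.dependsOn)) {
                ran.add(job.name);
                if (job.succeeds) {
                    succeeded.add(job);
                }
            }
        }
        return ran;
    }

    public static void main(String[] args) {
        Job a = new Job("A", false);     // A fails
        Job b = new Job("B", true, a);   // B depends on A, so it is skipped
        Job c = new Job("C", true);      // C is independent of A
        Job d = new Job("D", true, c);   // D depends on C, which succeeds
        System.out.println(run(Arrays.asList(a, b, c, d)));
    }
}
```

In this sketch the failed job A has already run by the time its failure is known; what gets skipped is B, the job depending on A, which matches the wording proposed above.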

sandbox wang  Nov 05, 2015 
Printed Page 249
Table 9-2, 10th Row, 2nd Column, First line.

The first line of the description for REDUCE_OUTPUT_RECORDS reads: "The number of reduce output records produced by all the maps in the job." This is a technical mistake; the line should read: "The number of reduce output records produced by all the reducers in the job."

C Raja  Nov 09, 2015 
Printed, PDF, ePub Page 268
Joins, First paragraph 3rd line

Crunch is misspelled as "Cruc" in the following line: "higher-level framework such as Pig, Hive, Cascading, Cruc, or Spark."

Gaurav Bhardwaj  Nov 27, 2015 
Printed Page 508
3rd paragraph from the bottom

In this query:

select station, year, avg(max_temperature)
from (
  select station, year, max(temperature) as max_temperature
  ...
  group by station, year
) mt
group by station, year;

The subquery produces a single (station, year, max_temperature) record for each (station, year) grouping, so the outer select computes the "average" of a single temperature. Or am I missing something?
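The observation can be checked with a small plain-Java sketch that mimics the two grouping steps as quoted. The class and method names and the sample rows are hypothetical, and the elided part of the query is not reconstructed; the sketch only assumes that both levels group by the same (station, year) key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AvgOfMaxDemo {
    // Inner step: max(temperature) grouped by (station, year).
    // Each row is {station, year, temperature}; the map key is "station/year".
    public static Map<String, Integer> maxByStationYear(List<String[]> rows) {
        Map<String, Integer> maxima = new TreeMap<>();
        for (String[] r : rows) {
            maxima.merge(r[0] + "/" + r[1], Integer.parseInt(r[2]), Integer::max);
        }
        return maxima;
    }

    // Outer step: avg(max_temperature) grouped by the same (station, year) key.
    public static Map<String, Double> avgByStationYear(Map<String, Integer> maxima) {
        Map<String, Integer> sums = new TreeMap<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> e : maxima.entrySet()) {
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
            counts.merge(e.getKey(), 1, Integer::sum);
        }
        Map<String, Double> averages = new TreeMap<>();
        for (String key : sums.keySet()) {
            averages.put(key, sums.get(key) / (double) counts.get(key));
        }
        return averages;
    }

    public static void main(String[] args) {
        // Hypothetical sample readings: station, year, temperature.
        List<String[]> rows = Arrays.asList(
            new String[] {"A", "1950", "0"},
            new String[] {"A", "1950", "22"},
            new String[] {"B", "1949", "78"},
            new String[] {"B", "1950", "111"});
        Map<String, Integer> maxima = maxByStationYear(rows);
        System.out.println(maxima);
        System.out.println(avgByStationYear(maxima));
    }
}
```

Because the inner step leaves exactly one row per key, every group in the outer step contains a single value and the average equals the maximum. The outer average would only do real work if it grouped by a coarser key than the subquery.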

William  Jan 01, 2016 
PDF Page 554
Java code, records.filter anonymous class

The Java program compares strings using the != operator, which will not work unless the strings in all the records in the RDD are interned. It should be !rec[1].equals("9999") instead.
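The difference between the two comparisons can be reproduced outside Spark. The following is a minimal sketch, assuming a tab-separated record whose second field holds the value being tested; the class name, method names, and sample record are hypothetical and are not taken from the book's program.

```java
public class StringCompareDemo {
    // Reference comparison: true whenever the two references point at
    // different objects, even if the characters are identical.
    public static boolean keepByReference(String field) {
        return field != "9999";
    }

    // Value comparison: true only when the characters differ.
    public static boolean keepByEquals(String field) {
        return !field.equals("9999");
    }

    public static void main(String[] args) {
        // split() creates new String objects at run time, much as parsing
        // an input line would; they are not the interned literal "9999".
        String[] rec = "1950\t9999\t1".split("\t");

        System.out.println(keepByReference(rec[1]));          // record is wrongly kept
        System.out.println(keepByEquals(rec[1]));             // record is filtered out
        System.out.println(keepByReference(rec[1].intern())); // filtered out only after interning
    }
}
```

With != the filter keeps a record whose field is "9999" because the parsed string is a different object from the literal; equals compares the characters and filters it out as intended.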

RealSkeptic  Jan 09, 2017