Errata for Data Algorithms

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted by Date submitted
PDF Page xxvi
End of first paragraph

The line,

"For example, if DNA-Sequencing1 takes 60
hours with three servers, then by "scaling out" the solution might produce
the same DNA-Sequencing with 50 similar servers in less than 2 hours."

Should say

"For example, if DNA-Sequencing1 takes 60
hours with three servers, then by "scaling out" the solution might produce
the same DNA-Sequencing with 50 similar servers in less than 4 hours."

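Reason: 60 hours on 3 servers is 180 server-hours. Even with ideal linear scaling, 50 servers would need 180 / 50 = 3.6 hours for the same work, so "less than 4 hours" is the achievable claim. Finishing in less than 2 hours would take roughly 100 servers (180 / 100 = 1.8 hours).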
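The arithmetic behind this reason can be checked with a small sketch. It assumes ideal linear scaling (no coordination or I/O overhead), which real clusters rarely achieve; the class and method names are made up for illustration and do not come from the book.

```java
final class ScaleOutEstimate {
    // Ideal wall-clock hours when the same total work is spread evenly
    // over newServers machines (assumes perfect linear scaling).
    static double idealHours(double hours, int servers, int newServers) {
        double serverHours = hours * servers; // total work, e.g. 60 * 3 = 180
        return serverHours / newServers;
    }

    public static void main(String[] args) {
        System.out.println(idealHours(60, 3, 50));  // 3.6
        System.out.println(idealHours(60, 3, 100)); // 1.8
    }
}
```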

Manoj Agarwal  Nov 23, 2014 
Printed Page 3
2nd bullet point

Page 3 (second bullet point) cites Java Code Geeks for secondary sorting. I think this should instead be attributed to "Hadoop: The Definitive Guide" by Tom White, who widely publicized the technique. Even the Java Code Geeks article says so; see its Resources section.

Anonymous  May 19, 2016 
Printed Page 4
Example 1-1. DateTemperaturePair class

Page 4: Example 1-1. The DateTemperaturePair class is defined as:
public class DateTemperaturePair implements Writable, WritableComparable<DateTemperaturePair> {
........................
}
There is no need to implement "Writable" separately, because "WritableComparable" already extends it. See the WritableComparable documentation at http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/WritableComparable.html for more details.

Anonymous  May 19, 2016 
Printed Page 6
Code line immediately above Data Flow Using Plug-in Classes

job.setGroupingComparatorClass(YearMonthGroupingComparator.class)

should instead read:

job.setGroupingComparatorClass(DateTemperatureGroupingComparator.class)

Note that there is no YearMonthGroupingComparator class. The code found in GitHub shows this correctly:

https://github.com/mahmoudparsian/data-algorithms-book/blob/master/src/main/java/org/dataalgorithms/chap01/mapreduce/SecondarySortDriver.java

Todd Farmer  May 12, 2016 
Printed Page 7
Figure 1-2. Secondary sorting data flow

The output of partition() shows data for YearMonth value of 2000-11 appearing in both partitions. The DateTemperaturePartitioner class partitions by the YearMonth value, and should result in pairs with the same YearMonth value routed to the same partition.
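The routing rule described here can be sketched in plain Java, without Hadoop on the classpath. The helper partitionFor is a hypothetical stand-in for a partitioner's getPartition() method, and the temperature values are made up for illustration. The point is only that the partition number depends on the YearMonth part of the composite key and on nothing else.

```java
final class YearMonthPartitionSketch {
    // Hypothetical stand-in for getPartition(): the result depends only on
    // yearMonth, so the temperature part of the key cannot change the route.
    static int partitionFor(String yearMonth, int numPartitions) {
        return Math.floorMod(yearMonth.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        // Composite keys are (yearMonth, temperature); temperatures are invented.
        String[][] keys = { {"2000-11", "20"}, {"2000-11", "-10"}, {"2000-12", "5"} };
        for (String[] key : keys) {
            System.out.println(key[0] + "," + key[1] + " -> partition "
                    + partitionFor(key[0], numPartitions));
        }
    }
}
```

Both 2000-11 keys print the same partition number, which is the behavior the figure should show.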

Todd Farmer  May 12, 2016 
PDF Page 47
Chapter 2, Top-10 List

The described parallelisation approach has a fundamental flaw. Constructing a global top-N from a series of local top-N's might not result in the correct output when members of the global top-N are not present in some (or all) of the local top-N lists.

To illustrate, here is a very simple example of a top-2 calculation based on the following local top-3 lists:
top-3 list 1:
A, 5
B, 4
C, 3

top-3 list 2:
D, 5
E, 4
C, 3

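The global number 1 key is C, with a combined value of 6 (3 + 3). But if we took only the local top-2 lists, C would be left out entirely. Merging local top-N lists is therefore safe only when each key occurs in a single partition, or when the values per key are aggregated before the top-N step.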

See also this discussion on Stack Overflow: http://stackoverflow.com/questions/15613966/parallel-top-ten-algorithm-for-distributed-data
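The example above can be reproduced with a short plain-Java sketch. It uses the two lists from this erratum as the partitions and compares a top-2 computed after summing all values per key against a top-2 computed from local top-2 lists only. The class and method names are illustrative and are not taken from the book's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LocalTopNFlaw {
    // Keys of the n largest entries by value; ties are broken by key.
    static List<String> topKeys(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byValue = Integer.compare(b.getValue(), a.getValue());
            return byValue != 0 ? byValue : a.getKey().compareTo(b.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            keys.add(entries.get(i).getKey());
        }
        return keys;
    }

    // Sums the values per key across all partitions.
    static Map<String, Integer> merge(List<Map<String, Integer>> partitions) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> p : partitions) {
            p.forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        return total;
    }

    // Keeps only a partition's local top n entries.
    static Map<String, Integer> localTop(Map<String, Integer> partition, int n) {
        Map<String, Integer> kept = new HashMap<>();
        for (String k : topKeys(partition, n)) {
            kept.put(k, partition.get(k));
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> p1 = Map.of("A", 5, "B", 4, "C", 3);
        Map<String, Integer> p2 = Map.of("D", 5, "E", 4, "C", 3);

        // Aggregate first, then take the top 2: C (3 + 3 = 6) comes first.
        System.out.println(topKeys(merge(List.of(p1, p2)), 2)); // [C, A]

        // Take local top-2 lists first, then merge: C never reaches the merge.
        System.out.println(topKeys(merge(List.of(localTop(p1, 2), localTop(p2, 2))), 2)); // [A, D]
    }
}
```

When every key lives in exactly one partition, the two computations agree, which is the precondition the local top-N approach depends on.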

Robbert Zijp  Aug 24, 2014 
Printed Page 260
2nd paragraph

The 1st bullet point reads:
"Give that today is foggy, what is the probability that it will be rainy two days from now?"

The problem asks for S3 to be "Rainy", but the solution that follows in the text is worked with S3 as "Foggy".

Sumit Pal  Feb 28, 2016 
PDF Page 687
3rd bullet item, starting with 'It does not allow false negative errors'

There is an error in this sentence:
'This means that if x is /not/ in the set, then for sure it will indicate that x is not in the set.'
This should be:
'This means that if x is in the set, then for sure it will /not/ indicate that x is not in the set.'

The original sentence also contradicts the previous bullet about false positive errors, which are allowed: 'This means that for some x, which is not in the set, Bloom filter might indicate that x is in the set.'
Both the 2nd and the 3rd bullets describe the case where x is not in the set:
- According to the 2nd bullet, a Bloom filter might report that x is in the set,
- but according to the 3rd bullet, a Bloom filter in the same case will never report that x is in the set.
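The two guarantees can be seen in a minimal Bloom filter sketch. This is an illustration only, not the book's implementation: the class name, the bit-array size, and the way the second hash is derived are all assumptions chosen for brevity.

```java
import java.util.BitSet;

final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th bit position for a key, derived from two base hashes.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // illustrative second hash; odd, so never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added (no false negatives).
    // true: the key was possibly added (false positives are allowed).
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(64, 3);
        String[] added = {"alpha", "beta", "gamma"};
        for (String s : added) {
            filter.add(s);
        }
        for (String s : added) {
            System.out.println(s + " -> " + filter.mightContain(s)); // always true
        }
        // Usually false; a true here would be a false positive, which is permitted.
        System.out.println("delta -> " + filter.mightContain("delta"));
    }
}
```

Because add() only ever sets bits and nothing clears them, every added key keeps all of its bits set, so mightContain() can never return false for a key that is in the set. That matches the corrected wording of the 3rd bullet.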

Robbert Zijp  Aug 24, 2014