Errata for Data Science with Java

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted by Date submitted
CHAPTER1
Third paragraph

The second line states:

"it suffice to build a series of numerical array types (e.g. double[][], int[], String[]) to contain the data"

I don't think String[] is a numerical array type, and I'm also not sure about the notion of "building types" in the sentence, since no new types are being defined. I'd rather say "instances", which seems more precise:

"it suffice to define a series of array instances (e.g. double[][], int[], String[]) to contain the data"

Carlos G. Gavidia  Nov 13, 2015 
CHAPTER1
"Choosing a Data Model" section

The book says "Looping though the data file with a BufferReader". However, the correct name of the Java class is "BufferedReader". The text should be:

"Looping though the data file with a BufferedReader"

Carlos G. Gavidia  Nov 13, 2015 
CHAPTER1
"Choosing a Data Model" section

The section states: "Any methods acting on Record could be static methods ideally in their own class". However, I don't see any justification for this.

If a method operates on data held in the fields of a Record instance (for example, a calculation with the year), it is possible, and even recommended, for that method to live in the Record class.
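
As a minimal sketch of this point: the class below is hypothetical (the name DataRecord, its fields, and the yearsSince method are illustrative assumptions, not the book's code). It keeps a year-based calculation next to the data it uses instead of in a separate utility class.

```java
// Hypothetical stand-in for the book's Record class; field names are assumptions.
final class DataRecord {
    private final String city;
    private final int year;

    DataRecord(String city, int year) {
        this.city = city;
        this.year = year;
    }

    String getCity() {
        return city;
    }

    int getYear() {
        return year;
    }

    // The calculation only needs this record's own data, so it lives with that data.
    int yearsSince(int referenceYear) {
        return year - referenceYear;
    }
}
```

A caller would then write something like new DataRecord("Boston", 2015).yearsSince(2000) rather than passing the record to a static helper.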

Carlos G. Gavidia  Nov 13, 2015 
CHAPTER1
"Java Database Connectivity" section

The first line of the section states that "The Java Database Connectivity (JDBC) is a protocol for Java", but I doubt that JDBC qualifies as a protocol. I think it is more accurate to call it an API.

Carlos G. Gavidia  Nov 13, 2015 
CHAPTER1
"Saving a Plot to a File" section

In the snippet there's a comment stating "save the chart to a file AFTER the stage is rendered". How can I tell from a Java program when the Stage has been rendered? Should I wait a few seconds, or override a particular method?

Carlos G. Gavidia  Nov 14, 2015 
Printed, PDF, ePub Page II
Link to Code on GitHub

Hi,

I could not find the code in the GitHub repository that was listed in the book: https://github.com/oreillymedia/Data_Science_with_Java

All I see is the readme file.

Thanks,
Venkat

Anonymous  Jun 08, 2017 
1
Chapter Data I/O -> Data Models -> Data Objects -> 3rd paragraph -> 1st sentence

There is a probable typo in chapter Data I/O -> Data Models -> Data Objects -> 3rd paragraph -> 1st sentence.
The second word "though" in the following sentence is probably misspelled and should correctly be "through".
"Looping though the data file with a BufferReader, each line can then be parsed and its contents stored in a new Record instance."

Ralph  Jun 12, 2017 
PDF Page 8
"Managing Data Files"

The book states: "As a bare minimum, the entire conents of the file can be read into a String type using a FileReader instance..."

A FileReader instance can only be used if (a) the file is a character file, not binary data, and (b) more importantly, the encoding of the file is the default character encoding. If the encoding of the data file differs from the default character encoding (as commonly happens), InputStreamReader should be used.
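
A minimal sketch of the InputStreamReader approach, assuming the caller knows the file's encoding. The class and method names (EncodingAwareRead, readAll) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class EncodingAwareRead {
    // Reads the whole file, decoding bytes with the given charset
    // instead of relying on the platform default encoding.
    static String readAll(String path, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = br.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

For a UTF-8 file the call would be EncodingAwareRead.readAll(path, StandardCharsets.UTF_8).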

Alexey Vyskubov  Nov 16, 2015 
PDF Page 8
"Managing Data Files"

The book states: "by using a BufferedReader where each line of the file is read separately."

This is wrong. BufferedReader makes it possible to read a file line by line, but it does not itself read the file line by line. First, the buffer size may be smaller than the length of a line. Second, it has methods to read an arbitrary number of characters.
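
To illustrate the two modes, here is a small sketch. The class and method names are hypothetical, and a StringReader stands in for a file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class BufferedReaderModes {
    // Line-oriented access: readLine() returns the first line without its terminator.
    static String firstLine(String text) throws IOException {
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            return br.readLine();
        }
    }

    // Character-oriented access: reads up to 'count' characters and
    // pays no attention to line boundaries.
    static String firstChars(String text, int count) throws IOException {
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            char[] buf = new char[count];
            int n = br.read(buf, 0, count);
            return n <= 0 ? "" : new String(buf, 0, n);
        }
    }
}
```

With the input "ab\ncd", firstLine yields "ab", while firstChars with a count of 4 yields "ab\nc", crossing the line break.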

Alexey Vyskubov  Nov 16, 2015 
PDF Page 8
"Understanding the File Structure"

The book says: "Recall that ascii files are just a collection of ascii characters printed to each line."

There is no concept of a "line" in an ASCII file. An ASCII file is just a collection of ASCII characters, and that is it. The notion of a "line" is platform-specific, and line breaks can be represented, for example, by LF (0xA), CR (0xD), or a combination of the two.
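
A short sketch of how this plays out in Java: BufferedReader.readLine() treats LF, CR, and CR followed by LF all as line terminators. The class name and the in-memory input are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LineTerminators {
    // Splits text into lines the way BufferedReader.readLine() sees them.
    static List<String> lines(String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                out.add(line);
            }
        }
        return out;
    }
}
```

Calling lines("a\nb\rc\r\nd") gives four entries: a, b, c, and d.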

Alexey Vyskubov  Nov 16, 2015 
PDF Page 9
After the second code example

The fact that 'wc -l somefile.txt' returns 1025 does not mean that there are 1025 lines in the file (header + 1024 lines of data). The output of 'wc' depends on the file format: if the last line ends with a newline character (as often happens, and is sometimes even considered good practice), the output will be one more than the number of lines in the file.

Alexey Vyskubov  Nov 16, 2015 
PDF Page 10
last code example

Catching the generic Exception is very bad practice and, in fact, a well-known anti-pattern in Java. The code should catch the IOException thrown by readLine, or do nothing at all.
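
A sketch of the narrower handling, under the assumption that returning null on failure is acceptable for the caller. HeaderReader and readHeader are hypothetical names, not the book's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class HeaderReader {
    // Returns the first line of the source, or null if the source is
    // empty or an I/O error occurs while reading it.
    static String readHeader(Reader source) {
        try (BufferedReader br = new BufferedReader(source)) {
            return br.readLine();
        } catch (IOException e) {
            // Only I/O failures are handled here; programming errors such as
            // NullPointerException still propagate instead of being swallowed.
            System.err.println("Could not read header: " + e.getMessage());
            return null;
        }
    }
}
```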

Alexey Vyskubov  Nov 16, 2015 
PDF Page 10
3rd paragraph

The book states:
"Any methods acting on Record could be static methods ideally in their own class titled something like RecordUtils."

This is a very strange idea. Java is an object-oriented language, and encapsulation is one of the important principles of object-oriented design. Encapsulation tells us that data should be bundled with the methods that operate on it.

Please refer, for example, to the article in Java World (http://www.javaworld.com/article/2075271/core-java/encapsulation-is-not-information-hiding.html) where the author shows an example of having the data in Position class and methods in PositionUtility class, and proceeds with explaining why it's not the way things should be designed. To quote the article: "Though the code may use Java objects, it does so in a manner reminiscent of a by-gone era: utility functions operating on data structures. Welcome to 1972! "

Alexey Vyskubov  Nov 16, 2015 
PDF Page 11
First code example

If the first br.readLine() throws IOException, the catch block will try to print out 'line' which is not yet defined; instead it should print 'header'.

Alexey Vyskubov  Nov 16, 2015 
PDF Page 12
"Delimited Strings", plus code example on the next page

The book says:

"Considering the popularity of spreadsheets and database dumps it is highly likely you will be given a “comma separated values” (CSV) dataset at some point. Parsing this kind of file could not be easier! "

The CSV format is extremely hard to parse. The completely broken code example on the next page demonstrates this.

The main problem is that there's no CSV format standard, and each piece of software does things a bit differently.

Of course, there is RFC 4180, which describes the de facto CSV format.

The code on the next page of the book (page 13), which is supposed to demonstrate how easy it is to parse the CSV format, does the following things wrong (with respect to RFC 4180):

"Spaces are considered part of a field and should not be ignored." -- the code trims whitespace from "city" field.

"Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes." -- the code completely breaks for any such non-trivial field. Commas are discussed later, but not new lines and not escaped quotes inside fields.

While the author does go on to discuss commas in fields, the original statement ("could not be easier") is very misleading. I'd propose removing everything from the "could not be easier" sentence up to the discussion of Commons CSV (which is the proper thing to teach; respect to the author for bringing it up!). Inserting a sentence along the lines of "parsing file formats can be tricky; before doing anything yourself, check whether there is a ready-made library for it" could be helpful as well.
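
The quoted-field problem can be reproduced with the standard library alone. This is a sketch with a made-up input line and hypothetical names, not the book's example.

```java
class NaiveCsvSplit {
    // Splits on every comma, with no awareness of double-quoted fields.
    static String[] naiveFields(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // Three logical fields; the second one contains a comma inside quotes.
        String line = "1,\"Portland, OR\",2015";
        for (String field : naiveFields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

The naive split produces four pieces here, cutting the quoted field in two and leaving the quote characters in place.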

Alexey Vyskubov  Nov 16, 2015 
PDF Page 13
2nd-3rd non-code line from the bottom

"This is quite tricky to parse and requires regex."

I'd propose removing any mention of regex from this part of the text. I think most readers of this book will fall into one of two categories: people who don't know what "regex" is, and people who already know that the "," inside line.split(",") in the previous code example was actually a regex already, albeit a very simple one.
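
That String.split takes a regex becomes visible as soon as the delimiter is a regex metacharacter. A small sketch, with hypothetical class and method names:

```java
import java.util.regex.Pattern;

class SplitIsRegex {
    // Passes the argument to split() as-is, so it is interpreted as a regex.
    static int countAsRegex(String s, String regex) {
        return s.split(regex).length;
    }

    // Quotes the delimiter first so that it is matched literally.
    static int countAsLiteral(String s, String delimiter) {
        return s.split(Pattern.quote(delimiter)).length;
    }
}
```

For the input "a.b.c" and the delimiter ".", countAsRegex returns 0 because "." matches every character, while countAsLiteral returns 3.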

Alexey Vyskubov  Nov 16, 2015 
PDF Page 14
Code example on top

In the Commons CSV usage example, the author unnecessarily complicates things by handling the file reading manually instead of letting the library do it.

Instead, the CSV parser should be created using one of the factory methods described in the documentation for the CSVParser class, and the code should be structured the way the documentation recommends:

File csvData = new File("/path/to/csv");
CSVParser parser = CSVParser.parse(csvData, CSVFormat.RFC4180);
for (CSVRecord csvRecord : parser) {
    ...
}

Alexey Vyskubov  Nov 16, 2015 
PDF Page 14
JSON Strings, 1st sentence

"Javascript Object Notation (JSON) is a protocol"

It's not a protocol, it's a format.

Alexey Vyskubov  Nov 16, 2015 
PDF Page 15
last sentence before code example + code example itself

"It is straight forward to build our dataset now using org.simple.json" -- what is "org.simple.json"? Google doesn't find anything like that. I suspect the author means JSON.simple (https://code.google.com/p/json-simple/)

Related code example has the following major problems:

1. If indeed the author means to use JSON.simple, the first two import lines are wrong. They should be:

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

(notice the order of json and simple, and "parser" before JSONParser)

2. The code won't compile in any case, because it defines JSONObject called 'obj' but tries to use something undefined called 'j'.

Alexey Vyskubov  Nov 16, 2015 
PDF Page 16
code in "Writing to a File" section

The code in the book says:

/* or feed in an Iterator */
String newString = String.join(",", myList);

Please note that String.join takes an Iterable<? extends CharSequence>, not an Iterator, as its second parameter, so the comment is wrong.
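
A brief sketch of the difference. The names are hypothetical, and the one-shot lambda adapter for an Iterator is just one possible workaround, not something the book proposes.

```java
import java.util.Iterator;

class JoinExamples {
    // Any Iterable of strings (a List, a Set, ...) can be passed directly.
    static String joinIterable(Iterable<String> parts) {
        return String.join(",", parts);
    }

    // An Iterator has to be adapted first. This wrapper can be traversed
    // only once, which is enough for a single join call.
    static String joinIterator(Iterator<String> parts) {
        Iterable<String> once = () -> parts;
        return String.join(",", once);
    }
}
```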

Alexey Vyskubov  Nov 16, 2015 
PDF Page 17
Second sentence after the topmost code example

"In any case, the strings are written line by line."

I don't understand the meaning of this phrase.

Alexey Vyskubov  Nov 16, 2015 
PDF Page 17
First sentence after topmost code example

The book says: "Note that successively using string += string_part calls the StringBuilder class, so you might as well use StringBuilder anyway (or not). "

Using something like "string += string_part" in a loop is a mistake. The string concatenation does indeed use the StringBuilder class, *creating and discarding a new StringBuilder instance in each iteration of the loop*, which can create quite a lot of overhead.

When building a string from parts in a loop, one should always use StringBuilder explicitly.
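
A minimal sketch of the explicit form, using a hypothetical helper and a semicolon as an arbitrary separator:

```java
import java.util.List;

class JoinInLoop {
    // One StringBuilder is created up front and reused for every part.
    static String build(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part).append(';');
        }
        return sb.toString();
    }
}
```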

Alexey Vyskubov  Nov 16, 2015 
PDF Page 17
Second code example

Instead of doing

bw.write(s + "\n");

one should do

bw.write(s);
bw.newLine();

Please refer to the documentation for BufferedWriter class for details.
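
For illustration, a sketch that writes to an in-memory StringWriter rather than a file, so the result can be inspected. The class and method names are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.util.List;

class PortableLines {
    // Writes each string followed by the platform's line separator,
    // which is what newLine() emits instead of a hard-coded "\n".
    static String writeLines(List<String> lines) throws IOException {
        StringWriter sink = new StringWriter();
        try (BufferedWriter bw = new BufferedWriter(sink)) {
            for (String s : lines) {
                bw.write(s);
                bw.newLine();
            }
        }
        return sink.toString();
    }
}
```

Each line in the result ends with System.lineSeparator(), so the output matches the conventions of the machine that wrote it.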

Alexey Vyskubov  Nov 16, 2015 
PDF Page 18
Code example

for(){
/* adds a new line for you! */ pw.println("my data");
}

What's going on here? What's this loop?

Alexey Vyskubov  Nov 16, 2015 
PDF Page 18
First complete sentence on the page

"This could also be useful if you are generating text files on your own computer (and therefore OS) and will be comsuming these files yourself."

Is it possible to elaborate? I don't understand the logic here.

Alexey Vyskubov  Nov 16, 2015 
PDF Page 18
Second sentence after "Mastering Database Operations" header

The book says: "...distributed file systems like Hadoop"

Hadoop is not a distributed file system. Hadoop, put simply, consists of HDFS (which *is* a distributed file system) and MapReduce (which is the processing part).

Alexey Vyskubov  Nov 16, 2015 
PDF Page 18
1st paragraph after "Mastering Database Operations" header

The book says: "Indeed, popular tools built on top of MongoDB or Hadoop such Apache Drill, Apache Hive or Cloudera Impala..."

I don't think that "built on top" is the right way to present things. Apache Drill, for example, supports interfacing with MongoDB or Hadoop, but is not built on top of either of those, as far as I know.

Alexey Vyskubov  Nov 16, 2015 
PDF Page 18
1st paragraph after "Mastering Database Operations" header

"From the other end, Postgresql has added the ability to store and query unstructured data with the addition of the JSON type. MySQL is not far behind. "

MySQL has supported the JSON data type since version 5.7.8, released half a year ago. I don't think "not far behind" is the right way to express this fact.

Alexey Vyskubov  Nov 16, 2015 
PDF Page 18
Table 1-1

I don't quite understand the missing mark for "jdbc" in "MongoDB" row. There are multiple jdbc drivers for MongoDB. For example: http://www.unityjdbc.com/mongojdbc/mongo_jdbc.php

Alexey Vyskubov  Nov 16, 2015 
PDF Page 18
second line from the top

What's that (root:" ")? I guess the author tried to say that we consider the case of admin user with empty password, but I'm not sure.

Alexey Vyskubov  Dec 25, 2015 
PDF Page 20
Last paragraph on the page

I think on Mac no amount of "ls -alh"ing would find the proper directory. The way to run ij on Mac could be:

$ $(/usr/libexec/java_home)/db/bin/ij

java_home finds the right directory, something like /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home

Following the symlinks from /usr/bin/javac will lead to /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/javac instead, and it seems there's no ij nearby.

Alexey Vyskubov  Dec 25, 2015 
PDF Page 21
top of the page

While ij can seemingly be run with the given java command line (java -cp .../derbytools.jar org.apache.derby.tools.ij), it's a very bad idea, at least on Mac.

Compare:

(Proper way to run ij)

$ $(/usr/libexec/java_home)/db/bin/ij
ij version 10.11
ij> connect 'jdbc:derby:memory:MyDbTest;create=true';
ij>

(calling it with given command line)

$ java -cp /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/db/lib/derbytools.jar org.apache.derby.tools.ij
ij version 10.11
ij> connect 'jdbc:derby:memory:MyDbTest;create=true';
ERROR 08001: No suitable driver found for jdbc:derby:memory:MyDbTest;create=true
ij>

The 'ij' script seems to do much more than just add derbytools.jar to the classpath.

Alexey Vyskubov  Dec 25, 2015 
PDF Page 26
1st paragraph

...known as "try with resource"...

It should be called "try-with-resources" (see official documentation: http://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html)
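
For readers unfamiliar with the construct, here is a small sketch of what try-with-resources guarantees. The Tracked class is a hypothetical resource that only records whether close() was called.

```java
class TryWithResourcesDemo {
    // Hypothetical resource that remembers whether it has been closed.
    static final class Tracked implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    // close() is called automatically when the block finishes normally.
    static boolean closedAfterBlock() {
        Tracked t = new Tracked();
        try (Tracked resource = t) {
            // use the resource here
        }
        return t.closed;
    }

    // close() is also called when the block exits with an exception.
    static boolean closedAfterFailure() {
        Tracked t = new Tracked();
        try (Tracked resource = t) {
            throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException e) {
            // close() has already run by the time this handler executes
        }
        return t.closed;
    }
}
```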

Alexey Vyskubov  Jan 03, 2016