Errata

Data Science at the Command Line

Errata for Data Science at the Command Line

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Version Location Description Submitted by Date submitted
PDF, ePub Page X
2nd paragraph of "What to Expect from This Book" (within Preface)

Preface / What to Expect from This Book, 2nd paragraph:

"while others will be replace by better ones" should read "while others will be replaced by better ones"

Jochen Hayek  Aug 19, 2014 
Chapter 5
Section "Common Scrub Operations for Plain Text", subsection "Based on pattern"

Chapter 5, section "Common Scrub Operations for Plain Text", subsection "Based on pattern", states at the very end:
«Note that you have to specify the -E option in order to enable regular expressions. Otherwise, grep interprets the pattern as a literal string.»

That, I fear, is completely off the mark.
grep, without additional arguments, evaluates the pattern provided as a Basic Regular Expression, not as a literal.
The -E ( --extended-regexp ) switch, or invoking it as egrep, merely uses Extended Regular Expressions instead of Basic Regular Expressions.

In order to interpret the pattern as a literal string, one needs to use the -F (--fixed-strings) switch, or to invoke grep as fgrep.

This is a basic and important notion about the way grep works which ought to be rectified as soon as possible.
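
A minimal sketch of the three matching modes described above, assuming a POSIX-style grep and a made-up three-line sample (the sample text and variable names are illustrative, not from the book):

```shell
# Hypothetical sample: a literal "a+b", a line matching the ERE a+b, and a literal "a.b".
sample=$(printf 'a+b\naab\na.b\n')

# Default mode is Basic Regular Expressions: "+" is an ordinary character here.
bre=$(printf '%s\n' "$sample" | grep 'a+b')

# -E switches to Extended Regular Expressions: "+" means "one or more of the preceding item".
ere=$(printf '%s\n' "$sample" | grep -E 'a+b')

# -F treats the pattern as a fixed string, so "." matches only a literal dot.
fixed=$(printf '%s\n' "$sample" | grep -F 'a.b')

# Without -F the same pattern is a regex, and "." matches any single character.
regex_dot=$(printf '%s\n' "$sample" | grep 'a.b')

echo "BRE 'a+b':   $bre"
echo "ERE 'a+b':   $ere"
echo "fixed 'a.b': $fixed"
echo "regex 'a.b' matched $(printf '%s\n' "$regex_dot" | wc -l | tr -d ' ') lines"
```

The default and -E runs differ only in how "+" is read, while only the -F run treats the pattern as a literal string.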

Fulvio Scapin  Nov 09, 2014 
1
Executing a Command-Line Tool 6th paragraph

The text reads:

> A long command can be broken up with either a backslash (\) or a pipe symbol (|) .


I don't think a `pipe` can be used to "break up" a command the way the backslash does; this may confuse some readers who are not comfortable with Bash.
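
For readers who want to check the behavior themselves, here is a small sketch assuming Bash or another POSIX-style shell. It suggests that the two forms are not equivalent in mechanism, but that a line ending in a pipe is in fact continued on the next line, because the pipeline is still incomplete at that point (the variable names are illustrative):

```shell
# Trailing backslash: the backslash-newline pair is removed, joining the two lines.
with_backslash=$(seq 3 \
  | sort -nr)

# Trailing pipe: the shell keeps reading, since a pipeline cannot end with "|".
with_pipe=$(seq 3 |
  sort -nr)

echo "$with_backslash"
[ "$with_backslash" = "$with_pipe" ] && echo "both forms produce the same output"
```

A backslash can split a command almost anywhere, whereas the pipe only allows a break at a pipeline boundary.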

Andres Lowrie  Jan 30, 2018 
PDF Page 11
line 11

"There are a few command-line tools that require the complete data before they write any data to standard output, like sort and awk": that is simply untrue for awk. Would you please remove awk from that list? awk is a classic Unix filter utility; it certainly does not wait for the end of the input before it starts processing.
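
One way to see the streaming behavior, sketched under the assumption that `yes` and a POSIX awk are available: feed awk an endless input and let it stop after the third record. A tool that needed the complete input first could never finish here.

```shell
# `yes` writes "y" lines forever, so the input never ends.
# awk handles each record as it arrives and exits after the third one.
third=$(yes | awk 'NR == 3 { print "line", NR; exit }')
echo "$third"
```

Replacing awk with sort in this pipeline would hang, since sort cannot emit its first line until it has seen the last one. An awk program that only prints in an END block would likewise wait for the end of the input, but that is a property of the program, not of awk itself.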

Jochen Hayek  Apr 08, 2015 
PDF Page 16
Side note near top of page

URL is given as "http://datasciencatthecommandline.com" - the "e" in "science" is missing.

Anonymous  Aug 11, 2014 
PDF Page 16
Example 202

The 'elaborate' Vagrantfile described here is insufficient to test out the examples provided in the book.

In Chapter 3, Obtaining Data, the author demonstrates how to use cURL to pull data from the Internet. But I could not access the Internet from my virtual machine using this Vagrantfile.

The author does not tell the user how to configure Vagrant to access the Internet. I had to search for an answer.

I added the following to my Vagrantfile:

vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]

This is a major omission. If you feel the need to tell the user how to use the pwd command, you should certainly ensure that the user has Vagrant configured to follow the examples.

Anonymous  Dec 14, 2014 
ePub Page 23
Section 2.3 InfoBox

Poorly worded sentence. Says: "We will only explain the concepts and tools that are relevant for to doing data science."

Should say: "We will only explain the concepts and tools that are relevant to doing data science"

Anonymous  Oct 15, 2019 
PDF Page 36
Infobox

In the infobox on page 36, in the subchapter "Converting Microsoft Excel Spreadsheets", opening the spreadsheet in LibreOffice Calc is described as an alternative to *in2csv*. It may be worth mentioning that LibreOffice has a kind of command-line mode when called with the "headless" parameter from the terminal.

The following command will export all Excel spreadsheets in the current folder to csv files without opening a GUI. So the listed disadvantages of the alternate solution in the infobox are not all true, except for the availability of LibreOffice on remote servers.

$ libreoffice --headless --convert-to csv *.xlsx

The lightweight Gnumeric spreadsheet program even has a more advanced command-line tool named *ssconvert*, which is able to export multiple tables from one spreadsheet file, a feature that LibreOffice and *in2csv* currently lack.

Use ascending integer for csv file name:
$ ssconvert --export-file-per-sheet tables.xlsx table-%n.csv

Use table name for csv file name:
$ ssconvert --export-file-per-sheet tables.xlsx %s.csv

Benjamin Meier  Aug 20, 2014 
PDF Page 36
4th paragraph

Of the text "contains the unwanted text and even an error message" please remove "and even an error message", as there is no error message shown at all.

Jochen Hayek  Apr 08, 2015 
PDF Page 46
4th paragraph of Step 3: Define Shebang

Shebangs always look like #!..., the "#" is missing for each example.
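
A minimal sketch of a complete shebang in use, assuming `/usr/bin/env` and Bash are present and that the temporary directory allows executing files (the script contents and variable names are hypothetical):

```shell
# Create a throwaway script whose first line is a full shebang: "#!" plus the interpreter.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/usr/bin/env bash
echo "hello from a script"
EOF

# Make it executable so the kernel reads the shebang and starts the named interpreter.
chmod +x "$script"

first_line=$(head -n 1 "$script")
output=$("$script")

echo "$first_line"
echo "$output"
rm -f "$script"
```

Without the leading "#", the first line would be run as an ordinary command rather than read as an interpreter directive.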

Jochen Hayek  Apr 08, 2015 
PDF Page 56
First code example

The -e option is needed for the echo command to work as indicated.

You have:
echo 'foo\nbar\nfoo' | sort | uniq -c | sort -nr

The command should be:
echo -e 'foo\nbar\nfoo' | sort | uniq -c | sort -nr

to get the result shown:
2 foo
1 bar
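
Whether echo interprets \n without -e depends on the shell and its settings, so a portable alternative is printf, which always interprets backslash escapes in its format string. A short sketch of the same pipeline, with printf swapped in for echo (an assumption of this example, not the book's wording):

```shell
# printf expands \n itself, independent of the shell's echo behavior.
counts=$(printf 'foo\nbar\nfoo\n' | sort | uniq -c | sort -nr)
echo "$counts"
```

The exact column padding of `uniq -c` varies between implementations, but the counts are 2 for foo and 1 for bar.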

Anonymous  Dec 14, 2014 
PDF Page 56
First and Second line of code

The code printed states:
echo 'foo\nbar\nfoo' | sort | uniq -c | sort -nr

However, I believe that for echo to interpret backslash escapes it must have the -e flag, like so:
echo -e 'foo\nbar\nfoo' | sort | uniq -c | sort -nr

Joe Lotz  Dec 26, 2014 
PDF, ePub Page 60
United States

Missing chapters 5 on. When will these be available? I thought I was receiving the complete book.

Dennis Barnes  Jul 21, 2014 
ePub Page 61
last code snippet of Step 4

The ePub snippet is
$ cat data/ | ./top-words-4.sh

According to the paragraph above, it should be

$ cat data/finn.txt | ./top-words-4.sh

Sébastien Portebois  Oct 12, 2014 
PDF Page 73
First two examples

The -e option is required for the echo command to work.

Instead of

echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' | csvcut -c $(seq 1 2 9 | paste -sd,)

the command should read

echo -e 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' | csvcut -c $(seq 1 2 9 | paste -sd,)

The same goes for the second example on the page.
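
To see what the command substitution in this example contributes, the sketch below prints the generated column list on its own and then applies it. It assumes csvkit may not be installed, so the standard `cut` stands in for `csvcut` (cut does not understand quoted CSV fields, so this is only equivalent for simple data like this), and it uses printf so that \n is interpreted regardless of the shell:

```shell
# seq 1 2 9 emits the odd numbers 1..9, one per line; paste -s joins them with commas.
columns=$(seq 1 2 9 | paste -sd, -)
echo "$columns"

# Select those columns; cut is a stand-in for csvcut in this sketch.
selected=$(printf 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9\n' | cut -d, -f "$columns")
echo "$selected"
```

The explicit "-" operand to paste is for portability, since some implementations do not read standard input without it.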

Butcher Pete  Oct 08, 2015 
Printed Page 105
second command of the page

The original command specified returns an error,
<data/immigration.csv csvcut -c Period,Denmark,Belgium,Netherlands,Norway,Sweden|Rio -re 'melt(df, id="Period", variable.name="Country",value.name="Count")'|tee data/immigration-long.csv|head|csvlook

returning

Loading required package: tidyr
Error: could not find function "melt"
Execution halted

As melt is indeed included in the reshape2 library, once installed, the fix is to load the library in the command, as:

<data/immigration.csv csvcut -c Period,Denmark,Belgium,Netherlands,Norway,Sweden|Rio -re 'library(reshape2);melt(df, id="Period", variable.name="Country",value.name="Count")'|tee data/immigration-long.csv|head|csvlook

Nelson Gaasch  Feb 06, 2016