Errata for Hands-On Unsupervised Learning Using Python

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.


Version Location Description Submitted By Date submitted Date corrected
Printed, PDF
Chapter 6
block of code In [6]


In chapter 6 (early online version of the book, so no page number), I think the regular expression is not correct in this block of code:

# Transform features from string to numeric
for i in ["term","int_rate","emp_length","revol_util"]:
    data.loc[:,i] = \
        data.loc[:,i].apply(lambda x: re.sub("[^0-9]", "", str(x)))
    data.loc[:,i] = pd.to_numeric(data.loc[:,i])

If I'm not wrong (I checked this with R, not Python), with this approach you are removing the decimal separators (points).
The consequence is minor for the variable "int_rate" because there are always two digits after the point (the percentages are just multiplied by 100).
But for "revol_util", for example, 19% will become 19 while 2.10% will become 210.

The point should be included in the brackets in the regex. In R syntax this would be (sorry, I'm not certain of the Python syntax):

txt2num <- c("term","int_rate","emp_length","revol_util")
for (i in txt2num) {
  d[,i] <- as.numeric(gsub("[^0-9\\.]", "", d[,i]))
}

Note also that with this approach, in the variable "emp_length" the categories "< 1 year" and "1 year" will both be transformed into the numeric value "1". Maybe it would be more appropriate to transform "< 1 year" into "0.5" to keep these values separated.

The consequence of this is probably minor for the demonstration of the technique in this chapter.
Thanks for the interesting book by the way...

Note from the Author or Editor:
You are right. The percentages are treated inconsistently for "int_rate" and "revol_util". But all the features are transformed into numeric values and then scaled, so the features (after scaling) are exactly the same. We will keep the code as is for now, but we will note this for any substantial future revisions we make to this notebook. Thanks!
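
For reference, the submitter's R fix can be sketched in Python against the book's loop: adding the decimal point to the character class preserves values such as "2.10%". The `data` DataFrame below is a small hypothetical stand-in for the loans dataset from Chapter 6, so the snippet runs on its own:

```python
import re

import pandas as pd

# Hypothetical stand-in for a few rows of the Chapter 6 loans DataFrame
data = pd.DataFrame({"term": [" 36 months", " 60 months"],
                     "int_rate": ["10.65%", "15.27%"],
                     "emp_length": ["10+ years", "< 1 year"],
                     "revol_util": ["19%", "2.10%"]})

# Keep the decimal point by adding "." to the character class
for i in ["term", "int_rate", "emp_length", "revol_util"]:
    data.loc[:, i] = data.loc[:, i].apply(lambda x: re.sub("[^0-9.]", "", str(x)))
    data.loc[:, i] = pd.to_numeric(data.loc[:, i])

print(data["revol_util"].tolist())  # [19.0, 2.1] rather than [19.0, 210.0]
```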

Gilles San Martin  Feb 22, 2019  Mar 06, 2020
Printed
Page 28
2nd line of code

$ git lfs pull
instruction does not exist

"(base) ~ >git lfs pull
git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
log

Note from the Author or Editor:
Please add a "$pip install git-lfs" command before "$git lfs install".

The new code on page 28 should read:
$ git clone https://github.com/aapatel09/handson-unsupervised-learning.git
$ pip install git-lfs
$ git lfs install
$ git lfs pull

Ernesto Belmont  Mar 05, 2021  May 21, 2021
Printed
Page 28
5th line of code

says

$ activate unsupervisedLearning

should say

$ conda activate unsupervisedLearning

Note from the Author or Editor:
current:

$ activate unsupervisedLearning

should be:

$ conda activate unsupervisedLearning

[on page 28.]

Ernesto Belmont  Mar 05, 2021  May 21, 2021
Printed
Page 30
before Overview of data

On my Mac with macOS 11.2.2, libomp is missing, so it needs to be installed:

> brew install libomp

Note from the Author or Editor:
In the "Interactive Computing Environment: Jupyter Notebook" section on page 30, please add a Note box that says the following:

"On Mac, you may need to install libomp before running $jupyter notebook. Use the following command to install libomp: $brew install libomp"

Ernesto Belmont  Mar 05, 2021  May 21, 2021
Printed
Page 37
8th code line

Using Python 3.8.5 on a Mac with macOS 11.2.2.

I can't generate the data plot on page 38.

I got the error

Traceback (most recent call last):
File "proyecto0.py", line 88, in <module>
ax = sns.barplot(x="count_classes.index", y="tuple(count_classes/len(data))")
--------------------------------
raise ValueError(err)
ValueError: Could not interpret input 'count_classes.index'

Note from the Author or Editor:
Please replace the code block on page 37 with the following (note the change in the second line of code):

count_classes = pd.value_counts(data['Class'],sort=True).sort_index()
ax = sns.barplot(x=count_classes.index, y=[tuple(count_classes/len(data))[0],tuple(count_classes/len(data))[1]])
ax.set_title('Frequency Percentage by Class')
ax.set_xlabel('Class')
ax.set_ylabel('Frequency Percentage')
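
A slightly simpler variant of that second line, which avoids indexing into a tuple twice, is to pass the frequencies as an array. This is only a sketch, not the author's stated fix; the small `data` DataFrame here is a hypothetical stand-in for the credit card dataset so the snippet runs standalone, and the seaborn call is shown commented out:

```python
import pandas as pd

# Hypothetical stand-in: 98 normal transactions, 2 fraudulent
data = pd.DataFrame({"Class": [0] * 98 + [1] * 2})

count_classes = data["Class"].value_counts(sort=True).sort_index()
frequencies = (count_classes / len(data)).values  # one frequency per class

# ax = sns.barplot(x=count_classes.index, y=frequencies)
print(list(frequencies))  # [0.98, 0.02]
```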

Ernesto Belmont  Mar 05, 2021  May 21, 2021
Printed
Page 39
5th paragraph

On the last line,
pij
should be
p with a normal subscript i,j

Note from the Author or Editor:
Please change subscript "pi, j" in the last line of the 5th paragraph on page 39 to normal "P" subscript "i,j".

Ernesto Belmont  Mar 05, 2021  May 21, 2021
Printed
Page 44
2nd para

"using confusion matrix would be useful" probably should be
"using confusion matrix would not be useful"

Note from the Author or Editor:
Given that our credit card transactions dataset is highly imbalanced, using the confusion matrix would be meaningful.

NEEDS TO BE THE FOLLOWING:
Given that our credit card transactions dataset is highly imbalanced, using the confusion matrix would not be meaningful.

scott schmidt  Apr 14, 2019  May 03, 2019
PDF
Page 44
Paragraph "Precision-Recall Curve

The PDF says: Precision = TP / (TP + FN)
But this is wrong.
Right is:
Precision = TP / (TP + FP)

Note from the Author or Editor:
Yes, confirmed. I just fixed the language in the book.

Philip May  Aug 16, 2019  Mar 06, 2020
Printed
Page 45
recall equation

The denominator (true positives + false positives)
should be
(true positives + false negatives)

Note from the Author or Editor:
Recall = True Positives / (True Positives + False Positives)

SHOULD BE:

Recall = True Positives / (True Positives + False Negatives)

scott schmidt  Apr 14, 2019  May 03, 2019
PDF
Page 45
Paragraph Precision-Recall Curve

The PDF says:
Recall = TP / (TP+FP)
That is wrong.
Right is: Recall = TP / (TP+FN)

Note from the Author or Editor:
I corrected this in the book. Thanks.
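
Taken together, the corrected formulas from the errata on pages 44 and 45 can be sketched as follows (tp, fp, and fn are hypothetical counts read off a confusion matrix):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 80, 20, 10

precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)

print(precision)  # 0.8
```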

Philip May  Aug 16, 2019  Mar 06, 2020
Printed
Page 51
Figure 2-6 description text

Figure 2-6 description at top of page has an unnecessary quotation mark:

‘Figure 2-6. Precision-recall curve of random fores”ts’

Note from the Author or Editor:
Yes, please correct.

Tim Hutchinson  May 30, 2020  May 21, 2021
Printed, PDF
Page 59
caption for figure 2-15

The caption for figure 2-15 is: "Test set auROC curve of logistic regression"
Should be replaced by: "Test set auROC curve of random forests"

Note from the Author or Editor:
I made the correction.

Frank Langenau  Dec 29, 2019  Mar 06, 2020
Printed, PDF
Page 109
last line

representaion => representation
(missing "t")

Note from the Author or Editor:
Made the correction.

Frank Langenau  Dec 29, 2019  Mar 06, 2020
Printed, PDF, ePub, Mobi, Other Digital Version
Page 143
Last paragraph

There is a typo.

In the beginning of the explanation about DBSCAN, we have "within a certian distance". This should be "within a certain distance".

Ankur A. Patel  Jul 08, 2021  Dec 10, 2021
Printed, PDF
Page 207
2nd line of code

Coefifcient => Coefficient

Note from the Author or Editor:
I made the correction.

Frank Langenau  Dec 29, 2019  Mar 06, 2020
PDF
Page 240
3rd para in "Matrix Factorization"

The underscores in the expression R = H__W should be replaced by a dot in the middle of the line because it obviously should denote the matrix multiplication of the two matrices.

Note from the Author or Editor:
Correct, I made the correction.

Frank Langenau  Dec 28, 2019  Mar 06, 2020
Printed, PDF
Page 244
6th para from top

"This W_h0+vb... " should be "This W*h0+vb..."

Note from the Author or Editor:
Correct. I made the correction.

Frank Langenau  Dec 29, 2019  Mar 06, 2020
Printed, PDF
Page 244
End of 7th para from top

The last part of the last sentence in the 7th para is "RBMs are minimizing the probability distribution of the original input form the probability distribution of the reconstructed data."
I think there is something missing after "minimizing". It should be: "...RBMs are minimizing the divergence between the probability ..." or so.

Note from the Author or Editor:
I made the correction.

Frank Langenau  Dec 29, 2019  Mar 06, 2020
Printed, PDF
Page 295
Last paragraph of the page

"The initial loss of the discriminator fluctuates ..." should be "The initial accuracy of the discriminator fluctuates ...", because the sentence ends with "... but remains considerably above 0.50." (The loss drops below 0.5.)

Note from the Author or Editor:
I made the correction.

Frank Langenau  Jan 03, 2020  Mar 06, 2020
Printed
Page 310
2nd paragraph

"The distribution is shown in Figure 13-5." - except Figure 13-5 is something else. So the distribution of classes is not shown (the numbers are not printed either)

Note from the Author or Editor:
Thanks. I removed the block of code from the book. The counts are shown in Figure 13-5, and I clarified the language to make this clear.

Joseph Schwarzbach  Jun 19, 2019  Mar 06, 2020
Printed, PDF
Page 316
1st line

The first line of this page is the same as the second to last line on page 315 and should be deleted.

Note from the Author or Editor:
I made the correction.

Frank Langenau  Jan 04, 2020  Mar 06, 2020
Printed, PDF
Page 316
First line after the first code block

The line
"The adjusted Rand index on the training set ..." must be
"The adjusted Rand index on the test set..."

Note from the Author or Editor:
I made the correction.

Frank Langenau  Jan 04, 2020  Mar 06, 2020