Errata for Practical Statistics for Data Scientists


The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.


Version Location Description Submitted By Date submitted Date corrected
Page 213
(3rd of paragraph "Interpreting the Coefficients and Odds Ratios")

Why bother with an odds ratio rather than probabilities? We work with odds because
the coefficient βj in the logistic regression is the log of the odds ratio for Xj .

Anyway, we can't state that the coefficient βj is the log of the odds ratio for Xj, since that would mean we take the log of a log.

I think the correct statement would be: "We work with odds because the coefficient βj in the logistic regression is the *change in* the log of the odds ratio for Xj."

Note from the Author or Editor:
In the cited location:
EXISTING
Why bother with an odds ratio rather than probabilities? We work with odds because the coefficient βj in the logistic regression is the log of the odds ratio for Xj .
CHANGE TO (end of second sentence)
Why bother with an odds ratio rather than probabilities? We work with odds because the coefficient βj in the logistic regression is the CHANGE IN log(Odds(Y=1)) associated with a change in Xj.
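
For illustration, here is a minimal Python sketch (not from the book; synthetic data and statsmodels' formula API assumed) of the corrected interpretation: the fitted coefficient is the change in log(Odds(Y=1)) per unit change in the predictor, so exponentiating it gives the odds ratio.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: true log-odds of y=1 are -0.5 + 0.8 * x
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.8 * x))))
df = pd.DataFrame({'x': x, 'y': y})

model = smf.logit('y ~ x', data=df).fit(disp=0)
beta = model.params['x']

# beta is the change in log(Odds(Y=1)) for a one-unit increase in x;
# exp(beta) is the corresponding odds ratio.
print(f'change in log-odds per unit x: {beta:.3f}')
print(f'odds ratio per unit x: {np.exp(beta):.3f}')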

Mohammed Kamal Alsyd   Jun 20, 2023 
Example: Web Stickiness
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch03.html#:~:text=The%20question%20is%20whether,e.%2C%20is%20statistically%20significant.

In the sub-section Example: Web Stickiness of Permutation Test of Resampling of Chapter 3, there is a conflict between two statements as below:

S1: Page B has times that are greater than those of page A by 35.67 seconds, on average. The question is whether this difference is within the range of what random chance might produce, i.e., is statistically significant.

--> The conclusion "i.e., is statistically significant" seems to be misleading when compared to the following statement:

S2: This suggests that the observed difference in time between page A and page B is well within the range of chance variation and thus is not statistically significant.

In brief, I think it should be "i.e., is not statistically significant." in S1.

Note from the Author or Editor:
p. 99, center paragraph, second sentence should read: "The question is whether this difference is within the range of what random chance might produce, i.e. is not statistically significant." [the "not" had been left out]
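
For readers who want to reproduce the logic, here is a minimal permutation-test sketch (synthetic session times, not the book's web-stickiness data): when the observed difference sits well inside the permutation distribution, it is within the range of chance variation, i.e. not statistically significant.

import numpy as np

rng = np.random.default_rng(0)
# Made-up session times (seconds) for two pages
page_a = rng.exponential(scale=120, size=21)
page_b = rng.exponential(scale=120, size=15)

observed = page_b.mean() - page_a.mean()
pooled = np.concatenate([page_a, page_b])

# Shuffle the pooled times and recompute the difference in means many times
perm_diffs = []
for _ in range(1000):
    perm = rng.permutation(pooled)
    perm_diffs.append(perm[:len(page_b)].mean() - perm[len(page_b):].mean())

# Fraction of permuted differences at least as extreme as the observed one
p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print(f'observed difference: {observed:.2f} s, p-value: {p_value:.3f}')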

Anonymous  May 27, 2023 
Printed, PDF, ePub, Mobi, Other Digital Version
Python code using the get_dummies function
NA

The behavior of the get_dummies function has changed recently. Instead of creating integer columns containing 0 and 1, the function now creates boolean columns with True and False values. This causes statsmodels model building to fail with an exception.

To revert to the original behavior, add the keyword argument `dtype=int` to the get_dummies calls.
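
A minimal sketch of the workaround (the data frame and column names here are made up, not the book's loan data):

import pandas as pd

df = pd.DataFrame({'purpose_': ['car', 'other', 'car'], 'dti': [1.0, 5.5, 18.1]})

# Recent pandas versions return boolean dummy columns by default;
# dtype=int restores the 0/1 integer columns that statsmodels expects.
X = pd.get_dummies(df, columns=['purpose_'], drop_first=True, dtype=int)
print(X.dtypes)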

Peter Gedeck
 
May 24, 2023 
Page 122
first paragraph

For the grand average, sum of squares is the departure of the grand average from 0, squared, times 20 (the number of observations). The degrees of freedom for the grand average is 1, by definition.

The degrees of freedom for the grand average is 19, not 1. Also, I think the whole page needs review, since the code results don't match the written text; for example, "For the residuals, degrees of freedom is 20 (all observations can vary)" while it is actually 16, not 20.

Note from the Author or Editor:
The last sentence in this paragraph, "The degrees of freedom for the grand average is 1, by definition." should be eliminated, without a replacement.

Mohammed Kamal Alsyd  May 05, 2023 
Page 37
The second last code snippet

The R code snippet will not generate a figure similar to Figure 1-8. But the Python code snippet at the bottom of the same page will.

Note from the Author or Editor:
This is a temporary issue that was introduced in version 3.4.0 of ggplot. The ggplot developers are aware of the problem and have fixed it. An updated version has not been released yet.

https://github.com/tidyverse/ggplot2/pull/5045
https://github.com/tidyverse/ggplot2/issues/5037

Jiamin Wang  Jan 05, 2023 
Ch 2
text

In Chapter 2 there is an external link that does not work "Fooled by Randomness Through Selection Bias" at location 1347. Please update a valid external URL.

Note from the Author or Editor:
This is referring to link https://oreil.ly/v_Q0u

The correct link is now:
https://www.priceactionlab.com/Blog/2012/06/fooled-by-randomness-through-selection-bias/

Anonymous  Mar 17, 2022 
Ch 1
text

In Chapter 1, there is an external link that does not work: "step-by-step guide to creating a boxplot." at location 684. Please update with a valid external URL.

Note from the Author or Editor:
The link:
https://oreil.ly/wTpnE

should be replaced with:
https://web.archive.org/web/20190415201405/https://www.oswego.edu/~srp/stats/bp_con.htm

Anonymous  Mar 17, 2022 
Page 44
Ordered Item 1

The writer's statement that ggplot has functions facet_wrap and facet_grid is unclear. It is unclear because the writer instructs the reader to use the function facet_grid in R but does not provide the R syntax. The Python facet_grid syntax is provided on page 45.

Note from the Author or Editor:
The example uses facet_wrap as there is only one conditioning variable. The R function facet_wrap will, by default, set the number of rows and columns in such a way that the resulting grid is close to square. In the example, this leads to a 2x2 grid. If there are two conditioning variables, you would need to use facet_grid.

In general, we recommend consulting the package documentation. The package ggplot comes with comprehensive documentation at https://ggplot2.tidyverse.org/index.html.

I'm going to add a sentence to the manuscript to highlight the fact that facet_grid would be used for two conditioning variables.

Stephen Dawson  Mar 10, 2022 
Page 19
3d bullet point of Key Ideas

Bullet point suggests that mean absolute deviation is robust which contradicts 2nd paragraph of page 16

Note from the Author or Editor:
We change the 2nd and 3rd paragraph on page 16 to:

Neither the variance, the standard deviation, nor the mean absolute deviation is fully robust to outliers and extreme values
(see <<Median>> for a discussion of robust estimates for location).
The variance and standard deviation are especially sensitive to outliers since they are based on the squared deviations;
more robust is the _median absolute deviation from the median_ or MAD:

Gitlab code is updated 2021-01-04.
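
A small sketch (not from the book, made-up numbers) illustrating the corrected statement that the standard deviation is far more sensitive to an extreme value than the MAD; it assumes scipy's median_abs_deviation function.

import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # one extreme value

print(f'standard deviation: {np.std(data, ddof=1):.1f}')  # inflated by the outlier
print(f'MAD: {stats.median_abs_deviation(data):.1f}')     # stays small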

Anonymous  Dec 29, 2021 
Page 441
The Boosting Algorithm section, step 3

The equation for alpha_m is surely wrong, as in my Kindle app it is shown as
alpha_m = (log 1 - e_m)/e_m

This can't be right as it would simplify to -1

According to the Wikipedia article on AdaBoost, I suppose the formula should be alpha_m = 1/2 * ln((1 - e_m)/e_m), which would make more sense.

Note from the Author or Editor:
I can confirm the issue; it needs to be corrected as suggested.

- Gitlab updated to latexmath:[$\alpha_m = \frac12 \log\frac{1 - e_m}{e_m}$]
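
As a quick worked check of the submitter's point (the error rate e_m = 0.3 is chosen only for illustration): the garbled form collapses to a constant, while the corrected form depends on the error rate.

\frac{(\log 1) - e_m}{e_m} = \frac{0 - e_m}{e_m} = -1
\qquad\text{whereas}\qquad
\alpha_m = \frac{1}{2}\log\frac{1 - e_m}{e_m}\;\Big|_{\,e_m=0.3} = \frac{1}{2}\log\frac{0.7}{0.3} \approx 0.42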

Tapani Raunio  Dec 06, 2021 
Printed
Page 217
R code block at end of page

On page 217 of the printed book (2nd edition), the R code at the end of the page reads:

terms <- predict(logistic_gam, type='terms')
partial_resid <- resid(logistic_model) + terms
df <- data.frame(payment_inc_ratio = loan_data[, 'payment_inc_ratio'],
terms = terms[, 's(payment_inc_ratio)'],
partial_resid = partial_resid[, 's(payment_inc_ratio)'])

I believe that partial_resid here should be:

partial_resid <- resid(logistic_gam) + terms

I'm not sure if the graph produced on page 218 (Figure 5-4) using this data needs correction or not, as the difference using logistic_model and logistic_gam is quite minor, and it is hard to tell comparing a screenshot and the printed page.

Note from the Author or Editor:
The line needs to be changed in the asciidoc code. It is already corrected in the book's GitHub repository; however, I overlooked changing the book text. That is now corrected too.

Gabriel Simmonds  May 11, 2021 
Printed, PDF, ePub, Mobi, Other Digital Version
Page 191
Figure 4-12

I believe that Figure 4-12 on page 191 is in error because the code used to generate it (Chapter 4 - Regression and Prediction.R from the practical-statistics-for-data-scientists-master.zip file) appears to be in error.

The code states:

terms1 <- predict(lm_spline, type='terms')
partial_resid1 <- resid(lm_spline) + terms


but surely partial_resid1 should be:

partial_resid1 <- resid(lm_spline) + terms1

which would give rise to a slightly different plot?

Note from the Author or Editor:
I can confirm the error in the R code. The R code is not printed in the book, but the image created is. As mentioned in the errata, the difference in the plot is only small.

I changed the code to create the correct plot.

New figure file images/psds_0412.png added to book repository. This file will need to be processed (cropping the whitespace) to replace the file psds2_0412.png.

Gabriel Simmonds  Apr 25, 2021 
Printed, PDF, ePub, Mobi, Other Digital Version
Page 127
Python code end of page

Issue reported on github repository:

The following code refers to the chi2 value calculated using the permutation test (chi2observed) instead of the chi2 value computed using the scipy stats module (chisq).

chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')

I believe the first print line should be:
print(f'Observed chi2: {chisq:.4f}') since the purpose is to demonstrate using the chi2 module for statistical tests rather than the previous section's permutation test.

This is correct. Code and book text corrected.

Peter Gedeck
 
Apr 09, 2021 
PDF, ePub
Page 4
First and second bullets in the "Further Reading" section.

The link to the pandas documentation ( https://oreil.ly/UGX-4 ) results in a 404 error. The O'Reilly redirect appears to attempt to access https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes .

Note from the Author or Editor:
We need to change the redirect
https://oreil.ly/UGX-4
to point to:
https://pandas.pydata.org/docs/user_guide/basics.html#dtypes
Ideally, this can be done without changing the short URL.

--redirect all set (O'Reilly errata team)

Matt Slaven  Mar 29, 2021  Mar 30, 2021
Printed, PDF, ePub, Mobi, Other Digital Version
Page 200
predicted probabilities

A reader reported different results for the predictions from the Naive Bayes model. The difference was caused by the following: in version 4 of R, read.csv no longer converts string columns automatically into factors. The old behavior can be restored by setting stringsAsFactors=TRUE.

There is no change required in the book. The GitHub repository will be updated with the change.

Peter Gedeck
 
Feb 27, 2021 
Printed, PDF, ePub, Mobi, Other Digital Version
Page 323
First code snippet

The output of the first line of x is incorrect and should be major_purchase instead of car.

> x
dti payment_inc_ratio home_ purpose_
1 1.00 2.39320 RENT major_purchase
2 5.55 4.57170 OWN small_business
3 18.08 9.71600 RENT other
4 10.08 12.21520 RENT debt_consolidation
5 7.06 3.90888 RENT other


Gitlab code corrected.

Peter Gedeck
 
Feb 22, 2021 
Printed
Page 306
Python code middle

Due to a change in one of the Python packages, the code causes an error. The following code is working:

fig, ax = plt.subplots(figsize=(5, 5))
dendrogram(Z, labels=list(df.index), color_threshold=0)
plt.xticks(rotation=90)
ax.set_ylabel('distance')


Book text changed

Peter Gedeck
 
Dec 07, 2020 
Printed
Page 302
last paragraph

"Figure 7-7 shows the cumulative percent of variance explained for the default data for the number of clusters ranging from 2 to 15."

Just a few minor things here:

- "2 to 15" should be replaced by "2 to 14"

- "default data" should be replaced by "stock data"

- For consistency, the Python code on the following page might be adjusted so that range(2, 15) is used instead of range(2, 14).

Note from the Author or Editor:
All suggestions confirmed.

Book text changed.

Marcus Fraaß  Dec 06, 2020 
Printed
Page 240
2nd paragraph

Since the R code yields TRUE for the prediction knn_pred == 'paid off', the sentence

"The KNN prediction is for the loan to default."

seems to be wrong and "default" should be replaced with "be paid off".

Note from the Author or Editor:
This is correct. The sentence should read:

The KNN prediction is for the loan to be paid off.

Marcus Fraaß  Dec 06, 2020 
Printed
Page 213
7th (4th of paragraph "Interpreting the Coefficients and Odds Ratios")

Regarding the paragraph

"An example will make this more explicit.
For the model fit in "Logistic Regression and the GLM" on page 210,
the regression coefficient for +purpose_small_business+ is 1.21526.
This means that a loan to a small business compared to a loan to pay off credit card debt reduces the odds of defaulting versus being paid off by exp(1.21526) ≈ 3.4.
Clearly, loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans."

Suggested change:
This means that a loan to a small business compared to a loan to pay off credit card debt *increases* the odds of defaulting versus being paid off by exp(1.21526) ≈ 3.4.

Best regards

Note from the Author or Editor:
The errata is correct. Gitlab document changed accordingly - PG

Marcus Fraaß  Nov 17, 2020 
Printed, PDF, ePub
Page 175
2nd paragraph

Regarding the paragraph

"Location and house size appear to have a strong interaction.
For a home in the lowest +ZipGroup+,
the slope is the same as the slope for the main effect +SqFtTotLiving+,
which is $118 per square foot (this is because _R_ uses _reference_ coding for factor variables; see 'Factor Variables in Regression').
For a home in the highest +ZipGroup+,
the slope is the sum of the main effect plus +SqFtTotLiving:ZipGroup5+,
or $115 + $227 = $342 per square foot.
In other words, adding a square foot in the most expensive zip code group boosts the predicted sale price by a factor of almost three, compared to the average boost from adding a square foot."

I am thinking about two things:
1.) The coefficient for +SqFtTotLiving+ is 1.148e+02, but it is stated that "the main effect +SqFtTotLiving+ [...] is $118 per square foot". I think it should be adjusted to $115 as mentioned in the subsequent sentence.

2.) Since R uses reference coding (and not deviation coding), I wonder whether the last sentence is correct. Is it really the "average boost from adding a square foot" that the total effect of the most expensive zip code group is compared to? I mean, if you don't include any interaction effect, the coefficient of +SqFtTotLiving+ would be the "average boost" (as far as I think about it). But in the setting with an interaction effect and reference coding, I would have interpreted it as "compared to the average boost for the lowest zip code group". Or am I wrong, and the average boost is the same as the main effect, which in turn is equal for the first ZipGroup?

Best regards

Note from the Author or Editor:
Thank you for your feedback. This corresponds to page 175 second paragraph in the print edition.

1) $118 should be replaced with $115

2) We are going to change the end of the second paragraph for clarification to:
... to the average boost from adding a square foot in the lowest zip code group.

Gitlab is changed.
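
A minimal sketch (synthetic data, not the book's house-sales dataset; statsmodels' formula API assumed) of how the per-group slope is read off a model with an interaction under reference coding, mirroring the $115 + $227 = $342 calculation in the text:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: price per square foot differs by zip group
rng = np.random.default_rng(1)
n = 500
size = rng.uniform(800, 4000, n)
zip_group = rng.choice(['1', '2', '3', '4', '5'], n)
per_sqft = {'1': 115, '2': 150, '3': 200, '4': 270, '5': 342}
price = np.array([per_sqft[g] for g in zip_group]) * size + rng.normal(0, 20000, n)
df = pd.DataFrame({'price': price, 'size': size, 'zip_group': zip_group})

model = smf.ols('price ~ size * zip_group', data=df).fit()

# With reference coding, the 'size' coefficient is the slope for the reference
# (lowest) zip group; the slope for group 5 adds the interaction coefficient.
slope_group1 = model.params['size']
interaction = [p for p in model.params.index if 'size' in p and '5' in p][0]
slope_group5 = slope_group1 + model.params[interaction]
print(f'slope, lowest zip group:  {slope_group1:.0f}')
print(f'slope, highest zip group: {slope_group5:.0f}')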

Marcus Fraaß  Nov 10, 2020 
PDF
Page 79
Second to last paragraph in key terms box

In the key term box, under "Binomial distribution" the sentence reads as follows: "Distribution of number of successes in x trials."
However, I think it should read "n trials" for the sake of consistency with the first sentence following the key terms box, where it reads: "The binomial distribution is the frequency distribution of the number of successes (x) in a given number of trials (n) with specified probability (p) of success in each trial."
I find it confusing that in the sentence after the box the number of trials is abbreviated with n, while in the box it is abbreviated as x.

Best regards,
Michael

Note from the Author or Editor:
Thank you for the feedback.

I checked other uses in the book and we consistently use _n_ trials. We will change this. (Done in Gitlab)

Michael Ustaszewski  Nov 05, 2020 
Printed
Page 170
Key terms box - first item

Change definition of `Correlated variables` to

Variables that tend to move in the same direction - when one goes up so does the other, and vice versa (with negative correlation, when one goes up the other goes down). When the predictor variables are highly correlated, it is difficult to interpret the individual coefficients.

Peter Gedeck
 
Sep 16, 2020  Oct 02, 2020
Printed
Page 257
Section 'Controlling tree complexity in _Python_'

scikit-learn implements cost-complexity pruning like R does

In version 0.22, scikit-learn implemented minimal cost-complexity pruning for decision trees.

https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py
https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning

Replace with:

===== Controlling tree complexity in _Python_
In +scikit-learn+'s decision tree implementation, the complexity parameter is called +ccp_alpha+. The default value is 0, which means that the tree is not pruned; increasing the value leads to smaller trees. You can use GridSearchCV to find an optimal value.

There are a number of other model parameters that allow controlling the tree size. For example, we can vary +max_depth+ in the range 5 to 30 and +min_samples_split+ between 20 and 100. The +GridSearchCV+ method in +scikit-learn+ is a convenient way to combine an exhaustive search through all combinations with cross-validation. An optimal parameter set is then selected using the cross-validated model performance.
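
A minimal sketch of the suggested approach (the dataset and the parameter ranges are illustrative, not the book's loan data):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Exhaustive search over tree-size parameters, scored by cross-validation
param_grid = {
    'ccp_alpha': [0.0, 0.001, 0.005, 0.01],
    'max_depth': [5, 10, 20, 30],
    'min_samples_split': [20, 50, 100],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)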

Peter Gedeck
 
Sep 16, 2020  Oct 02, 2020
Printed
Page 66
End of last paragraph

This was already changed once to $55,836, but the actual value should be $55,734. I remember that I found this confusing too, so I suggest we add a clarification to this.

... for which the mean was $55,734. Note that this is the mean of the subset of 20 records and not the mean of the bootstrap analysis, $55,836.

Changed in repository

Peter Gedeck
 
Sep 16, 2020  Oct 02, 2020
Printed
Page 27
1st paragraph

Text currently states:
...flights by the cause of delay at Dallas/Fort Worth Airport since 2010.

should be:
...flights by the cause of delay at Dallas/Fort Worth Airport in 2010.

Peter Gedeck
 
Sep 16, 2020  Oct 02, 2020
PDF
Page 98
4th and 5th paragraphs

In Google Analytics the average session time does not measure the time spent on a given page (as stated in the book); the correct metric is average time on page. Furthermore, in the last paragraph we have "Also note that Google Analytics, which is how we measure session time, cannot measure session time for the last session a person visits." I think it would be more correct to say: Also note that Google Analytics, which is how we measure average time on page, cannot measure the time spent on the last page within a session. Finally, Google Analytics does indeed set the time spent on the last page in a session to zero, and a single-page session is also set to zero. Having said that, this is true only if there are no user interaction events triggered on that page, such as click events, scroll events, video events, etc.

Note from the Author or Editor:
Thank you for the feedback. We will change the text in the book to:

Also note that Google Analytics, which is how we measure average time on page, cannot measure the time spent
on the last page within a session. ((("Google Analytics")))
Google Analytics will set the time spent on the last page in a session to zero, unless the user interacts with the page, e.g. clicks or scrolls. This is also the case for single-page sessions. The data requires additional processing to take this into account.

Joao Correia  Sep 06, 2020  Oct 02, 2020
PDF
Page 84
2nd line of code

The mean of the random values generated using the rexp(n=100, rate=0.2) function in R is ~5, which makes sense given that the mean number of events per time period is 0.2. However, for the Python code given in the book as stats.expon.rvs(0.2, size=100) we have the mean of the random values generated ~1.2, where loc=0.2 is the starting location for the exponential distribution. To get the same range of random values as those obtained with R we need to use stats.expon.rvs(scale=5, size=100) instead.


Note from the Author or Editor:
The errata is correct and requires a change in the book.

Suggested change:

The +scipy+ implementation in +Python+ specifies the exponential distribution using +scale+ instead of rate. With scale being the inverse of rate, the corresponding command in Python is:

.Python
[source,python]
----
stats.expon.rvs(scale=1/0.2, size=100)
stats.expon.rvs(scale=5, size=100)
----

Joao Correia  Sep 05, 2020  Oct 02, 2020
Printed
Page 279
1st paragraph inside box

The sentence starting with "The xgboost parameters..." is duplicated in the second paragraph.

Delete first paragraph.

Peter Gedeck
 
Jun 06, 2020  Jun 19, 2020
Printed
Page 66
end of page

The mean of the sample of 20 datasets that was used to generate Figure 2-9 was $55,734.

Replace $62,231 with $55,734.

Peter Gedeck
 
Jun 06, 2020  Jun 19, 2020
Printed
Page 4
Further Reading

First bullet point in "Further Reading" is repeated in the second half of the second bullet point.

Delete first bullet point

Added as gitlab issue

Peter Gedeck
 
Jun 06, 2020  Jun 19, 2020
Other Digital Version
Example notebook (3)
F-Statistic section

There are two functions that are used. As far as I understand, they should return the same result. This is not the case with the code as it is written.


model = smf.ols('Time ~ Page', data=four_sessions).fit()

aov_table = sm.stats.anova_lm(model)
print(aov_table)

df sum_sq mean_sq F PR(>F)
Page 3.0 831.4 277.133333 2.739825 0.077586
Residual 16.0 1618.4 101.150000 NaN NaN


res = stats.f_oneway(four_sessions[four_sessions.Page == 'Page 1'].Time,
four_sessions[four_sessions.Page == 'Page 2'].Time,
four_sessions[four_sessions.Page == 'Page 3'].Time,
four_sessions[four_sessions.Page == 'Page 4'].Time)
print(f'F-Statistic: {res.statistic / 2:.4f}')
print(f'p-value: {res.pvalue / 2:.4f}')

F-Statistic: 1.3699
p-value: 0.0388


As we can see, the first F-statistic and p-value are two times bigger than the second ones. But there is no explanation at all to tell the reader why...


To get the same result, I had to pivot the data frame before the call to f_oneway:

four_sessions['index'] = four_sessions.reset_index().index // 4
p_sessions = four_sessions.pivot(index='index', columns='Page', values='Time')
r = stats.f_oneway(p_sessions['Page 1'], p_sessions['Page 2'], p_sessions['Page 3'], p_sessions['Page 4'])
print(r)

F_onewayResult(statistic=2.739825341901467, pvalue=0.0775862152580146)

Note from the Author or Editor:
This only impacts the jupyter notebook.

The code with the error (division by two when printing the F-statistic and the p-value) is not included in the book. The mistake was due to copy/paste from the t_test example code.

The jupyter notebook contains the correct code now.

Fabrice Kinnar  May 06, 2020  Jun 19, 2020
Printed, PDF, ePub, Mobi, Other Digital Version
Page 2735/9783
5th paragraph

In the F-Statistics section:

"For the residuals, degrees of freedom is 20 (all observations can vary), and SS is the sum of squared difference between the individual observations and the treatment means. Mean squares (MS) is the sum of squares divided by the degrees of freedom."

Bruce, Peter; Bruce, Andrew; Gedeck, Peter. Practical Statistics for Data Scientists (Kindle Locations 2760-2762). O'Reilly Media. Kindle Edition.

When you run the ANOVA in R or Python, you get 16 for df in Residuals, not 20!

Note from the Author or Editor:
The text should read:

For the residuals, degrees of freedom is 16 (20 observations, 16 of which can vary after the grand mean and the treatment means are set), and SS is the sum of squared differences between the individual observations and the treatment means. Mean squares (MS) is the sum of squares divided by the degrees of freedom.
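
A small sketch confirming the corrected count (synthetic data standing in for the four_sessions example: 4 pages x 5 observations = 20 rows; statsmodels assumed):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
four_sessions = pd.DataFrame({
    'Page': np.repeat(['Page 1', 'Page 2', 'Page 3', 'Page 4'], 5),
    'Time': rng.normal(170, 10, 20),
})

model = smf.ols('Time ~ Page', data=four_sessions).fit()
print(sm.stats.anova_lm(model))  # the Residual row shows df = 16 (20 observations minus 4 estimated group means)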

Fabrice Kinnar  May 06, 2020  Jun 19, 2020