Version 
Location 
Description 
Submitted By 
Date submitted 
Date corrected 
Other Digital Version 
Example notebook (3)
F-Statistic section 
There are two functions that are used. As far as I understand, they should return the same result. This is not the case with the code as it is written.
model = smf.ols('Time ~ Page', data=four_sessions).fit()
aov_table = sm.stats.anova_lm(model)
print(aov_table)
df sum_sq mean_sq F PR(>F)
Page 3.0 831.4 277.133333 2.739825 0.077586
Residual 16.0 1618.4 101.150000 NaN NaN
res = stats.f_oneway(four_sessions[four_sessions.Page == 'Page 1'].Time,
four_sessions[four_sessions.Page == 'Page 2'].Time,
four_sessions[four_sessions.Page == 'Page 3'].Time,
four_sessions[four_sessions.Page == 'Page 4'].Time)
print(f'F-Statistic: {res.statistic / 2:.4f}')
print(f'p-value: {res.pvalue / 2:.4f}')
F-Statistic: 1.3699
p-value: 0.0388
As we can see, the first F-statistic and p-value are twice as big as the second ones, but there is no explanation at all to tell the reader why...
To get the same result, I had to pivot the data frame before the call to f_oneway:
four_sessions['index'] = four_sessions.reset_index().index // 4
p_sessions = four_sessions.pivot(index='index', columns='Page', values='Time')
r = stats.f_oneway(p_sessions['Page 1'], p_sessions['Page 2'], p_sessions['Page 3'], p_sessions['Page 4'])
print(r)
F_onewayResult(statistic=2.739825341901467, pvalue=0.0775862152580146)
Note from the Author or Editor: This only impacts the Jupyter notebook.
The code with the error (division by two when printing the F-statistic and the p-value) is not included in the book. The mistake was due to copy/paste from the t-test example code.
The jupyter notebook contains the correct code now.

Fabrice Kinnar 
May 06, 2020 
Jun 19, 2020 

Page ch 1
text 
In Chapter 1, there is an external link that does not work: "step-by-step guide to creating a boxplot" at location 684. Please update with a valid external URL.
Note from the Author or Editor: The link:
https://oreil.ly/wTpnE
should be replaced with:
https://web.archive.org/web/20190415201405/https://www.oswego.edu/~srp/stats/bp_con.htm

Anonymous 
Mar 17, 2022 


Page Ch 2
text 
In Chapter 2, there is an external link that does not work: "Fooled by Randomness Through Selection Bias" at location 1347. Please update with a valid external URL.
Note from the Author or Editor: This is referring to link https://oreil.ly/v_Q0u
The correct link is now:
https://www.priceactionlab.com/Blog/2012/06/fooled-by-randomness-through-selection-bias/

Anonymous 
Mar 17, 2022 


Page Page 37
The second last code snippet 
The R code snippet will not generate a figure similar to Figure 1-8. But the Python code snippet at the bottom of the same page will.
Note from the Author or Editor: This is a temporary issue that was introduced in version 3.4.0 of ggplot2. The ggplot2 developers are aware of the problem and fixed it. An updated version has not been released yet.
https://github.com/tidyverse/ggplot2/pull/5045
https://github.com/tidyverse/ggplot2/issues/5037

Jiamin Wang 
Jan 05, 2023 

Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page Python code using the get_dummies function
NA 
The behavior of the get_dummies method has changed recently. Instead of creating integer columns containing 0 and 1, the function now creates boolean columns with True and False values. This causes statsmodels model building to fail with an exception.
To revert to the original behavior, add the keyword argument `dtype=int` to the get_dummies function calls.
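A minimal sketch of the fix described above (the `purpose` column is illustrative, not the book's loan data):

```python
import pandas as pd

# Illustrative data frame; the book's loan data uses similar string columns
df = pd.DataFrame({'purpose': ['car', 'small_business', 'car']})

# Recent pandas versions create boolean True/False dummy columns by default
bool_dummies = pd.get_dummies(df)

# dtype=int restores the original 0/1 integer columns expected by statsmodels
int_dummies = pd.get_dummies(df, dtype=int)
```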

Peter Gedeck 
May 24, 2023 


Page Example: Web Stickiness
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch03.html#:~:text=The%20question%20is%20whether,e.%2C%20is%20statistically%20significant. 
In the subsection Example: Web Stickiness of Permutation Test of Resampling of Chapter 3, there is a conflict between two statements as below:
S1: Page B has times that are greater than those of page A by 35.67 seconds, on average. The question is whether this difference is within the range of what random chance might produce, i.e., is statistically significant.
> The conclusion "i.e., is statistically significant" seems to be misleading when comparing to the following statement:
S2: This suggests that the observed difference in time between page A and page B is well within the range of chance variation and thus is not statistically significant.
In brief, I think it should be "i.e., is not statistically significant." in S1.
Note from the Author or Editor: p. 99, center paragraph, second sentence should read: "The question is whether this difference is within the range of what random chance might produce, i.e. is not statistically significant." [the "not" had been left out]

Anonymous 
May 27, 2023 

Printed 
Page 4
Further Reading 
First bullet point in "Further Reading" is repeated in the second half of the second bullet point.
Delete the first bullet point.
Added as GitLab issue

Peter Gedeck 
Jun 06, 2020 
Jun 19, 2020 
PDF, ePub 
Page 4
First and second bullets in the "Further Reading" section. 
The link to the pandas documentation ( https://oreil.ly/UGX4 ) results in a 404 error. The O'Reilly redirect appears to attempt to access https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes .
Note from the Author or Editor: We need to change to redirect
https://oreil.ly/UGX4
to redirect to:
https://pandas.pydata.org/docs/user_guide/basics.html#dtypes
Ideally, this can be done without changing the short URL
redirect all set (O'Reilly errata team)

Matt Slaven 
Mar 29, 2021 
Mar 30, 2021 

Page 19
3rd bullet point of Key Ideas 
The bullet point suggests that mean absolute deviation is robust, which contradicts the 2nd paragraph of page 16.
Note from the Author or Editor: We changed the 2nd and 3rd paragraphs on page 16 to:
Neither the variance, the standard deviation, nor the mean absolute deviation is fully robust to outliers and extreme values
(see <<Median>> for a discussion of robust estimates for location).
The variance and standard deviation are especially sensitive to outliers since they are based on the squared deviations;
more robust is the _median absolute deviation from the median_ or MAD:
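As a sketch of why the MAD is robust, it can be computed with just the Python standard library (illustrative numbers, not the book's data; note that R's mad() additionally multiplies by a consistency constant of about 1.4826 by default):

```python
from statistics import median, stdev

def mad(values):
    # Median absolute deviation from the median (unscaled)
    m = median(values)
    return median(abs(x - m) for x in values)

data = [1, 2, 3, 4, 100]
# The single outlier inflates the standard deviation but barely moves the MAD
robust = mad(data)        # 1
sensitive = stdev(data)   # about 43.6
```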
Gitlab code is updated 2021-01-04.

Anonymous 
Dec 29, 2021 

Printed 
Page 27
1st paragraph 
Text currently states:
...flights by the cause of delay at Dallas/Fort Worth Airport since 2010.
should be:
...flights by the cause of delay at Dallas/Fort Worth Airport in 2010.

Peter Gedeck 
Sep 16, 2020 
Oct 02, 2020 

Page 44
Ordered Item 1 
The writer's statement that ggplot has the functions facet_wrap and facet_grid is unclear. It is unclear because the writer instructs the reader to use the function facet_grid in R but does not provide the R syntax. The Python facet_grid syntax is provided on page 45.
Note from the Author or Editor: The example uses facet_wrap as there is only one conditioning variable. The R function facet_wrap will, by default, set the number of rows and columns in such a way that the resulting grid is close to square. In the example, this leads to a 2x2 grid. If there are two conditioning variables, you would need to use facet_grid.
In general, we recommend consulting the package documentation. The package ggplot2 comes with comprehensive documentation at https://ggplot2.tidyverse.org/index.html.
I'm going to add a sentence to the manuscript to highlight the fact that facet_grid would be used for two conditioning variables.

Stephen Dawson 
Mar 10, 2022 

Printed 
Page 66
end of page 
The mean of the sample of 20 records that was used to generate Figure 2-9 was $55,734.
Replace $62,231 with $55,734.

Peter Gedeck 
Jun 06, 2020 
Jun 19, 2020 
Printed 
Page 66
End of last paragraph 
This was already changed once to $55,836, but the actual value should be $55,734. I remember that I found this confusing too, so I suggest we add a clarification to this.
... for which the mean was $55,734. Note that this is the mean of the subset of 20 records and not the mean of the bootstrap analysis, $55,836.
Changed in repository

Peter Gedeck 
Sep 16, 2020 
Oct 02, 2020 
PDF 
Page 79
Second to last paragraph in key terms box 
In the key term box, under "Binomial distribution" the sentence reads as follows: "Distribution of number of successes in x trials."
However, I think it should read "n trials" for the sake of consistency with the first sentence following the key terms box, where it reads: "The binomial distribution is the frequency distribution of the number of successes (x) in a given number of trials (n) with specified probability (p) of success in each trial."
I find it confusing that in the sentence after the box the number of trials is abbreviated as n, while in the box it is abbreviated as x.
Best regards,
Michael
Note from the Author or Editor: Thank you for the feedback.
I checked other uses in the book and we consistently use _n_ trials. We will change this. (Done in Gitlab)
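To illustrate the n/x distinction discussed above, the binomial probability of x successes in n trials can be computed directly from the definition (a standalone sketch, not code from the book):

```python
from math import comb

def binom_pmf(x, n, p):
    # Probability of exactly x successes in n trials, success probability p
    return comb(n, x) * p**x * (1 - p)**(n - x)

# e.g. 2 successes in 5 trials with p = 0.5
prob = binom_pmf(2, 5, 0.5)  # 0.3125
```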

Michael Ustaszewski 
Nov 05, 2020 

PDF 
Page 84
2nd line of code 
The mean of the random values generated using the rexp(n=100, rate=0.2) function in R is ~5, which makes sense given that the mean number of events per time period is 0.2. However, for the Python code given in the book as stats.expon.rvs(0.2, size=100) we have the mean of the random values generated ~1.2, where loc=0.2 is the starting location for the exponential distribution. To get the same range of random values as those obtained with R we need to use stats.expon.rvs(scale=5, size=100) instead.
Note from the Author or Editor: The errata is correct and requires a change in the book.
Suggested change:
The +scipy+ implementation in +Python+ specifies the exponential distribution using +scale+ instead of rate. With scale being the inverse of rate, the corresponding command in Python is:
.Python
[source,python]

stats.expon.rvs(scale=1/0.2, size=100)
stats.expon.rvs(scale=5, size=100)
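The rate/scale relationship can be sanity-checked without scipy, since the standard library's random.expovariate is, like R's rexp, parameterized by the rate (an illustrative check, not book code):

```python
import random

random.seed(1)
# expovariate takes the rate (lambda); the distribution's mean is 1/rate
samples = [random.expovariate(0.2) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# mean is close to 1/0.2 = 5, matching rexp(n=100, rate=0.2) in R
# and stats.expon.rvs(scale=5, ...) in Python
```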


Joao Correia 
Sep 05, 2020 
Oct 02, 2020 
PDF 
Page 98
4th and 5th paragraphs 
In Google Analytics the average session time does not measure the time spent on a given page (as stated in the book); the correct metric is average time on page. Furthermore, in the last paragraph we have "Also note that Google Analytics, which is how we measure session time, cannot measure session time for the last session a person visits." I think it would be more correct to say: Also note that Google Analytics, which is how we measure average time on page, cannot measure the time spent on the last page within a session. Finally, Google Analytics does indeed set the time spent on the last page in a session to zero, and a single-page session is also set to zero. Having said that, this is true only if there are no user interaction events triggered on that page, such as click events, scroll events, video events, etc.
Note from the Author or Editor: Thank you for the feedback. We changed the text in the book to:
Also note that Google Analytics, which is how we measure average time on page, cannot measure the time spent
on the last page within a session. ((("Google Analytics")))
Google Analytics will set the time spent on the last page in a session to zero, unless the user interacts with the page, e.g. clicks or scrolls. This is also the case for single-page sessions. The data requires additional processing to take this into account.

Joao Correia 
Sep 06, 2020 
Oct 02, 2020 

Page 122
first paragraph 
For the grand average, sum of squares is the departure of the grand average from 0, squared, times 20 (the number of observations). The degrees of freedom for the grand average is 1, by definition.
The degrees of freedom for the grand average is 19, not 1. Also, I think the whole page needs review, since the code results don't match the written text. For example, "For the residuals, degrees of freedom is 20 (all observations can vary)" while it is actually 16, not 20.
Note from the Author or Editor: The last sentence in this paragraph, "The degrees of freedom for the grand average is 1, by definition." should be eliminated, without a replacement.

Mohammed Kamal Alsyd 
May 05, 2023 

Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 127
Python code end of page 
Issue reported on github repository:
The following code makes a variable reference to the chi2 value calculated using the permutation test (chi2observed), instead of the chi2 value computed using the scipy stats module (chisq).
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'pvalue: {pvalue:.4f}')
I believe the first print line should be:
print(f'Observed chi2: {chisq:.4f}'), since the purpose is to demonstrate using the chi2 module for statistical tests rather than the previous section's permutation test.
This is correct. Code and book text corrected.
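For context, a hedged sketch of how stats.chi2_contingency is typically called (the 2x2 click table below is made up, not the book's data):

```python
import numpy as np
from scipy import stats

# Hypothetical clicks/no-clicks for two headlines
clicks = np.array([[14, 986],
                   [8, 992]])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under independence
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chisq:.4f}')
print(f'p-value: {pvalue:.4f}')
```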

Peter Gedeck 
Apr 09, 2021 

Printed 
Page 170
Key terms box  first item 
Change definition of `Correlated variables` to
Variables that tend to move in the same direction - when one goes up so does the other, and vice versa (with negative correlation, when one goes up the other goes down). When the predictor variables are highly correlated, it is difficult to interpret the
individual coefficients.

Peter Gedeck 
Sep 16, 2020 
Oct 02, 2020 
Printed, PDF, ePub 
Page 175
2nd paragraph 
Regarding the paragraph
"Location and house size appear to have a strong interaction.
For a home in the lowest +ZipGroup+,
the slope is the same as the slope for the main effect +SqFtTotLiving+,
which is $118 per square foot (this is because _R_ uses _reference_ coding for factor variables; see 'Factor Variables in Regression').
For a home in the highest +ZipGroup+,
the slope is the sum of the main effect plus +SqFtTotLiving:ZipGroup5+,
or $115 + $227 = $342 per square foot.
In other words, adding a square foot in the most expensive zip code group boosts the predicted sale price by a factor of almost three, compared to the average boost from adding a square foot."
I am thinking about two things:
1.) The coefficient for +SqFtTotLiving+ is 1.148e+02, but it is stated that "the main effect +SqFtTotLiving+ [...] is $118 per square foot". I think it should be adjusted to $115 as mentioned in the subsequent sentence.
2.) Since R uses reference coding (and not deviation coding), I wonder whether the last sentence is correct. Is it really the "average boost from adding a square foot" you compare to with the total effect of the most expensive zip code group? I mean, if you don't include any interaction effect the coefficient of +SqFtTotLiving+ would be the "average boost" (as far as I think about it). But in the setting with an interaction effect and reference coding, I would have interpreted it as "compared to the average boost for the lowest zip code group". Or am I wrong and the average boost is the same as the main affect, which in turn is equal for the first ZipGroup?
Best regards
Note from the Author or Editor: Thank you for your feedback. This corresponds to page 175 second paragraph in the print edition.
1) $118 should be replaced with $115
2) We are going to change the end of the second paragraph for clarification to:
... to the average boost from adding a square foot in the lowest zip code group.
Gitlab is changed.

Marcus Fraaß 
Nov 10, 2020 

Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 191
Figure 4-12 
I believe that Figure 4-12 on page 191 is in error because the code used to generate it (Chapter 4 - Regression and Prediction.R from the practical-statistics-for-data-scientists-master.zip file) appears to be in error.
The code states:
terms1 <- predict(lm_spline, type='terms')
partial_resid1 <- resid(lm_spline) + terms
but surely partial_resid1 should be:
partial_resid1 <- resid(lm_spline) + terms1
which would give rise to a slightly different plot?
Note from the Author or Editor: I can confirm the error in the R code. The R code is not printed in the book, but the image created is. As mentioned in the errata, the difference in the plot is only small.
I changed the code to create the correct plot.
New figure file images/psds_0412.png added to book repository. This file will need to be processed (cropping the whitespace) to replace the file psds2_0412.png.

Gabriel Simmonds 
Apr 25, 2021 

Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 200
predicted probabilities 
A reader reported different results for the predictions from the Naive Bayes model. The change was caused by the following. In version 4 of R, read.csv no longer converts string columns automatically into factors. The old behavior can be restored by setting stringsAsFactors=TRUE .
There is no change required in the book. The GitHub repository will be updated with the change.

Peter Gedeck 
Feb 27, 2021 

Printed 
Page 213
7th paragraph (4th of the section "Interpreting the Coefficients and Odds Ratios") 
Regarding the paragraph
"An example will make this more explicit.
For the model fit in "Logistic Regression and the GLM" on page 210,
the regression coefficient for +purpose_small_business+ is 1.21526.
This means that a loan to a small business compared to a loan to pay off credit card debt reduces the odds of defaulting versus being paid off by exp(1.21526) ≈ 3.4.
Clearly, loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans."
Suggested change:
This means that a loan to a small business compared to a loan to pay off credit card debt *increases* the odds of defaulting versus being paid off by exp(1.21526) ≈ 3.4.
Best regards
Note from the Author or Editor: The errata is correct. Gitlab document changed accordingly  PG
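The arithmetic in the corrected sentence can be verified with a one-liner (a quick check, not book code):

```python
from math import exp

coef = 1.21526  # coefficient for purpose_small_business
odds_ratio = exp(coef)
# exp(1.21526) is about 3.37, so the odds of defaulting *increase*
# by a factor of roughly 3.4
```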

Marcus Fraaß 
Nov 17, 2020 

Printed 
Page 217
R code block at end of page 
On page 217 of the printed book (2nd edition), the R code at the end of the page reads:
terms <- predict(logistic_gam, type='terms')
partial_resid <- resid(logistic_model) + terms
df <- data.frame(payment_inc_ratio = loan_data[, 'payment_inc_ratio'],
terms = terms[, 's(payment_inc_ratio)'],
partial_resid = partial_resid[, 's(payment_inc_ratio)'])
I believe that partial_resid here should be:
partial_resid <- resid(logistic_gam) + terms
I'm not sure if the graph produced on page 218 (Figure 5-4) using this data needs correction or not, as the difference using logistic_model and logistic_gam is quite minor, and it is hard to tell comparing a screenshot and the printed page.
Note from the Author or Editor: The line needs to be changed in the asciidoc code. It is already corrected in the book's Github repository, however I overlooked changing the book text. That is now corrected too.

Gabriel Simmonds 
May 11, 2021 

Printed 
Page 240
2nd paragraph 
Since the R code yields TRUE for the prediction knn_pred == 'paid off', the sentence
"The KNN prediction is for the loan to default."
seems to be wrong and "default" should be replaced with "be paid off".
Note from the Author or Editor: This is correct. The sentence should read:
The KNN prediction is for the loan to be paid off.

Marcus Fraaß 
Dec 06, 2020 

Printed 
Page 257
Section 'Controlling tree complexity in _Python_' 
scikit-learn implements tree-complexity pruning like in R
In version 0.22, scikit-learn implemented tree-complexity pruning for decision trees.
https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py
https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning
Replace with:
===== Controlling tree complexity in _Python_
In +scikit-learn+'s decision tree implementation, the complexity parameter is called +ccp_alpha+. The default value is 0, which means that the tree is not pruned; increasing the value leads to smaller trees. You can use +GridSearchCV+ to find an optimal value.
There are a number of other model parameters that allow controlling the tree size. For example, we can vary +max_depth+ in the range 5 to 30 and +min_samples_split+ between 20 and 100. The +GridSearchCV+ method in +scikit-learn+ is a convenient way to combine the exhaustive search through all combinations with cross-validation. An optimal parameter set is then selected using the cross-validated model performance.

Peter Gedeck 
Sep 16, 2020 
Oct 02, 2020 
Printed 
Page 279
1st paragraph inside box 
The sentence starting with "The xgboost parameters..." is duplicated in the second paragraph.
Delete first paragraph.

Peter Gedeck 
Jun 06, 2020 
Jun 19, 2020 
Printed 
Page 302
last paragraph 
"Figure 7-7 shows the cumulative percent of variance explained for the default data for the number of clusters ranging from 2 to 15."
Just a few minor things here:
- "2 to 15" should be replaced by "2 to 14"
- "default data" should be replaced by "stock data"
- For consistency, the Python code on the following page might be adjusted, so that range(2, 15) is used instead of range(2, 14).
Note from the Author or Editor: All suggestions confirmed.
Book text changed.

Marcus Fraaß 
Dec 06, 2020 

Printed 
Page 306
Python code middle 
Due to a change in one of the Python packages, the code causes an error. The following code is working:
fig, ax = plt.subplots(figsize=(5, 5))
dendrogram(Z, labels=list(df.index), color_threshold=0)
plt.xticks(rotation=90)
ax.set_ylabel('distance')
Book text changed

Peter Gedeck 
Dec 07, 2020 

Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 323
First code snippet 
The output of the first line of x is incorrect and should be major_purchase instead of car.
> x
dti payment_inc_ratio home_ purpose_
1 1.00 2.39320 RENT major_purchase
2 5.55 4.57170 OWN small_business
3 18.08 9.71600 RENT other
4 10.08 12.21520 RENT debt_consolidation
5 7.06 3.90888 RENT other
gitlab code corrected

Peter Gedeck 
Feb 22, 2021 


Page 441
The Boosting Algorithm section, step 3 
The equation for alpha_m is surely wrong as in my kindle app it is shown as
alpha_m = (log 1 - e_m)/e_m
This can't be right as it would simplify to 1
According to the Wikipedia section on the AdaBoost example, I suppose the formula should be alpha_m = 1/2 * ln((1 - e_m)/e_m)
Which would make more sense
Note from the Author or Editor: I can confirm the issue, and it needs to be corrected as suggested.
Gitlab updated to latexmath:[$\alpha_m = \frac{1}{2} \log\frac{1 - e_m}{e_m}$]

Tapani Raunio 
Dec 06, 2021 

Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 2735/9783
5th paragraph 
In the FStatistics section :
"For the residuals, degrees of freedom is 20 (all observations can vary), and SS is the sum of squared difference between the individual observations and the treatment means. Mean squares (MS) is the sum of squares divided by the degrees of freedom."
Bruce, Peter; Bruce, Andrew; Gedeck, Peter. Practical Statistics for Data Scientists (Kindle Locations 2760-2762). O'Reilly Media. Kindle Edition.
When you run the ANOVA in R or Python, you get 16 for the df of the residuals, not 20!
Note from the Author or Editor: The text should read:
For the residuals, degrees of freedom is 16 (20 observations, 16 of which can vary after the grand mean and the treatment means are set), and SS is the sum of squared differences between the individual observations and the treatment means. Mean squares (MS) is the sum of squares divided by the degrees of freedom.
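The degrees-of-freedom arithmetic behind the corrected sentence, matching the ANOVA table for the four_sessions data (5 sessions per page):

```python
n_obs = 20     # 4 pages x 5 sessions each
n_groups = 4   # Page 1 .. Page 4

df_treatment = n_groups - 1     # 3, the 'Page' row of the ANOVA table
df_residual = n_obs - n_groups  # 16, the 'Residual' row
```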

Fabrice Kinnar 
May 06, 2020 
Jun 19, 2020 