Errata

Practical Statistics for Data Scientists

Errata for Practical Statistics for Data Scientists, Second Edition


The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Version Location Description Submitted by Date submitted
PDF Page Statistical Significance and p-Values
Code example

This section reuses the function perm_fun(), which was defined in the "Resampling" section. The original function calculates the difference in means between the resampled groups, but the example presented here needs the difference in proportions. A proposed fix is to define a new function that replaces this line of code:

return x.loc[list(idx_B)].mean() - x.loc[list(idx_A)].mean()

with this:

return x.loc[list(idx_B)].sum() / nB - x.loc[list(idx_A)].sum() / nA

(The code is in Python)
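
For illustration, a minimal sketch of the proportion version as a standalone function. The index-sampling lines are assumed to match perm_fun() from the book's "Resampling" section, and the name perm_fun_prop is hypothetical:

```python
import random
import pandas as pd

def perm_fun_prop(x, nA, nB):
    # Permutation statistic for a difference in proportions:
    # x is a pandas Series of 0/1 outcomes for both groups pooled together,
    # with a default 0..n-1 integer index.
    n = nA + nB
    idx_B = set(random.sample(range(n), nB))  # random resample of size nB
    idx_A = set(range(n)) - idx_B             # remaining n - nB observations
    return x.loc[list(idx_B)].sum() / nB - x.loc[list(idx_A)].sum() / nA
```

It would be used the same way as perm_fun(), e.g. `perm_diffs = [perm_fun_prop(conversion, nA, nB) for _ in range(1000)]`, where `conversion` is the pooled Series of 0/1 outcomes.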

Anonymous  May 02, 2023 
Printed Page A/B Testing
N/a

I had a question about A/B testing which I hope you can help with.

My situation is that I am building a propensity model to identify which of our company's customers are most likely to sign up for a service after being sent an email. The model produces a likelihood score between 0 and 1 for each member and ranks them from 1 to X, where X is the size of our customer base. In practice we would then select the top N from this list to email. So, if our customer base is 1 million members, we might select the 100,000 the model ranks most highly (i.e., those with the highest likelihood scores).

I want to compare how good the model is at identifying likely signups compared to the current business rules. However, I do not know how to conduct an A/B test in this scenario. Or indeed whether an A/B test is the most appropriate test here.

I understand the usual principle of randomly splitting the population into two groups and applying a different treatment to each, such as a webpage layout or a drug. However, in my case, the thing whose efficacy we are testing is the selection method itself (the email we send to each group would be the same). If we were to randomly split the population into two groups, there would likely be customers whom the model has ranked very highly but who end up in the business-rules group. That seems to me like an unfair test of the model, because we are not giving it the chance to prove itself: we are not letting it have all of its 'top picks'.

Do you have any advice on this?

Anonymous  Mar 19, 2024 
Printed Page 53
Sample Mean Versus Population Mean

The symbol used to represent the mean of population is missing.

Anonymous  Feb 07, 2022 
Printed Page 53
3rd paragraph

The symbol for the mean of a population is left out in "…whereas is used to represent the mean of a population."

John Taylor  Jul 30, 2022 
Printed, PDF Page 99
4th paragraph, which starts with "Page B has session times that are greater than those of page A by 35.67 seconds"

I think that in "The question is whether this difference is within the range of what random chance might produce, i.e., is statistically significant", the final phrase "i.e., is statistically significant" should be removed.

Mohammed Kamal Alsyd  Jul 21, 2023 
Printed Page 137
3rd paragraph

The earlier draws presented on both the previous page and page 137 suggest that however many ones Box A gets, Box B gets enough zeros for the two boxes to total 10,000. That may not have been intended as a strict rule, but if it was, then for consistency the 3rd paragraph of page 137, which raises the count of ones in Box A to 165 (1.65%), should give Box B the remainder of 10,000, i.e., 9,835 zeros (not 9,868).

Emir Bilim  Dec 24, 2021 
Printed Page 189
Figure 4-10

**Figure 4-10 — LOWESS target in partial-residual plot**

In the associated code in the GitHub repository for Figure 4-10, the LOWESS smoother is applied to the component (`results.ypartial`) rather than to the **partial residuals**. Replace:

```python
smoothed = sm.nonparametric.lowess(results.ypartial, results.feature, frac=1/3)
```

with:

```python
smoothed = sm.nonparametric.lowess(
    results.ypartial + results.residual,  # PR_i(x)
    results.feature,
    frac=1/3,
)
```

**Rationale.** The gray dashed line is intended to be a LOWESS smooth of the **partial residuals** to show the empirical relationship without imposing a polynomial. Smoothing the component (the black line) instead does not match the textbook definition of a partial-residual plot.
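
For illustration, a minimal plotting sketch, assuming (as in the repository code) that `results` is a DataFrame with `feature`, `ypartial` (the fitted component), and `residual` columns; the variable names below are otherwise illustrative:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Partial residuals: fitted component for the feature plus the model residuals.
partial_resid = results.ypartial + results.residual

# lowess() returns an (n, 2) array of (x, smoothed y) pairs, sorted by x.
smoothed = sm.nonparametric.lowess(partial_resid, results.feature, frac=1/3)

srt = results.sort_values('feature')  # sort so the component plots as a line
fig, ax = plt.subplots()
ax.scatter(results.feature, partial_resid, alpha=0.3)        # partial residuals
ax.plot(smoothed[:, 0], smoothed[:, 1], '--', color='gray')  # LOWESS of partial residuals
ax.plot(srt.feature, srt.ypartial, color='black')            # fitted (polynomial) component
ax.set_xlabel('feature')
ax.set_ylabel('partial residual')
plt.show()
```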

Nahid Ahmadvand  Sep 20, 2025 
PDF Page 200
Numeric Predictor Variables

Note from the Author or Editor:
------------------
Hello,

I looked at this more. In contrast to the R implementation, GaussianNB treats categorical features as numerical. This is not correct. We can see this if we build a model with categorical features only. With the MultinomialNB, we get this prediction:
array([[0.65369619, 0.34630381]])
while the GaussianNB results in this prediction:
array([[9.99994372e-01, 5.62841090e-06]])
so essentially [1, 0].

It will be necessary to build two separate models and then combine the predictions.
------------------
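
For illustration, a minimal sketch of the two-model combination suggested in the note above, assuming scikit-learn's GaussianNB and MultinomialNB (the helper name combined_nb_proba and the argument names are hypothetical). Under the naive independence assumption, the combined log-posterior is the sum of the two models' log-posteriors minus one copy of the class log-prior, renormalized:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

def combined_nb_proba(X_num, X_cat, y, X_num_new, X_cat_new):
    """Fit GaussianNB on the numeric columns and MultinomialNB on the
    0/1 dummy columns, then combine them into one Naive Bayes posterior."""
    gnb = GaussianNB().fit(X_num, y)
    mnb = MultinomialNB().fit(X_cat, y)

    # Each model's log-posterior already includes the class prior once,
    # so subtract one copy before renormalizing.
    log_joint = (gnb.predict_log_proba(X_num_new)
                 + mnb.predict_log_proba(X_cat_new)
                 - np.log(gnb.class_prior_))

    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    proba = np.exp(log_joint)
    return proba / proba.sum(axis=1, keepdims=True)
```
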
My Response:

Thanks for taking a closer look.
You’re absolutely right that GaussianNB treats all inputs as continuous. If we feed it one-hot encoded categoricals, those 0/1 dummies are modeled as Gaussians with very small within-class variances, which can drive extreme posteriors (the near [1, 0] you observed when using only categoricals). That behavior is consistent with a mismatch between the model family (Gaussian) and the data type (categorical).

That said, if most of the signal lies in the continuous features and the goal is ranking by P(Y=1), a GaussianNB-centric approach can work quite well even if the categorical piece is handled imperfectly. As the book notes, Naive Bayes often provides decent ranking but biased probability estimates; calibration (e.g., CalibratedClassifierCV with isotonic or sigmoid) helps if calibrated probabilities matter.
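
A minimal calibration sketch along those lines, assuming scikit-learn's CalibratedClassifierCV (X_num, y, and X_num_new are placeholder names for training features, labels, and new data):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

# Wrap the Naive Bayes model so its probabilities are recalibrated via
# cross-validation (isotonic regression here; method='sigmoid' also works).
calibrated_nb = CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=5)
calibrated_nb.fit(X_num, y)
calibrated_probs = calibrated_nb.predict_proba(X_num_new)
```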

Best,

Nahid Ahmadvand  Sep 23, 2025 
Printed Page 278
Inset on Ridge regression and the Lasso

The indices of X should be X_{p,i} (cf. p. 151), but as it currently stands we have X_i and X_p. Shouldn't `i` refer to the example index and `p` refer to the dimension index?
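
For reference, a sketch of how the penalized residual sum of squares would read with that convention, where i = 1, …, n indexes records and p = 1, …, P indexes predictors (the exact symbols in the printed inset may differ):

$$\sum_{i=1}^{n}\Bigl(Y_i - b_0 - \sum_{p=1}^{P} b_p X_{p,i}\Bigr)^2 + \lambda \sum_{p=1}^{P} \lvert b_p \rvert$$

with ridge regression replacing the penalty term $\lvert b_p \rvert$ by $b_p^2$.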

Amine Laghaout  Feb 15, 2023