Errata

Python Data Science Handbook

Errata for Python Data Science Handbook


The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Color Key: Serious technical mistake | Minor technical mistake | Language or formatting error | Typo | Question | Note | Update

Version Location Description Submitted by Date submitted
ePub Page pg 110
NaN: Missing numerical data

In[8]: vals2.sum(), vals2.min(), vals2.max()
Out[8]: (nan, nan, nan)

--------------------------------------------------------

Before the output, pandas 0.23.4 displays:

RuntimeWarning: invalid value encountered in reduce

Gregory Sherman  Dec 13, 2018 
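For readers hitting this warning, NumPy's NaN-aware aggregates give the intended results without it; a minimal sketch (array values assumed to mirror the chapter's example):

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])  # assumed to mirror the chapter's example

# Plain aggregates propagate NaN (and may trigger the RuntimeWarning above):
print(vals2.sum(), vals2.min(), vals2.max())                  # nan nan nan

# NaN-safe counterparts skip the missing value:
print(np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2))   # 8.0 1.0 4.0
```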
Chap 3
Example: Recipe Database

The section mentions downloading recipeitems-latest.json.gz; unfortunately, this file no longer contains data when downloaded from the S3 bucket, so the example code cannot be followed.

Anonymous  Dec 20, 2018 
Printed Page Page 84
The code example in "Binning Data"

This is an addition to my previous errata report for the same page number. In fact there are 19 bins, because the 20 points in the bins array define 19 bins. The idea behind this code example is really nice, but the example itself is messed up: some corner cases are not well thought through.

Peter Petrov  Mar 30, 2022 
Printed Page
Combining Datasets: Merge and Join

also found in https://jakevdp.github.io/PythonDataScienceHandbook/
This is actually more of a conceptual error.

Combining Datasets: Merge and Join:
In actuality, it is the Dataset on the Left Side in the pd.merge() function that generally drives the order of the key column in the Resultant Dataset. So it's more often the index on the right that gets discarded, not both.

i.e. df1 in df3 = pd.merge(df1, df2)
(here the index in df3 will be driven by df1).

The confusion arises because the key column (in this case 'employee') is
already sorted in alphabetical order in df1.

Try reversing the position of the datasets
i.e. set df3 = pd.merge(df2, df1)
(...and you will see the index of df3 driven by df2, not df1!)

This same issue often arises when coding in SQL.


Stephen Joseph  Jul 24, 2022 
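A quick check supports this report; pandas documents that an inner merge preserves the order of the left keys. A sketch with frames assumed to mirror the book's df1/df2:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# The left operand drives the row order of the result:
print(pd.merge(df1, df2)['employee'].tolist())   # ['Bob', 'Jake', 'Lisa', 'Sue']
print(pd.merge(df2, df1)['employee'].tolist())   # ['Lisa', 'Bob', 'Jake', 'Sue']
```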
PDF Page HOG in Action: A Simple Face Detector, page 508
In[4]

Statement for
images = [color.rgb2gray(getattr(data, name)())
for name in imgs_to_use]

is not working in
3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]
Windows-10-10.0.22621-SP0
scikit-image version: 0.19.2
numpy version: 1.21.5

Error messages are:
ValueError Traceback (most recent call last)
Input In [5], in <cell line: 6>()
1 from skimage import data, transform
3 imgs_to_use = ['camera', 'text', 'coins', 'moon',
4 'page', 'clock', 'immunohistochemistry',
5 'chelsea', 'coffee', 'hubble_deep_field']
----> 6 images = [color.rgb2gray(getattr(data, name)())
7 for name in imgs_to_use]

Input In [5], in <listcomp>(.0)
1 from skimage import data, transform
3 imgs_to_use = ['camera', 'text', 'coins', 'moon',
4 'page', 'clock', 'immunohistochemistry',
5 'chelsea', 'coffee', 'hubble_deep_field']
----> 6 images = [color.rgb2gray(getattr(data, name)())
7 for name in imgs_to_use]

File ~\anaconda3\lib\site-packages\skimage\_shared\utils.py:394, in channel_as_last_axis.__call__.<locals>.fixed_func(*args, **kwargs)
391 channel_axis = kwargs.get('channel_axis', None)
393 if channel_axis is None:
--> 394 return func(*args, **kwargs)
396 # TODO: convert scalars to a tuple in anticipation of eventually
397 # supporting a tuple of channel axes. Right now, only an
398 # integer or a single-element tuple is supported, though.
399 if np.isscalar(channel_axis):

File ~\anaconda3\lib\site-packages\skimage\color\colorconv.py:875, in rgb2gray(rgb, channel_axis)
834 @channel_as_last_axis(multichannel_output=False)
835 def rgb2gray(rgb, *, channel_axis=-1):
836 """Compute luminance of an RGB image.
837
838 Parameters
(...)
873 >>> img_gray = rgb2gray(img)
874 """
--> 875 rgb = _prepare_colorarray(rgb)
876 coeffs = np.array([0.2125, 0.7154, 0.0721], dtype=rgb.dtype)
877 return rgb @ coeffs

File ~\anaconda3\lib\site-packages\skimage\color\colorconv.py:140, in _prepare_colorarray(arr, force_copy, channel_axis)
137 if arr.shape[channel_axis] != 3:
138 msg = (f'the input array must have size 3 along `channel_axis`, '
139 f'got {arr.shape}')
--> 140 raise ValueError(msg)
142 float_dtype = _supported_float_type(arr.dtype)
143 if float_dtype == np.float32:

ValueError: the input array must have size 3 along `channel_axis`, got (512, 512)

Anonymous  Nov 08, 2022 
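A possible workaround (not from the book): newer scikit-image rejects 2-D input to rgb2gray, so the conversion should be applied only to 3-channel images. The helper below is hypothetical and reimplements the grayscale weighting in plain NumPy so the idea can be shown without scikit-image installed:

```python
import numpy as np

def to_gray(img):
    """Hypothetical helper: pass 2-D (already grayscale) images through,
    and reduce 3-channel RGB images using rgb2gray's luminance weights."""
    if img.ndim == 2:
        return img
    return img[..., :3] @ np.array([0.2125, 0.7154, 0.0721])

gray_in = to_gray(np.zeros((4, 4)))     # 2-D input is returned unchanged
rgb_in = to_gray(np.ones((4, 4, 3)))    # 3-channel input becomes 2-D
print(gray_in.shape, rgb_in.shape)      # (4, 4) (4, 4)
```

With scikit-image itself, the same guard would be `img if img.ndim == 2 else color.rgb2gray(img)` inside the list comprehension.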
Other Digital Version xxii
4th paragraph starting with Miniconda

I am not reporting a text error in the Python Data Science Handbook.
I cannot install Miniconda, although I followed the procedure outlined in the book on page xxii.
The procedure proposes:
mkdir -p ~/miniconda3
curl ttps://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

The terminal output after typing these lines looks as follows (abridged):

Last login: Sun Dec 10 00:12:10 on console
(base) petermatthiessen@Peters-MacBook-Pro ~ % ipython
zsh: command not found: ipython
(base) petermatthiessen@Peters-MacBook-Pro ~ % shasum -a 256 0c9d8ae96c110230a41c0441d5d486d47b627f594090de52989d01d04d18d8eee
shasum: 0c9d8ae96c110230a41c0441d5d486d47b627f594090de52989d01d04d18d8eee: No such file or directory
(base) petermatthiessen@Peters-MacBook-Pro ~ % mkdir -p ~/miniconda3
(base) petermatthiessen@Peters-MacBook-Pro ~ % curl ttps://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o ~/miniconda3/miniconda.sh
  (curl reports only 378 bytes received)
(base) petermatthiessen@Peters-MacBook-Pro ~ % bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
/Users/petermatthiessen/miniconda3/miniconda.sh: line 1: syntax error near unexpected token `newline'

Any thoughts how could I successfully install Miniconda?

Best regards,

Peter M.

Peter Matthiessen  Dec 13, 2023 
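One likely culprit, visible in the transcript above, is the mistyped URL scheme ("ttps://" instead of "https://"): curl then downloads only a tiny error page, which bash cannot execute as a script. A hedged sketch of the corrected commands, assuming that typo is the only problem:

```shell
# Assumption: the failure stems from the missing "h" in the URL scheme.
mkdir -p ~/miniconda3
curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh \
     -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -f ~/miniconda3/miniconda.sh
```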
Printed Page 11
2nd paragraph

Python interpreter, not IPython interpreter

Anonymous  Oct 26, 2019 
ePub Page 28
middle on the page

The command
"In[9]: %load_ext line_profiler"
seems not to work anymore. I installed line_profiler with pip as indicated on the page. However, since I installed Anaconda as indicated in the book, I also tried installing line_profiler with conda.
That worked!
Best regards
Tony

Tony Hürliamnn  Aug 04, 2019 
PDF Page 42
1st paragraph at the end

df['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate([sum, mean, std])
should be
df['NO. OBESE'].groupby(df['GRADE LEVEL']).aggregate([np.sum, np.mean, np.std])
and any reference to d should be replaced with df in this chapter

vOOda  Aug 06, 2021 
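A sketch of the proposed correction on hypothetical data (column values invented for illustration); note that recent pandas also accepts the string aliases 'sum', 'mean', 'std' in place of the NumPy functions:

```python
import pandas as pd

# Hypothetical data with the chapter's column names
df = pd.DataFrame({'GRADE LEVEL': ['ELEMENTARY', 'ELEMENTARY', 'MIDDLE', 'MIDDLE'],
                   'NO. OBESE': [10, 20, 30, 50]})

# The corrected call: both Series come from the same frame, df
result = df['NO. OBESE'].groupby(df['GRADE LEVEL']).aggregate(['sum', 'mean', 'std'])
print(result)
```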
Printed Page 43
Last paragraph

"Comma separated tuples of indices" should be "comma separated list of indices"

Massimiliano volpi  Apr 03, 2023 
Printed Page 46
Section Subarrays as no-copy views

In Python, slices are also no-copy views. Therefore, the sentence «This is one area in which NumPy array slicing differs from Python list slicing.» is wrong.

The only difference is when we are using advanced/fancy indexing. In this case, NumPy creates copies.



Ivo Tavares  Nov 18, 2020 
ePub Page 64
Table 2-3. Aggregation functions available in NumPy

np.mean np.nanmean Compute median of elements"

---
should be "Compute mean ..."

Gregory Sherman  Dec 12, 2018 
PDF Page 65
Figure 2-4

The equation of the third example shown in the figure should be "np.arange(3)[:, np.newaxis]+np.arange(3)", not "np.ones((3,1))+np.arange(3)".

Anonymous  Jan 26, 2017 
Printed Page 65
Figure 2-4

wrong: np.ones((3, 1)) + np.arange(3)

correct: np.arange(3).reshape((3, 1)) + np.arange(3)

correct: np.arange(3)[:, np.newaxis] + np.arange(3)

Anonymous  Nov 02, 2019 
Printed Page 65
Figure 2.4

import numpy as np

np.ones((3, 1)) + np.arange(3)

Outcome should be:
array([[1., 2., 3.],
[1., 2., 3.],
[1., 2., 3.]])

André Roukema  Aug 14, 2022 
PDF Page 65
Figure 2-4

In Figure 2-4 on page 65, the first box of the third example must contain
1. 1. 1.
1. 1. 1.
1. 1. 1.
and the last box must contain
1. 2. 3.
1. 2. 3.
1. 2. 3.

Anonymous  Jan 06, 2023 
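For reference, a runnable sketch of the two variants discussed in these reports:

```python
import numpy as np

# The correction proposed above: both operands broadcast
a = np.arange(3).reshape((3, 1)) + np.arange(3)
print(a)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

# The figure's printed expression instead yields three identical rows:
b = np.ones((3, 1)) + np.arange(3)
print(b)
# [[1. 2. 3.]
#  [1. 2. 3.]
#  [1. 2. 3.]]
```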
PDF Page 75
In code snippet

print("Rainy days with < 0.1 inches :", np.sum((inches > 0) &
(inches < 0.2)))

The screen prints "Rainy days with < 0.1 inches" while the program is calculating rainy days with < 0.2 inches.

Anonymous  Feb 01, 2017 
Printed Page 75
5th paragraph

wrong: np.sum((inches > 0) & (inches < 0.2))

correct: np.sum((inches > 0) & (inches < 0.1))

Anonymous  Nov 02, 2019 
Printed Page 75
2nd paragraph

where reads:
"...the equivalence of A AND B and NOT (A OR B)..."

should be read:
"...the equivalence of A AND B and NOT ((NOT A) OR (NOT B))..."

Pedro Sousa  Feb 04, 2020 
ePub Page 80
Example: Binning Data

For example, imagine we have 1,000 values
.
.
.
x = np.random.randn(100)
---
Both should be 100 or 1000

Gregory Sherman  Dec 12, 2018 
PDF Page 82
In[17]

In[17]: plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1], facecolor='none', s=200);

The code above won't show the large circles on the plot, as it is missing "edgecolor" . It is corrected by the following code:

In[17]: plt.scatter(selection[:, 0], selection[:, 1], facecolor='none', edgecolor='b', s=200);

utjo3105  Apr 08, 2017 
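A runnable sketch of the proposed fix (the data generated here is hypothetical; the book's X comes from an earlier cell), using a non-interactive backend:

```python
import matplotlib
matplotlib.use('Agg')            # headless backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)  # hypothetical stand-in for the book's data
X = rng.multivariate_normal([0, 0], [[1, 2], [2, 5]], 100)
selection = X[rng.choice(X.shape[0], 20, replace=False)]

plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
# Without edgecolor, the hollow markers are invisible; with it they show:
sc = plt.scatter(selection[:, 0], selection[:, 1],
                 facecolor='none', edgecolor='b', s=200)
```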
Printed Page 84
The code example in "Binning Data"

Actually
np.searchsorted(bins, x)

may return an array which contains an index equal to 20 (the number of bins). Then np.add.at(counts, i, 1) will raise an error.

This problem doesn't happen only because [-5,5] is a large interval and we're lucky that np.random.randn(100) didn't return any number bigger than 5. Of course the probability that np.random.randn(100) would return a number larger than 5 is small, but it's not zero.

How to prove there's a problem?

Say we try the same example using

bins = np.linspace(-1.5, 1.5, 20)

instead of

bins = np.linspace(-5, 5, 20)

Then the problem does manifest itself.

Peter Petrov  Mar 30, 2022 
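A sketch of one possible guard (the clipping is my addition, not the book's code): with the narrower bins suggested above, np.searchsorted can return len(counts), which np.add.at rejects unless the indices are clipped first:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
bins = np.linspace(-1.5, 1.5, 20)   # 20 edges; values beyond 1.5 overflow

counts = np.zeros_like(bins)
i = np.searchsorted(bins, x)        # may contain 20 == len(counts)
i = np.clip(i, 0, len(counts) - 1)  # guard against the out-of-range index
np.add.at(counts, i, 1)
print(int(counts.sum()))            # 100: every sample landed in a bin
```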
PDF Page 89
First line of code


In[14]: X = rand.rand(10, 2)

Should be

In[14]: X = np.random.random((10, 2))

Anonymous  Mar 10, 2019 
PDF Page 93
3rd paragraph

>Using the equivalence of A AND B and NOT (A OR B)

This is not equivalent.

A = true
B = true

A and B = true
NOT (A OR B) = NOT (true) = false

The example that follows is also incorrect since it is based off that.

The wanted equivalence is, I suppose,

A and B = NOT (NOT (A) OR NOT (B)) by the Morgan's law

Anonymous  May 27, 2019 
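A truth-table check confirms this report; the corrected identity is De Morgan's law:

```python
# A AND B is equivalent to NOT ((NOT A) OR (NOT B)), not to NOT (A OR B):
for A in (False, True):
    for B in (False, True):
        assert (A and B) == (not ((not A) or (not B)))

# Counterexample to the book's wording, with A = B = True:
print((True and True), (not (True or True)))   # True False
```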
ePub Page 94
Series as specialized dictionary


In[11]: population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population

Out[11]: California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64
----
Using version 0.23.4, the keys are output in the same order as in the dictionary
(not alphabetized), so "In[13]: population['California':'Illinois']" retrieves the entire series rather than the first three elements (ordered alphabetically by keys : 'California', 'Florida', 'Illinois')

The same issue is seen in later examples

Gregory Sherman  Dec 12, 2018 
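A sketch of the behavior described (pandas 0.23+ on Python 3.6+ preserves dictionary insertion order in the Series index):

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

# The index keeps insertion order, so label slices follow it too:
print(population.index.tolist())
print(population['California':'New York'].index.tolist())   # three entries
```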
Printed Page 94
top of page

Wouldn't it be better to say

# Get first element of data
# Get first tuple of data

Because data is a 1D-array and has no rows like a 2D-array.

The term "row" is misleading here because it implies that we have to do with a 2D data structure, which is not the case in my opinion.

Andrea P. Mathis  Nov 05, 2019 
PDF Page 95
Second sentence, first paragraph

I think it is more clear to state that characters "<" and ">" are used to specify the ordering convention for significant "bytes" instead of "bits".

Anonymous  Feb 22, 2017 
ePub Page 103
DataFrame as two-dimensional array

The ix indexer allows a hybrid of these two approaches:
In[30]: data.ix[:3, :'pop']
--------------------------------------------------------
In pandas 0.23.4, this results in:

"... DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing"

plus the California, Texas, and New York rows

Gregory Sherman  Dec 13, 2018 
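For reference, .ix was removed entirely in pandas 1.0; the hybrid selection can be reproduced by chaining the two indexers (frame values assumed from the chapter):

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

# data.ix[:3, :'pop'] mixed positional rows with label columns;
# chaining iloc (positional) and loc (label) achieves the same result:
subset = data.iloc[:3].loc[:, :'pop']
print(subset.index.tolist())   # ['California', 'Texas', 'New York']
```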
PDF Page 104
1st paragraph

The related subsection title at the bottom of page 103 is "DataFrame as specialized dictionary". However, the explanation in the first paragraph of page 104 contains the following:
"Because of this, it is probably better to think about DataFrames as generalized
dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful."

The subsection title and its corresponding explanation are somewhat conflicting: specialized vs. generalized.

One of them should be corrected for consistency.

Hongsoog Kim  Aug 21, 2017 
PDF, ePub Page 111
2nd Code Block

Use of 'axis' instead of 'axes'.

David  May 22, 2017 
ePub Page 117
Explicit MultiIndex constructors

Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and labels (a list of lists that reference these labels):
-----------------------------------------------------------------------------------------------------------------
It seems that another word should be in place of the final one ("labels")

Gregory Sherman  Dec 19, 2018 
ePub Page 126
pg 126

In[8]: df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
print(df3); print(df4); print(pd.concat([df3, df4], axis='col'))

---------------------------------------------------------------------------------

pd.concat() fails ; needs to be "axis='columns'"

Gregory Sherman  Dec 19, 2018 
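A minimal reproduction of the fix (frames built inline rather than with the book's make_df helper):

```python
import pandas as pd

df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df4 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']})

# axis='col' raises a ValueError; the accepted spellings are 1 or 'columns':
result = pd.concat([df3, df4], axis='columns')
print(result.columns.tolist())   # ['A', 'B', 'C', 'D']
```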
ePub Page 128
Concatenation with joins

In[13]: df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); print(pd.concat([df5, df6])
---------------------------------------------------------------------
missing closing parenthesis at end of last print() call

Before third print() output:
FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.

Gregory Sherman  Dec 19, 2018 
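A sketch of the warning's suggested fix (frames built inline rather than with make_df); passing sort explicitly silences the FutureWarning:

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# sort=False keeps the original column order and silences the warning:
result = pd.concat([df5, df6], sort=False)
print(result.columns.tolist())   # ['A', 'B', 'C', 'D']
```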
130
Final paragraph on the page

The sentence reads "Seeing this, you might wonder why would we would bother with hierarchical indexing at all."

I believe that it should be "you might wonder why we would bother" rather than "you might wonder why would we *would* bother."

sterlinm  Jun 01, 2017 
Printed Page 151
Code below 2nd paragraph

code in the book is different than that on the website
on page 151, the 6th line of code including the commented line from the book (which will not work):

<p style= 'font-family:"Courier New", Courier, monospace'>{0}{1}
"""
code from the website (which will work):

<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""

David Walden  Jan 18, 2024 
ePub Page 159
In[12]

monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
--------------------------------------------------------
The case of the vowels in the regular expression makes no difference.
If the first and second character classes are switched, the results are identical.
So, it appears that matching is case-insensitive by default (unlike Python's re) - can it be made case-sensitive?

Gregory Sherman  Dec 25, 2018 
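A quick check with plain re (which the pandas .str methods wrap) suggests matching is in fact case-sensitive by default; case-insensitivity has to be requested explicitly with re.IGNORECASE:

```python
import re

# Character classes are case-sensitive by default:
print(re.findall(r'[aeiou]', 'AEIOU'))                  # []

# Opting in to case-insensitive matching:
print(re.findall(r'[aeiou]', 'AEIOU', re.IGNORECASE))   # ['A', 'E', 'I', 'O', 'U']
```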
Printed Page 164
2nd paragraph

planets.groupby('method')['year'].describe().unstack()

I think the unstack method can be omitted to get the DataFrame.

Applying the unstack method yields a (multi-indexed) Series, in my opinion.

Andrea P. Mathis  Nov 17, 2019 
Printed Page 164-165
last paragraph of 164

In the "Dispatch methods" section, in the code "planets.groupby('method')['year'].describe().unstack()", calling the unstack method (using parentheses) returns a 'pandas.core.series.Series', whereas it should return "<bound method DataFrame.unstack of method". Therefore, the parentheses should be omitted from unstack to get the desired result.

Correct code: planets.groupby('method')['year'].describe().unstack

Minhaz Uddin  Jul 31, 2022 
Printed Page 164

'Iteration over groups' Unclear to me. Missing details.

Holger Eich  Mar 31, 2024 
Printed Page 170
2nd paragaph

In the section "Pivot Tables", the sentence "We have seen how the GroupBy abstraction let us..." should be "We have seen how the GroupBy abstraction works; let us..."

Minhaz Uddin  Jul 31, 2022 
Printed Page 197
In[25], In[26], In[28]

This:
In[25]: from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016', data_source='google')
goog.head()

In[26]: goog = goog['Close']

In[27]: %matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()

In[28]: goog.plot( )

Should be:
In[25]: from pandas_datareader import data
aapl = data.DataReader('AAPL', start='2004', end='2016', data_source='yahoo')
aapl.head( )

In[26]: aapl = aapl['Close']

In[27]: %matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()

In[28]: aapl.plot( )

---
Google Finance has discontinued its API, so this feature has been deprecated in the Pandas DataReader.
Therefore, financial data should be imported from Yahoo Finance instead.

Dyanne Ahn  Aug 14, 2020 
Printed Page 198-199
In[29], In[30]

This:
In[29]: goog.plot(alpha=0.5, style='-')
goog.resample('BA').mean().plot(style=':')
goog.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
loc='upper left');

In[30]: fig, ax = plt.subplots(2, sharex=True)
data = goog.iloc[:10]

data.asfreq('D').plot(ax=ax[0], marker='o')

data.asfreq('D', method='bfill').plot(ax=ax[1], style='-o')
data.asfreq('D', method='ffill').plot(ax=ax[1], style='--o')
ax[1].legend(["back-fill", "forward-fill"]);

Should be:
In[29]: aapl.plot(alpha=0.5, style='-')
aapl.resample('BA').mean().plot(style=':')
aapl.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
loc='upper left');

In[30]: fig, ax = plt.subplots(2, sharex=True)
data = aapl.iloc[:10]

data.asfreq('D').plot(ax=ax[0], marker='o')

data.asfreq('D', method='bfill').plot(ax=ax[1], style='-o')
data.asfreq('D', method='ffill').plot(ax=ax[1], style='--o')
ax[1].legend(["back-fill", "forward-fill"]);

---
Name 'goog' is not defined because Google Finance has discontinued its API, so this feature has been deprecated in the Pandas DataReader.
Therefore, financial data should be imported from Yahoo Finance instead.

Dyanne Ahn  Aug 14, 2020 
Printed Page 199-200
In[31], In[32]

This:
In[31]: fig, ax = plt.subplots(3, sharey=True)

goog = goog.asfreq('D', method='pad')

goog.plot(ax=ax[0])
goog.shift(900).plot(ax=ax[1])
goog.tshift(900).plot(ax=ax[2])

local_max = pd.to_datetime('2007-11-05')
offset = pd.Timedelta(900, 'D')

ax[0].legend(['input'], loc=2)
ax[0].get_xticklabels()[2].set(weight='heavy', color='red')
ax[0].axvline(local_max, alpha=0.3, color='red')

ax[1].legend(['shift(900)'], loc=2)
ax[1].get_xticklabels()[2].set(weight='heavy', color='red')
ax[1].axvline(local_max + offset, alpha=0.3, color='red')

ax[2].legend(['tshift(900)'], loc=2)
ax[2].get_xticklabels()[1].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red');

In[32]: ROI = 100 * (goog.tshift(-365) / goog - 1)
ROI.plot()
plt.ylabel('% Return on Investment');

Should be:
In[31]: fig, ax = plt.subplots(3, sharey=True)

aapl = aapl.asfreq('D', method='pad')

aapl.plot(ax=ax[0])
aapl.shift(900).plot(ax=ax[1])
aapl.tshift(900).plot(ax=ax[2])

local_max = pd.to_datetime('2007-11-05')
offset = pd.Timedelta(900, 'D')

ax[0].legend(['input'], loc=2)
ax[0].get_xticklabels()[2].set(weight='heavy', color='red')
ax[0].axvline(local_max, alpha=0.3, color='red')

ax[1].legend(['shift(900)'], loc=2)
ax[1].get_xticklabels()[2].set(weight='heavy', color='red')
ax[1].axvline(local_max + offset, alpha=0.3, color='red')

ax[2].legend(['tshift(900)'], loc=2)
ax[2].get_xticklabels()[1].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red');

In[32]: ROI = 100 * (aapl.tshift(-365) / aapl - 1)
ROI.plot()
plt.ylabel('% Return on Investment');

---
Name 'goog' is not defined because Google Finance has discontinued its API, so this feature has been deprecated in the Pandas DataReader.
Therefore, financial data should be imported from Yahoo Finance instead.

Dyanne Ahn  Aug 14, 2020 
PDF Page 201
2nd paragraph

In the following paragraph of Rolling Window
"Rolling statistics are a third type of time series–specific operation implemented by
Pandas. These can be accomplished via the rolling() attribute of Series and Data
Frame objects, which returns a view similar to what we saw with the groupby operation (see “Aggregation and Grouping” on page 158).This rolling view makes available a number of aggregation operations by default."

'rolling() attribute' should be corrected to 'rolling() method'

source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

Hongsoog Kim  Aug 31, 2017 
Printed Page 201
In[33]

This:
In[33]: rolling = goog.rolling(365, center=True)

data = pd.DataFrame({'input': goog,
'one-year rolling_mean': rolling.mean(),
'one-year rolling_std': rolling.std()})
ax = data.plot(style=['-', '--', ':'])
ax.lines[0].set_alpha(0.3)

Should be:
In[33]: rolling = aapl.rolling(365, center=True)

data = pd.DataFrame({'input': aapl,
'one-year rolling_mean': rolling.mean(),
'one-year rolling_std': rolling.std()})
ax = data.plot(style=['-', '--', ':'])
ax.lines[0].set_alpha(0.3)

---
Name 'goog' is not defined because Google Finance has discontinued its API, so this feature has been deprecated in the Pandas DataReader.
Therefore, financial data should be imported from Yahoo Finance instead.

Dyanne Ahn  Aug 14, 2020 
Printed Page 205
First Sentence / In[41]

The first sentence references pd.rolling_mean(), but in In[41] line 2 sum() is called instead. In[41] line 3 then sets plt.ylabel to 'mean hourly count', which seems to be in accordance with the first sentence of the text but in opposition to the given code. This entire paragraph seems mixed up. Clarification?

Anonymous  Aug 21, 2017 
PDF Page 245
First code-block.

The code listing:
print(df5); print(df6); print(pd.concat([df5, df6])
It is missing a closing parenthesis ')' at the end, to close the print statement.

Nikolaj Gilstrøm  Oct 22, 2020 
Printed Page 263
line 9

plt.axes or fig.add_axes() numbers represent: [left, bottom, width, height] but in the book it's written as [bottom, left, width, height].

The reference here: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.figure.Figure.html

Anonymous  Dec 19, 2019 
Printed Page 283
In[3] 2nd line of code

ax = plt.axes(axisbg='#E6E6E6')

should be

ax = plt.axes(facecolor='#E6E6E6')

Documented on stackoverflow (not by me) at https://stackoverflow.com/questions/50504053/attributeerror-unknown-property-axisbg

Mark Pedigo  Jan 06, 2019 
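A runnable sketch of the fix on a non-interactive backend (axisbg was deprecated in Matplotlib 2.0 and later removed in favor of facecolor):

```python
import matplotlib
matplotlib.use('Agg')   # headless backend for this sketch
import matplotlib.pyplot as plt

# facecolor replaces the removed axisbg keyword:
ax = plt.axes(facecolor='#E6E6E6')
print(ax.get_facecolor())   # RGBA tuple for the light gray background
```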
Printed Page 311
Last paragraph

Seaborn provides an API on top of Matplotlib that offers sane choices.......


In this paragraph, instead of 'sane', it should be 'same'.

Minhaz Uddin  Aug 14, 2022 
PDF Page 344
last paragraph

In the last paragraph of page 344:
"Often one point of confusion is how the target array differs from the other features
columns. The distinguishing feature of the target array is that it is usually the quantity we want to predict from the data: in statistical terms, it is the dependent variable. For example, in the preceding data we may wish to construct a model that can predict the species of flower based on the other measurements; in this case, the species column would be considered the feature."

Since the species column contains the value to predict using model, it should be interpreted as target array rather than feature. The last sentence should be corrected as follows:
"For example, in the preceding data we may wish to construct a model that can predict the species of flower based on the other measurements; in this case, the species column would be considered the target array."

Hongsoog Kim  Sep 11, 2017 
PDF Page 350
Last code snippet

The last piece of code snippet on book is
----------------------
In[14]: plt.scatter(x, y)
plt.plot(xfit, yfit);
----------------------
However, xfit should be Xfit; that is, the X should be uppercase, or the example code will throw an exception.
BTW, this book is great; looking forward to the 2nd edition!

Timothy Liu  Jan 14, 2017 
PDF Page 351
1 paragraph of code


In[15]: from sklearn.cross_validation import train_test_split

should be replaced with

In[15]: from sklearn.model_selection import train_test_split

since the train_test_split function is part of the model_selection module, not cross_validation!

Ahac  Jun 06, 2020 
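A sketch confirming the corrected import (sklearn.cross_validation was removed in scikit-learn 0.20; the data here is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split   # corrected import path

X = np.arange(20).reshape(10, 2)   # hypothetical feature matrix
y = np.arange(10)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0,
                                                test_size=0.3)
print(Xtrain.shape, Xtest.shape)   # (7, 2) (3, 2)
```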
PDF Page 358
Code near top of page

In the code:

ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')

suggest changing 'digits.images[i]' to 'Xtest[i].reshape(8,8)' so that images associated with ytest are displayed.

The writing and explanations in this book are clear and of the highest standard. Thank you!

Michael Laszlo  Mar 31, 2017 
PDF Page 363
1st paragraph

In the 1st paragraph of page 363,
"Because we have 150 samples, the leave-one-out cross-validation yields scores for 150 trials, and the score indicates either successful (1.0) or unsuccessful (0.0) prediction. Taking the mean of these gives an estimate of the error rate:"

The mean of scores should be interpreted as 'estimate of the prediction accuracy' and corrected as follows:
"Because we have 150 samples, the leave-one-out cross-validation yields scores for 150 trials, and the score indicates either successful (1.0) or unsuccessful (0.0) prediction. Taking the mean of these gives an estimate of the prediction accuracy:

Hongsoog Kim  Sep 12, 2017 
PDF Page 374
Input 18

Should read

`from sklearn.model_selection import GridSearchCV`

and not

`from sklearn.grid_search import GridSearchCV`

David Lindelof  Oct 25, 2019 
PDF Page 383
2nd equation

The denominator of the LHS of the equation should be P (L2 | features)

Michele Floris  Jan 20, 2017 
Printed Page 383
Second formula bellow the second paragraph.

The equation for a 2-label Bayes classifier on page 383 has an incorrect subindex in the denominator of the left-hand term: it is 1 while it should be 2.

In latex:
The original equation:
\frac{P(L_{1}\mid features)}{P(L_{1}\mid features)} = \frac{P(features \mid L_{1}) \, P(L_{1})}{P(features \mid L_{2}) \, P(L_{2})}

The correct one:
\frac{P(L_{1}\mid features)}{P(L_{2}\mid features)} = \frac{P(features \mid L_{1}) \, P(L_{1})}{P(features \mid L_{2}) \, P(L_{2})}

Pablo Lorenzatto  Mar 28, 2017 
PDF Page 410
Code above Fig. 5-57

The `N` parameter of `plot_svm` is not used; the first lines of the function should be changed to:
```
def plot_svm(N=10, ax=None):
    X, y = make_blobs(n_samples=N, centers=2,
                      random_state=0, cluster_std=0.60)
```

Michele Floris  Apr 03, 2017 
PDF Page 426
Code snippet

bag.fit(X,y) is not needed (model.fit is already included in visualize_classifier defined on page 423)

Michele Floris  Apr 13, 2017