Errata for Hands-On Machine Learning with Scikit-Learn and TensorFlow
The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint the date of the correction will be displayed in the column titled "Date Corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
Version, Location, Description, Submitted By, Date Submitted, Date Corrected
PDF, ePub, Mobi, Safari Books Online 
In 'Execution Phase' 
In Chapter Ten, 'Execution Phase'
Text currently says
"Next, at the end of each epoch, the code evaluates the model on the last minibatch and on the full training set, and it prints out the result."
I believe it should read
"Next, at the end of each epoch, the code evaluates the model on the last minibatch and on the full test set, and it prints out the result."
Test not training.
Note from the Author or Editor: Indeed, it should be "test" instead of "training", good catch.

Kendra Vant 
Apr 08, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page Online
Chapter 11, Reusing Pretrained Layers 
Just a word order switch typo in the Note:
"More generally, transfer learning will work only well if the inputs have similar low-level features."
should rather be
"More generally, transfer learning will only work well if the inputs have similar low-level features."
'work' and 'only' reversed order.
Note from the Author or Editor: Thank you, indeed this is my French brain interfering with my writing! ;)

Kendra Vant 
Apr 08, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Chapter 16, Policy Gradients section, below "On to the execution phase!" 40% down in Online page 
The code for the execution phase is missing a parameter in the 'discount_and_normalize_rewards' function: this function calls the discount_rewards function and assigns the return value to 'all_discounted_rewards' but only passes one parameter, whereas discount_rewards expects two parameters. The GitHub code for discount_and_normalize_rewards is correct; the online/Safari book code is incorrect.
def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]  # <<<ISSUE IS HERE, MISSING PARAM>>>
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean)/reward_std for discounted_rewards in all_discounted_rewards]
Note from the Author or Editor: Good catch, thank you! I tested every single code example before adding it to the book, but it seems that I made a modification to the notebook and forgot to update the book. I fixed the error, it will be reflected in the digital versions within the next few weeks.
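To illustrate the two-argument call discussed in this entry, here is a minimal pure-Python sketch. The body of discount_rewards is not quoted in the erratum, so the version below is only an assumption about what such a helper might look like (it accumulates rewards backwards, applying the discount rate at each step). The normalization mirrors the snippet quoted above, with the standard library standing in for NumPy, and the reward values are made up.

```python
from statistics import mean, pstdev

def discount_rewards(rewards, discount_rate):
    # Assumed helper (not quoted in the erratum): walk backwards so each step
    # receives its own reward plus the discounted sum of all later rewards.
    discounted = [0.0] * len(rewards)
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    # discount_rate is passed through to discount_rewards (the missing parameter).
    all_discounted = [discount_rewards(r, discount_rate) for r in all_rewards]
    flat = [x for episode in all_discounted for x in episode]
    reward_mean = mean(flat)
    reward_std = pstdev(flat)  # population std, like NumPy's default .std()
    return [[(x - reward_mean) / reward_std for x in episode]
            for episode in all_discounted]

print(discount_rewards([10, 0, -50], discount_rate=0.8))
print(discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8))
```

Calling discount_rewards with a single argument in this sketch raises a TypeError, which is the kind of failure the erratum describes.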

Steve Dotson 
Apr 16, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page ch 14
Equation 14-4. GRU computations 
Something is wrong with the last equation, h(t) = (1 - z(t)) ¤ tanh (WxgT * h(t-1) + z(t)¤gt)
I think it should be: h(t) = (1 - z(t)) ¤ h(t-1) + z(t) ¤ g(t)
Note from the Author or Editor: Yes indeed, you are absolutely right, I don't know what I was thinking when I wrote this equation, I apologize. The correct equation to compute h(t) is, as you wrote:
h(t) = (1 - z(t)) ⊗ h(t-1) + z(t) ⊗ g(t)
The equation in latexmath format is:
\mathbf{h}_{(t)}&=(1-\mathbf{z}_{(t)}) \otimes \mathbf{h}_{(t-1)} + \mathbf{z}_{(t)} \otimes \mathbf{g}_{(t)}
Thanks again for contributing to improving this book, I hope you are enjoying it.
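The corrected update is an element-wise blend of the previous state and the candidate state. The sketch below only illustrates that last equation: z, h_prev and g are plain Python lists with made-up values, whereas in a real GRU cell the gate z(t) and candidate g(t) come from the earlier equations, which are not reproduced here.

```python
def gru_state_update(z, h_prev, g):
    # Element-wise: h(t) = (1 - z(t)) * h(t-1) + z(t) * g(t)
    return [(1.0 - z_i) * h_i + z_i * g_i for z_i, h_i, g_i in zip(z, h_prev, g)]

h_prev = [0.5, -1.0, 2.0]   # hypothetical previous state
g = [1.0, 1.0, -2.0]        # hypothetical candidate state

print(gru_state_update([0.0, 0.0, 0.0], h_prev, g))  # gate at 0: keeps h_prev
print(gru_state_update([1.0, 1.0, 1.0], h_prev, g))  # gate at 1: takes g
print(gru_state_update([0.5, 0.5, 0.5], h_prev, g))  # gate at 0.5: average of the two
```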

Per Thorell 
May 06, 2017 
Jun 09, 2017 
Safari Books Online 
Ch. 2, Select a Performance Measure, 3rd bullet point 
Regarding the L0 norm, the text says "L0 just gives the cardinality of the vector (i.e., the number of elements)...". It may be clearer if the text says: "the number of nonzero elements"
Note from the Author or Editor: Good point, thanks. I actually fixed this a few weeks ago. The online version and the latest printed copies should be fixed by now. The sentence is now:
"ℓ0 just gives the number of nonzero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector."

Eric T. 
Jun 09, 2017 
Aug 18, 2017 
Other Digital Version 
kindle 1127
Above Figure 2-13 
On Figure 2-13 the axis values and legend are not shown. This is due to a bug in Matplotlib inline on "scatter" plots: the attributes are hidden. Below you can find a temporary solution using sharex=False to restore visibility. The comment line cites the source for the solution.
housing2.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, s=housing2["population"]/100,
              label="population", c="median_house_value", cmap=plt.get_cmap("jet"), sharex=False)
# sharex=False fixes a bug. Temporary solution. See: https://github.com/pandas-dev/pandas/issues/10611
Note from the Author or Editor: I just love it when people come with both the problem and the solution! :)
I just tried your bug fix and it works fine, thanks a lot.

Wilmer Arellano 
Jun 05, 2017 
Jun 09, 2017 
Other Digital Version 
Kindle Loc 1141
After Figure 2-13 
Values obtained from running the code are different from what is printed in the book:
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64
A previous table seems to indicate that the csv file is fine:
housing["income_cat"].value_counts() / len(housing)
3.0 0.350581
2.0 0.318847
4.0 0.176308
5.0 0.114438
1.0 0.039826
Name: income_cat, dtype: float64
Here running the code produces same results as the book.
Why the difference on the first table?
Thank you.
Note from the Author or Editor: Thanks for your feedback.
I am adding the following note to the Jupyter notebooks:
"You may find little differences between the code outputs in the book and in the Jupyter notebooks: these slight differences are mostly due to the random nature of many training algorithms: although I have tried to make these notebooks' outputs as constant as possible, it is impossible to guarantee that they will produce the exact same output on every platform. Also, some data structures (such as dictionaries) do not preserve the item order. Finally, I fixed a few minor bugs (I am currently adding notes next to the concerned cells) which lead to slightly different results, without changing the ideas presented in the book."
In this particular case, I think the difference is probably due to the fact that the training set was initially sampled differently (in fact, it had one more item). When I tweaked the notebook and ran it again, I updated the code and code outputs in the book, but I forgot to update a few outputs (probably because they look so similar). You may find a few other differences, but as I mentioned they really don't change the ideas discussed in the book. I recently fixed them, so the online and future paper reprints will be more consistent with the notebooks.
Thanks again!

Wilmer Arellano 
Jun 07, 2017 
Aug 18, 2017 
Safari Books Online 
Batch Gradient Descent
Bottom Box 
I find the discussion of convergence rate for Batch Gradient Descent a bit hard to follow. First of all, the relation between epsilon and convergence rate is never formally defined, so the simple math example you give does not immediately follow for me. I think the discussion would make more sense if it were written that the number of needed iterations is of order O(1/epsilon), which I'm pretty sure is correct.
Note from the Author or Editor: Good point, thanks for your feedback. This paragraph does need some clarification. I meant to say that the distance between the current point and the optimal point shrinks at the same rate as 1/iteration. However this depends on the cost function's shape (convergence is much faster if the cost function is strongly convex). I propose to replace the paragraph with this one:
When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), Batch Gradient Descent with a fixed learning rate will eventually converge to the optimal solution, but you may have to wait a while: it can take O(1/epsilon) iterations to reach the optimum with a tolerance of epsilon, depending on the shape of the cost function. If you divide the tolerance by 10 to have a more precise solution, then the algorithm will have to run about 10 times longer.
If you are interested, this post by RadhaKrishna Ganti goes into much more depth:
https://rkganti.wordpress.com/2015/08/21/convergencerateofgradientdescentalgorithm/
Or this post by Sebastien Bubeck:
https://blogs.princeton.edu/imabandit/2013/04/04/orf523strongconvexity/
Or there is the "Convex Optimization" book by Stephen Boyd and Lieven Vandenberghe:
https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

Anonymous 
Jul 09, 2017 
Aug 18, 2017 
Printed 
Page xiii
2nd paragraph 
The word "and" is misplaced in the second paragraph, first sentence. It currently reads:
"...and recommending videos, beating the world champion at the game of Go."
It should read:
"...recommending videos, and beating the world champion at the game of Go."
Note from the Author or Editor: Indeed, good catch, thanks! Fixed.

Daniel J Barrett 
Jan 30, 2018 
Oct 12, 2018 
Printed 
Cover spine 
The title of the book is written on the spine as follows:
"Hands-On Machine Learning
with Scikitt-Learn & TensorFlow"
Scikit is mistakenly spelled with an extra "t".

Jeremy Joseph 
Feb 22, 2018 
Oct 12, 2018 
Safari Books Online 
chapter 5
sentence immediately before "Online SVMs" heading 
From book: "it’s an unfortunate side effects of the kernel trick."
Problem: "an" requires a singular noun, but "effects" is a plural noun.
Note from the Author or Editor: Indeed, thanks! I just fixed the mistake (an unfortunate side effects=>an unfortunate side effect).

Anonymous 
Mar 21, 2018 
Oct 12, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 19
Equation 1-1 
"life_satisfaction" has been formatted like a formula definition, with extra space around each "f".
Note from the Author or Editor: Thanks. I updated the latex code:
Before:
life\_satisfaction = \theta_0 + \theta_1 \times GDP\_per\_capita
After:
\text{life_satisfaction} = \theta_0 + \theta_1 \times \text{GDP_per_capita}

anthonyelizondo 
Apr 26, 2017 
Jun 09, 2017 
PDF 
Page 19
Equation 1-1 
1st Edition 2nd Release,
\theta_0 is missing in Equation 1-1. :)

Haesun Park 
Jun 11, 2017 
Jun 12, 2017 
PDF 
Page 26
Last word in first line 
"loosing" (hyphenated across lines) should be losing.
Note from the Author or Editor: Thanks, this is fixed now.
Aurélien

C.R. Myers 
Sep 02, 2016 
Mar 10, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 30
Last paragraph 
In "No Free Lunch Theorem" side note,
"http://goo.gl/3zaHIZ" is broken.
I found another one,
https://www.researchgate.net/profile/David_Wolpert/publication/2755783_The_Lack_of_A_Priori_Distinctions_Between_Learning_Algorithms/links/54242c890cf238c6ea6e973c/TheLackofAPrioriDistinctionsBetweenLearningAlgorithms.pdf
Note from the Author or Editor: Thanks, indeed the page seems to have been removed. Perhaps linking to a Google Scholar search will be more stable: https://goo.gl/dzp946

Haesun Park 
May 12, 2017 
Jun 09, 2017 
Printed 
Page 30
Footnote 
The reference for the "no free lunch" paper has the author name spelled as Wolperts but should be Wolpert (no final "s").
Note from the Author or Editor: Good catch, thanks! Error fixed.

Marco Cova 
Apr 07, 2018 
Oct 12, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 37
3rd paragraph (1st paragraph under "Select a Performance Measure") 
It is stated that "It [RMSE] measures the standard deviation of the errors the system makes in its predictions". This is incorrect; the standard deviation is the square root of the variance (as noted by the author in a footnote), and though similar to RMSE, is not quite the same as it. Note that standard deviation is an "averaged" measure of deviation from the mean of the values, while RMSE is an "averaged" measure of deviation between the values themselves. Standard deviation measures the "spread" of the data from the mean, while RMSE measures the "distance" between the values.
This makes the subsequent statement "For example, an RMSE equal to...of the actual value." incorrect as well.
Please view the answer here for a very clear explanation of this:
https://stats.stackexchange.com/questions/242787/howtointerpretrootmeansquarederrorrmsevsstandarddeviation
Note from the Author or Editor: You are absolutely correct, thanks for your feedback. I am currently working on the French translation of this book, and I actually stumbled across this sentence just last week: my heart almost stopped! It was a great disappointment to find such an error despite all my efforts to check and double-check everything. So far, the other errors had mostly been typos, but this one is serious. :(
I rewrote the paragraph like so: "It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. Equation 2-1 shows the mathematical formula to compute the RMSE."
The digital versions will be updated within a few weeks.
My sincere apologies,
Aurélien
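As a small numeric illustration of the distinction raised in this entry, the sketch below computes both quantities on made-up prediction errors. It uses only the standard library; the function names and error values are hypothetical. The two measures coincide only when the errors average to zero; in general RMSE² = std² + (mean error)².

```python
import math

def rmse(errors):
    # Root of the mean squared error, measured from zero.
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def std_of_errors(errors):
    # Spread of the errors around their own mean.
    mean_error = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean_error) ** 2 for e in errors) / len(errors))

biased_errors = [2.0, 3.0, 4.0]      # hypothetical errors, all positive
centered_errors = [-1.0, 0.0, 1.0]   # same spread, zero mean

print(rmse(biased_errors), std_of_errors(biased_errors))      # RMSE is larger
print(rmse(centered_errors), std_of_errors(centered_errors))  # the two agree
```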

Jobin Idiculla 
May 20, 2017 
Jun 09, 2017 
Other Digital Version 
37, 38, ...
All display equations 
I bought the book from Amazon, in Kindle format.
I'm not sure if this is an O'Reilly problem or a Kindle problem, but most of the display equations look terrible: The math font is about five times larger than the text font, the symbols overlap, and some equations are clipped so that they are illegible.
(Other than that, I'm very happy with the actual content of the book.)
Note from the Author or Editor: Thanks for your feedback. I'm really sorry about this issue, I just forwarded your message to the production team at O'Reilly, I'll get back to you as soon as they answer (they are usually very responsive). Could you please specify which Kindle model you have exactly, it might be specific to a particular model, I'm not sure (I don't have a Kindle, so I can't reproduce the issue).
In the meantime I'll extract all the math equations from the book and post them to the GitHub project (https://github.com/ageron/handson-ml).
Hope this helps,
Aurélien

Anonymous 
Jun 21, 2017 
Jul 07, 2017 
Other Digital Version 
39
Towards end (third of a series of bullet points about norm definitions) 
In Kindle edition, the inline formula for the l_k norm is unreadably small. (Earlier on the page, the formula for the Mean Absolute Error is enormous, but this is not a problem, just slightly poor formatting).
Note from the Author or Editor: Thanks for your feedback, and I'm very sorry for the problem you are experiencing.
We had this problem before, but I thought it was fixed around September. If you bought the book before that, could you please try updating it, hopefully this should fix the issue.
I will report this issue nonetheless to O'Reilly, just in case the problem came back for some reason. If this is so, then I will update this message.
When we had equation formatting problems last summer, I created a Jupyter notebook containing all the book's equations. You can get it here:
https://github.com/ageron/handson-ml/blob/master/book_equations.ipynb
Note that GitHub's renderer does not display some of the equations properly, unfortunately, but if you download the notebook and run it in Jupyter, it will display the equations perfectly.
Thanks again for your feedback, and I hope you are enjoying the book despite this formatting issue.
Aurélien

Liam Roche 
Nov 23, 2017 
Oct 12, 2018 
PDF 
Page 39
The second paragraph below Equation 2-2 
The "Euclidean norm" is misspelled as "Euclidian norm".

Anonymous 
Oct 04, 2018 
Oct 12, 2018 
Printed 
Page 41
second chunk of code in the box 
At least on my system (Ubuntu/Kubuntu), pip3 install --user installs the virtualenv command in ~/.local/bin, which is not in my PATH. Calling virtualenv provokes a response that the user should install it using sudo apt-get install virtualenv. Doing that leads to problems with mixing versions. So a note on adding ~/.local/bin to the PATH could be useful for inexperienced Python programmers like myself, both in the book and on the GitHub page. BTW, you complained in the errata that \mathbf{\theta} did not work. It should be \bm{\theta}.
Note from the Author or Editor: Hi Jan,
Thanks for your feedback. I'm sorry you had trouble with the installation instructions: I actually hesitated to add any installation instructions to my book, because it's really the sort of things that varies a lot across systems, and changes over time. I'll add a footnote as you suggest, it's a great idea.
Regarding the bold font theta, someone suggested using \bm instead of \mathbf a while ago, and I tried, but it did not work. For example, go to latex2png.com and try running "x \mathbf{x} \bf{x} \theta \mathbf{\theta} \bf {\theta}". I see a normal x, then 2 identical bold x, then 3 identical normal thetas. O'Reilly ended up converting many of the equations to MathML, and then it worked fine.
Cheers,
Aurélien

Jan Daciuk 
Nov 15, 2017 
Oct 12, 2018 
Printed 
Page 45
Figure 2-6 
(First Edition)
In a screenshot of Figure 2-6,
housing.info should be housing.info(), as it is in the notebook on GitHub
Note from the Author or Editor: Thanks for your feedback. The parentheses are actually in very light green in the original image, and when converted to black & white for the printed version, they almost disappear (if you look closely, you can barely see them in very light gray).
I've updated the image and contacted the production team to make sure they'll include the new image in future printed editions.

Haesun Park 
May 16, 2017 
Jun 09, 2017 
Printed 
Page 45
Figure 2-5 
When I run the code in Figure 2-5 I get a FileNotFoundError:
file b'datasets\\housing\housing.csv' does not exist.
The code calls load_housing_data but I don't see where fetch_housing_data is called. You have to fetch the data to create the datasets/housing directory. What might I be missing?
Note from the Author or Editor: Thanks for your question. Indeed, you need to call the fetch_housing_data() function, or else the load_housing_data() function will not find the data file (housing.csv). Just below the definition of the fetch_housing_data() function, I wrote "Now when you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this directory". Perhaps I should have been more explicit: "You should now call the fetch_housing_data() function: it will create a datasets/housing directory in your workspace...", etc.
Hope this helps.

Iain Watson 
Dec 01, 2018 
Dec 07, 2018 
Printed 
Page 46
after 2nd paragraph 
The call to `value_counts()` is displayed as being executed in a standard Python REPL (e.g. with a `>>>` prompt) without any explanation. The use of the Python REPL continues on pages 49, 52, 56, etc.
Perhaps it is worth clarifying that `>>>` implies you can enter the code in the Jupyter notebook or in the REPL.
Note from the Author or Editor: Thanks for your feedback.
Regarding the usage of >>>, I use it for better readability when there's a mix of code and outputs.
For example, consider the following code:
a = 1
b = a + 3
c = a * b
Say I want to show the value of b, I could write:
a = 1
b = a + 3
print(b) # => 4
c = a * b
But that's a bit ugly, especially if the value of b is long or spans multiple lines. So instead I could do something like this:
Code:
a = 1
b = a + 3
print(b)
c = a * b
Output:
4
But then the reader has to go back and forth between the code and the output to understand everything. So perhaps this instead?
a = 1
b = a + 3
print(b)
# 4
c = a * b
That's not bad, actually, but I prefer the >>> notation, because it's more common for python code, it looks like I copy/pasted a piece of python console:
>>> a = 1
>>> b = a + 3
>>> b
4
>>> c = a * b
Now it looks exactly like what you would get in the interpreter, so hopefully it's both clear and natural.
But when there's nothing particular to display, I don't use >>>, I simply write the code. Perhaps this is what confused you? Why do I use this notation sometimes and not other times? I guess I could add a footnote for the first code example that uses this notation, something like this:
When a code example contains a mix of code and outputs, I will use the same format as in the python interpreter, for better readability: the code is prefixed with >>> (or ... for indented blocks), and the outputs have no prefix.
Thanks for the suggestion,
Cheers,
Aurélien

Anonymous 
Dec 19, 2017 
Oct 12, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 47
2nd paragraph 
The text states "slightly over 800 districts have a median_house_value equal to about $500,000."
I suppose you meant "slightly over 1,000 districts", looking at the peak in the relevant histogram (the x-axis numbers overlap in the book, but it's the lonely peak at the right).
Unless you consider 1,000+ to also be "slightly over 800" :)
Note from the Author or Editor: Good catch, thanks! I actually meant to write "equal to about $100,000".

Wouter Hobers 
May 11, 2017 
Jun 09, 2017 
Printed 
Page 50
4 
There's a minor error on page 50 that produces a bug for Python versions < 3.
This line uses the hash() function, whose digest returns the ASCII code number in Python 3:
return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
In Python 2 this returns a character, which breaks the entire function.
Replacing the line with this fixes it:
if sys.version[0] == '3':
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
else:
    return ord(hash(np.int64(identifier)).digest()[-1]) < 256 * test_ratio
Cheers
Note from the Author or Editor: Thanks for your feedback. Indeed, this function only works with Python 3.
In the notebook, I proposed a version that supports both Python 2 and 3:
def test_set_check(identifier, test_ratio, hash):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio
It's kind of ugly, so I decided to just present the Python 3 version, but I should have added a comment to make it clear.
Side note: most scientific python libraries have announced that they will stop supporting Python 2 very shortly (e.g., NumPy will stop releasing new features in Python 2 at the end of this year, see https://python3statement.org/ for more details).
One problem with the implementation above is that it uses the MD5 hash and only looks at a single byte, so the cut between train and test is rather coarse. Since then, I found a better option using CRC32 (much faster and returning 4 bytes, so it's much more fine grained), which I will be proposing in future releases:
from zlib import crc32
def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
This works just as well on Python 2 and Python 3 (in Python 3, you could remove "& 0xffffffff", which is only needed because crc32() returns a signed int32 in Python 2, while it is unsigned int32 in Python 3).
Hope this helps!
Aurélien

Anonymous 
Apr 01, 2018 
Oct 12, 2018 
Printed 
Page 51
first full paragraph, 5th line 
Hi,
The third sentence of the first full paragraph on page 51 ends with "...they don't just pick 1,000 people randomly in a phone booth."
That's true, but I suspect you intended to write "phone book." (It's hard to fit 1K people into a phone booth.)
Thanks so much for writing this book!
Best,
Jeff
Note from the Author or Editor: Hi Jeff,
Ha ha, very funny! :) I fixed the error, thanks a lot for your feedback and your sense of humor.
Cheers,
Aurélien

Jeff Lerman 
Nov 28, 2017 
Jan 19, 2018 
Printed 
Page 51
3rd paragraph 
You say "most median income values are clustered around $20,000-$50,000, but some median incomes go far beyond $60,000." However, as you mention on page 48, median income is not expressed in US dollars, e.g. "it has been scaled and capped at 15."
It would be clearer to refer to the scaled values since we don't know how they map to US dollars.
Note from the Author or Editor: Thanks for your feedback. Indeed, I forgot to mention that the median income values represent roughly tens of thousands of dollars (from 1990), so for example 3 actually represents roughly $30,000. My apologies! I updated the book to make this clear.
Hope this helps,
Aurélien

Anonymous 
Dec 19, 2017 
Oct 12, 2018 
Printed 
Page 51
2nd paragraph 
This is an erratum about the errata!
Many of the pages listed for errors in the printed version are incorrect. For example, the error that is reported as being on p. 73 (about Figure 2-8) is actually on p. 51.
Note from the Author or Editor: Thanks for your feedback. I fixed all the early errata, and sometimes this resulted in slightly longer or shorter paragraphs, so the text layout had to be adjusted. As a result, the pages mentioned in the earlier errata are slightly off (usually by a couple pages) in the latest releases. Since these errors concern only the earlier releases, we should probably keep the page numbers from these releases, don't you think? I'll talk to O'Reilly about this to see what we can do.

Peter Drake 
Mar 01, 2018 
Oct 12, 2018 
Printed 
Page 52
Figure 2-9 
It would be nice to show the command used to generate "Figure 2-9 Histogram of income categories", perhaps in a footnote.
Note from the Author or Editor: Good suggestion, thanks. I added the line of code that plots this histogram:
housing["income_cat"].hist()

Anonymous 
Dec 19, 2017 
Oct 12, 2018 
Printed 
Page 52
2nd & 3rd Paragraph 
The 3rd paragraph at page 52, i.e:
"Let's see...
...
...
... float64"
should be placed before the second paragraph, i.e.:
"Now...
...
...
... test_index]"
Note from the Author or Editor: Thanks for your feedback. The two paragraphs should not be inverted: at the end of page 51, we have just created the income_cat attribute. Then the "Now you are ready..." paragraph creates the training set and test set (strat_train_set and strat_test_set) using stratified sampling.
Finally, we want to check whether or not stratified sampling actually respected the income category proportions of the full set. For this, we start by showing how to measure the proportions on the full set, and we explain that the same can be done to measure the proportions on the test set that we just generated.
However, I understand that it can be confusing to say "let's see if this worked" and not explicitly use what we just generated in the code example, so I will replace "housing" with "strat_test_set" in the second example code on page 52 to make things clearer, and I will replace "in the test set" with "in the full dataset" in the sentence just after the code example, like this:
"""
With similar code you can measure the income category proportions in the full dataset.
"""
Thanks for helping clarify this page!

Panos Kourdis 
Oct 15, 2017 
Nov 03, 2017 
Printed 
Page 52
1st paragraph 
Dear Aurelien, I will start by thanking you for this amazing book!! I am imbibing it like a magical stream of knowledge. Loving the examples and the writing style. My desire is to understand the hands-on part in its entirety, and for that reason when facing challenges I get stuck with my brain unwilling to move past a paragraph, no matter how "insignificant" it might be in the grand scheme of ML.
Chapter 2, Create a Test Set (Ninth Release, 2018-10-12)
The passage that I would love to see clarified is: "The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), and rounding up using ceil..". The hard part for me to comprehend is "divide by 1.5" (half of the readers would probably understand why; the other half would probably not even care): why was the number 1.5 selected for division? Perhaps the text within the parentheses could be expanded to include "in your future work you will have to pick a number that would be close to the lower cutoff value, with the lower cutoff value in our example being 2", or (if I misunderstood the meaning of 1.5) it would instead include "remember the value of 1.5 because this is the gold-standard number the universe has reserved for this purpose".
My (poor) understanding of this division by 1.5 is based on the context of the median_income that you have outlined "most median income values are clustered around 2 to 5" so you are picking the divisor as a number that is close to 2. Am I right?
I have another idea, that perhaps, it would be easier to "bin" the values per range based on [0.0, 2.0, 3.0, 4.0, 5.0] (while including the out-of-bound values) using pd.cut()?
Note from the Author or Editor: Thanks for your feedback and your kind words, I'm really glad you are enjoying my book!
I was just trying to define some useful strata. If you look at Figure 2-8, you see that most incomes are between 1 and 9 (tens of thousands of dollars), with the bulk between 1.5 and 6. It seemed reasonable to define 5 strata, from 0 to 1.5, then 1.5 to 3, then 3 to 4.5, then 4.5 to 6, and finally 6 and above. By dividing the income by 1.5, rounding up and cropping above 5, that's exactly what I get. The following code is equivalent, and would probably have been clearer:
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5])
I'll clarify this paragraph. Thanks again!
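The claimed equivalence between "divide by 1.5, round up, cap at 5" and the five explicit bins can be checked with a short sketch. This uses only the standard library rather than pandas: bisect_left stands in for right-inclusive binning, and the function names and sample incomes are made up for the illustration.

```python
import math
from bisect import bisect_left

# Upper edges of strata 1 to 4; stratum 5 is open-ended (6 and above).
BIN_EDGES = [1.5, 3.0, 4.5, 6.0]

def income_cat_by_division(median_income):
    # Divide by 1.5, round up, then cap at category 5.
    return min(math.ceil(median_income / 1.5), 5)

def income_cat_by_bins(median_income):
    # Right-inclusive bins (0, 1.5], (1.5, 3], (3, 4.5], (4.5, 6], (6, inf).
    return bisect_left(BIN_EDGES, median_income) + 1

samples = [0.4999, 1.5, 1.6, 2.9, 3.0, 3.2, 4.5, 5.0, 6.0, 8.3, 15.0001]  # made-up incomes
for income in samples:
    print(income, income_cat_by_division(income), income_cat_by_bins(income))
```

Both columns agree for every positive income in the sample, including the values that sit exactly on a bin edge.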

Tim B. 
Feb 09, 2019 
May 24, 2019 
Printed 
Page 57
2nd paragraph 
Hi,
First of all, thanks for this great book! I have been recommending it to all my colleagues who are interested in Machine Learning.
Not sure if the type of error I selected is appropriate, or if this is considered an error at all; but on page 57 we import scatter_matrix as follows:
from pandas.tools.plotting import scatter_matrix
As of Pandas 0.20, pandas.tools.plotting has been deprecated and pandas.plotting should be used instead.
Note: I'm running Jupyter within the tensorflow/tensorflow:latest-py3 Docker container, which comes with the latest most common data science Python libs already installed.
Reference: https://hub.docker.com/r/tensorflow/tensorflow/
Note from the Author or Editor: Hi Gabriel,
Thanks for your very kind words, I'm glad you are enjoying my book.
Indeed, the scatter_matrix() function was moved in Pandas 0.20. I updated both the Jupyter notebook and the book.
Thanks for your feedback,
Aurélien

Gabriel Nieves Ponce 
Nov 11, 2017 
Jan 19, 2018 
Printed 
Page 66
Middle in the page 
(1st Edition)
Last sentence of the paragraph below a code block.
"The names can be anything you like."
But actually step name can't include double underscore(__). :)
Note from the Author or Editor: Indeed, the only constraint is that it should not contain double underscores, thanks for pointing it out.

Haesun Park 
May 21, 2017 
Jun 09, 2017 
Printed 
Page 66
Code sample 
For the custom transformer, the variable for the index of the households feature is named "household_ix". For consistency, I recommend it be named to "households_ix", since the other indices match the pluralization of their respective features (rooms_ix and bedrooms_ix).
Note from the Author or Editor: Good point, thanks. I updated the code to replace household_ix with households_ix.

Charley Grossman 
Oct 25, 2018 
Dec 07, 2018 
Printed 
Page 67

The fourth paragraph starts "Now it would be nice if we could feed a Pandas DataFrame directly into our pipeline". This could be confusing because this is actually what we just did a few lines above when we called num_pipeline.fit_transform(housing_num), because housing_num is a Pandas DataFrame. Could be reworded/clarified a bit.
Note from the Author or Editor: Good point! What I meant is that it would be nice to be able to pass a Pandas DataFrame containing nonnumerical attributes directly into our pipeline. I'll correct the sentence accordingly, thank you very much for your feedback.

Michael Padilla 
Oct 11, 2017 
Nov 03, 2017 
PDF 
Page 67
2nd paragraph 
The text reads: "Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the variance so that the resulting distribution has unit variance." I think that it should be "..it divides by the standard deviation..". According to the StandardScaler source code it also divides by the standard deviation and not the variance.
Note from the Author or Editor: Good catch! Of course you are right, first subtract the mean, then divide by the standard deviation, not the variance. Thanks a lot.
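The difference is easy to check numerically. The sketch below uses a made-up feature (an assumption for illustration) and shows that dividing by the standard deviation yields unit variance, whereas dividing by the variance does not:

```python
import numpy as np

rng = np.random.RandomState(42)
# Hypothetical feature with a standard deviation near 3.
X = rng.normal(loc=10.0, scale=3.0, size=(1000, 1))

centered = X - X.mean(axis=0)
by_std = centered / X.std(axis=0)   # standardization
by_var = centered / X.var(axis=0)   # what the original sentence described

print(by_std.var(axis=0))  # variance after dividing by the standard deviation
print(by_var.var(axis=0))  # variance after dividing by the variance
```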

Anonymous 
Feb 08, 2018 
Oct 12, 2018 
Printed 
Page 67
The warning section 
The text reads "Only then you can use them to transform the training set and the test set (and new data)." I think it makes more sense to replace the training set with validation set here?
Note from the Author or Editor: Thanks for your feedback. I was thinking something like this:
scaler = StandardScaler()
scaler.fit(X_train)
scaler.transform(X_train)
scaler.transform(X_validation)
scaler.transform(X_test)
scaler.transform(X_new)
We must only fit the scaler to the training set, but then we can use it to transform all the data (training set, validation set, test set, new data).
However, it's true that very often, we fit the training set and transform it in just one operation:
X_train_scaled = scaler.fit_transform(X_train)
But even then, we are fitting the training set and then using the fitted scaler to transform the training set. It's just that it's happening in one method call instead of two.
I'll file this as "request for clarification", as I don't think it's a mistake, but I'll try to clarify that sentence. Thanks again.

Mika Qvist 
May 10, 2018 
Oct 12, 2018 
Other Digital Version 
68
3rd paragraph 
Hello Sir,
In the 2nd chapter "End-to-end Machine Learning project" under the section "Get the data" in the subsection "Take a quick look at the data structure", the lines read as:
"When you looked at the top 5 rows, you noticed that the values in that column were repetitive, which means that it is probably a categorical attribute ".
I believe it should read as:
"When you looked at the top 5 rows, you noticed that the values in the "ocean_proximity" column were repetitive, which means that it is probably a categorical attribute ".
It was a little difficult to spot which columns were repetitive from the book. I had to refer to the Jupyter notebook to spot that column.
Note from the Author or Editor: I can see how this can be confusing, thanks for pointing it out. Yes, I replaced "in that column were repetitive" with "in the `ocean_proximity` column were repetitive".

Navin Kumar 
May 27, 2017 
Jun 09, 2017 
Printed 
Page 68
code snippet under "And you can run the whole pipeline simply:" 
Hello Mr. Geron,
In Chapter 2, in the very last bit of the 'Transformation Pipelines' section, you run the line:
>>> housing_prepared = full_pipeline.fit_transform(housing)
But when I actually attempt to run this code in the notebook, I get the following error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
I can't seem to find a fix to this error, more specifically, I'm not quite sure where it's getting 3 arguments from.
I didn't make any changes anywhere in the notebook code and have just been running through the blocks sequentially, following along with the text. Why would it all of a sudden break at this point?
Any help would be greatly appreciated!
Thank you!!
Note from the Author or Editor: Thanks for your feedback, and my apologies for the late response, I've had a very busy summer.
The LabelEncoder and LabelBinarizer classes were designed for preprocessing labels, not input features, so their fit() and fit_transform() methods only accept one parameter y instead of two parameters X and y. The proper way to convert categorical input features to one-hot vectors should be to use the OneHotEncoder class, but unfortunately it does not work with string categories, only integer categories (people are working on it, see Pull Request 7327: https://github.com/scikit-learn/scikit-learn/pull/7327). In the meantime, one workaround *was* to use the LabelBinarizer class, as shown in the book. Unfortunately, since Scikit-Learn 0.19.0, pipelines expect each estimator to have a fit() or fit_transform() method with two parameters X and y, so the code shown in the book won't work if you are using Scikit-Learn 0.19.0 (and possibly later as well). Avoiding such breakage is the reason why I specified the library versions to use in the requirements.txt file (including scikit-learn 0.18.1). A temporary workaround (until PR 7327 is finished and you can use a OneHotEncoder) is to create a small wrapper class around the LabelBinarizer class, to fix its fit_transform() method, like this:
from sklearn.preprocessing import LabelBinarizer

class PipelineFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        # Ignore y: LabelBinarizer.fit_transform() accepts a single argument.
        return super(PipelineFriendlyLabelBinarizer, self).fit_transform(X)
I'm updating the notebook for chapter 2 to make this clear.
Thanks again for your feedback. :)
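To show how such a wrapper might be used, here is a small sketch that places it in a one-step pipeline and feeds it a made-up categorical column (the sample values and the step name are assumptions for illustration only):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer

class PipelineFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        # Ignore y so that the signature matches what Pipeline passes.
        return super().fit_transform(X)

# Hypothetical sample of a categorical column.
ocean_proximity = np.array(["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"])

cat_pipeline = Pipeline([
    ("label_binarizer", PipelineFriendlyLabelBinarizer()),
])
one_hot = cat_pipeline.fit_transform(ocean_proximity)
print(one_hot.shape)  # one row per instance, one column per category
```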

Anonymous 
Aug 18, 2017 
Nov 03, 2017 
Printed 
Page 68
2nd paragraph 
Hello,
The text says on page 68, second paragraph: "...(it also has a fit_transform method that we could have used instead of calling fit() and then transform())."
The code for the pipeline that 'it' refers to is given on the previous page as
housing_num_tr = num_pipeline.fit_transform(housing_num)
The first passage quoted above is therefore incorrect as 'fit_transform' was in fact used. It's only mildly confusing when reading :)
Thanks,
Michael
Note from the Author or Editor: Good catch, it was indeed confusing. I fixed the sentence like this:
The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a `StandardScaler`, which is a transformer, so the pipeline has a `transform()` method that applies all the transforms to the data in sequence (and of course it also has a `fit_transform()` method, which is the one we used).
Thanks!

Michael Heitmeier 
Jan 03, 2019 
Mar 08, 2019 
Printed 
Page 69
First Code Snippet 
I was working through the End-to-End Machine Learning Project in Chapter 2 and ran into an issue with the CategoricalEncoder. I kept getting an error that I couldn't import it despite having the most recent version of Python. A quick internet search revealed that they have considered no longer supporting this functionality, so I couldn't find a place to update my package with this functionality. I was able to get the code working by looking up the previous code involving the LabelBinarizer, and then using the errata on a previous post about this page. Hope you can address this in future editions.
Thanks for a great book.
 Weston Ungemach
Note from the Author or Editor: Thanks for your feedback and your kind words. One-hot encoding is a bit of a mess right now in Scikit-Learn: the LabelBinarizer is really only meant for labels, not for input features, even though it's possible to use it by hacking a bit. The CategoricalEncoder from the upcoming 0.20 version of Scikit-Learn used to work well (I copied it in my notebooks and it was fine), but there's a discussion going on right now about replacing it with another class, which may be named OneHotEncoder (replacing the existing one) or DummyEncoder, or perhaps something else. See the discussion here:
https://github.com/scikit-learn/scikit-learn/issues/10521
In the meantime, you can use the code from the notebook in chapter 2. It works well. If you need to use it in your project, just save it to a file such as categorical_encoder.py and import from that file. Then when the Scikit-Learn team decides what to do in 0.20, you can probably do a simple update of the imports, class name and parameter names, but the functionality should remain the same... I hope!
I will definitely address this in future editions, but it's hard to know in what direction they will go.
Hope this helps,
Aurélien

Weston Ungemach 
Mar 11, 2018 
Oct 12, 2018 
PDF 
Page 69
Last line of page 
room_ix should be rooms_ix
- bedrooms_per_room = X[:, bedrooms_ix] / X[:, room_ix]

+ bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
Note from the Author or Editor: Great catch. Thanks. This error is now fixed.
Best regards,
Aurélien

Miles Thibault 
Dec 18, 2016 
Mar 10, 2017 
ePub 
Page 72
2nd paragraph (code) 
I believe that the denominator in the equation below is incorrect. Should be dividing by households rather than population.
ERROR > housing["rooms_per_household"] = housing["total_rooms"]/housing["population"]
CORRECT > housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
Note from the Author or Editor: Thanks a lot for your feedback. I fixed the error, it will disappear from the electronic versions shortly, and the printed copy will not contain it.
Best regards,
Aurélien

Liam Culligan 
Mar 06, 2017 
Mar 10, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 73
4th paragraph ( last line) 
Hello Sir,
In the 2nd chapter "End-to-end Machine Learning project" under the section
"Get the data" in the subsection "Create a test set", the lines read as:
"Suppose you chatted with experts who told you that the median income is a very
important attribute to predict median housing prices. You may want to ensure that
the test set is representative of the various categories of incomes in the whole dataset.
Since the median income is a continuous numerical attribute, you first need to create
an income category attribute. Let’s look at the median income histogram more closely
(see Figure 2-9):"
The last line (see Figure 2-9) I believe should be:
(see Figure 2-8)
Because the subsequent line says:
"Most median income values are clustered around 2-5 (tens of thousands of dollars),
but some median incomes go far beyond 6".
In Figure 2-9, the median income is capped at 5. In Figure 2-8 the median income goes beyond 6.
I am sorry the page numbers do not seem to match. Hence the long message. I am using a pre-draft version from Safari online.
Note from the Author or Editor: Good catch, you are correct, thanks a lot. I fixed this. Indeed, instead of (see Figure 2-9), the text should read (see Figure 2-8).

Navin Kumar 
May 27, 2017 
Jun 09, 2017 
PDF 
Page 73
15th line 
From Scikitlearn 0.18, train_test_split is included in sklearn.model_selection.
Note from the Author or Editor: Thanks a lot for your feedback. This error is now fixed.
Best regards,
Aurélien

Daisuke 
Oct 31, 2016 
Mar 10, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 83
Paragraph 1 in Implementing cross validation box 
In the print version, the cross_val_score() function is not defined until page 84. If you are working through a notebook as you go, this created a function not defined error.
"same thing as the preceding cross_val_score() code".
"preceding" should be "following"
Note from the Author or Editor: Good catch, thank you. I think the text was initially in the right order, but the "Implementing Cross-Validation" section had to be moved around for pagination reasons. I fixed the text to make things clearer:
"""
Occasionally you will need more control over the cross-validation process than what Scikit-Learn provides off-the-shelf. In these cases, you can implement cross-validation yourself; it is actually fairly straightforward. The following code does roughly the same thing as Scikit-Learn's `cross_val_score()` function, and prints the same result:
"""

Stephen Jones 
Apr 28, 2017 
Jun 09, 2017 
Printed 
Page 83

On pages 69, 83 and 124, it is said that cross-validation can be used to validate a model.
But in the method cross_validation_score() on page 83, the model itself (sgd_clf) is not evaluated at all. It is cloned to clone_clf and modified (by the fit method). So the evaluated model is a new model, not the one passed into cross_validation_score.
To summarize, as per my understanding, cross-validation is used to evaluate learning algorithms and their hyperparameters. To validate a model, we should use the test set.
Thank you.
Note from the Author or Editor: Thanks for your feedback.
Indeed, you are correct, there is some ambiguity when I say "evaluate a model": in some cases I mean "evaluate the choice of model architecture & hyperparameters" and in other cases I mean "evaluate an actual trained model, with its architecture & trained parameters".
The former (cross-validation) is typically done on the training set: in K-fold CV, the training set is split into K pieces, and the same model *architecture* is trained K times (on the training set minus piece #i, for i in [1, K]), and then evaluated on the piece it was not trained on.
The latter (evaluation of a trained model) is typically done on the validation set (when not using cross-validation) for model selection (which should be called model architecture & hyperparameter selection), or on the test set (to evaluate the generalization error).
I'll see what I can do to clarify this, thanks a lot for bringing this problem to my attention.
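The distinction can be seen directly in code. In the sketch below (the toy dataset is an assumption, standing in for the real training set), cross_val_score() works on clones of the estimator, so the object passed in is still unfitted afterwards:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical toy dataset standing in for the training set.
X_train, y_train = make_classification(n_samples=300, random_state=42)

sgd_clf = SGDClassifier(random_state=42)
scores = cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

# The estimator passed in was only cloned, so it has no learned parameters yet.
still_unfitted = not hasattr(sgd_clf, "coef_")
print(len(scores), still_unfitted)

# A trained model only exists after an explicit fit on the training set.
sgd_clf.fit(X_train, y_train)
```

The scores therefore describe the choice of algorithm and hyperparameters, while the model fitted on the last line is the one that would be evaluated on the test set.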

Donald Zhang 
Feb 03, 2019 
Mar 08, 2019 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 86
Code sample 
>>> precision_score(y_train_5, y_pred)
Should be
>>> precision_score(y_train_5, y_train_pred)
Note from the Author or Editor: Good catch, thanks. I tested every line of code before adding it to the book, but I guess I must have renamed this variable in the notebook at one point, and when I updated the book I missed a couple occurrences. Sorry about that!
Note that there's the same problem a few lines below:
>>> f1_score(y_train_5, y_pred)
should be:
>>> f1_score(y_train_5, y_train_pred)
I fixed these issues, but it may take a while for them to propagate to the digital version.

Stephen Jones 
Apr 28, 2017 
Jun 09, 2017 
Mobi 
Page 86
1st paragraph (Chapter 2, Frame the Problem, Paragraph 4) 
The published book states "More specifically, this is a multivariate regression problem since the system will use multiple features to make a prediction (it will use the district’s population, the median income, etc.). In the first chapter, you predicted life satisfaction based on just one feature, the GDP per capita, so it was a univariate regression problem.”
Instead of multivariate regression, it should be multiple regression. Multivariate regression is where there is more than 1 dependent variable, while multiple regression refers to more than 1 predictor/independent variable  which is this case.
Note from the Author or Editor: Oops, indeed you are right, I should have said "multiple", not "multivariate", I just fixed this.
Thanks!

Sean 
Nov 13, 2018 
Dec 07, 2018 
Printed 
Page 89
body of plot_precision_recall_vs_threshold() function 
super minor issue. the body of the plotting function sets the location of the legend to "upper left" while the image shows the legend location at "center left".
for fix, simply change:
plt.legend(loc='upper left')
to:
plt.legend(loc='center left')
PS  great book btw :)
Note from the Author or Editor: Good catch! :) Indeed, for some reason I changed the code from "center left" to "upper left" at one point, and I did not update the figure, not sure why. I'll revert to "center left", thanks for pointing this out.
Cheers,
Aurélien Géron

Anonymous 
Jun 17, 2017 
Aug 18, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 93
code sample 
plt.legend(loc="bottom right")
should be
plt.legend(loc="lower right")
Note from the Author or Editor: Good catch, thanks.

Stephen Jones 
Apr 28, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 95
code sample 
>>> sgd_clf.classes[5]
should be
>>> sgd_clf.classes_[5]
Note from the Author or Editor: Good catch, thanks.

Stephen Jones 
Apr 28, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 98
code sample 
plot_digits function is not defined in the book, only in the corresponding notebook. This is problematic when working through the book only.
Note from the Author or Editor: The `plot_digits()` function is really uninteresting, it just plots an image using Matplotlib. I preferred to leave it out of the book to avoid drowning the reader in minor details. However, I agree that I should have added a note about it, for clarity. I just added the following note:
"(the `plot_digits()` function just uses Matplotlib's `imshow()` function, see this chapter's Jupyter notebook for details)"

Stephen Jones 
Apr 28, 2017 
Jun 09, 2017 
Printed 
Page 100
the last paragraph 
The last sample code on page 100,
>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)
>>> f1_score(y_train, y_train_knn_pre, average='marco'),
may be corrected to
>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
>>> f1_score(y_multilabel, y_train_knn_pred, average='marco').
Note from the Author or Editor: Thanks for your feedback, you are absolutely right (with one minor tweak: it's "macro", not "marco"). I updated the book and the jupyter notebook.
Cheers,
Aurélien

Anonymous 
Jul 05, 2017 
Aug 18, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 101
code block in bottom third of page 
In the code example, the random noise generated for the training set is overwritten with noise for the test set before it is applied:
noise = rnd.randint(0, 100, (len(X_train), 784))
noise = rnd.randint(0, 100, (len(X_test), 784))
X_train_mod = X_train + noise
X_test_mod = X_test + noise
Just switch the second and third line to get it right (as it is in the notebook on github):
noise = rnd.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = rnd.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
Note from the Author or Editor: Good catch, thanks! Indeed, it should be written as you indicate, just like in the notebook. I'm not sure how this error happened. I fixed it now (but it may take a while to propagate to the digital versions).

Lars Knipping 
May 09, 2017 
Jun 09, 2017 
Printed 
Page 103
4th exercise 
1st edition, 1st release
4th exercise in chapter 3,
https://spamassassin.apache.org/publiccorpus/
->
https://spamassassin.apache.org/old/publiccorpus/
Note from the Author or Editor: Good catch, thanks. Indeed, the old link is now broken, it should be replaced with:
http://spamassassin.apache.org/old/publiccorpus/
I'll update the book.
Cheers,
Aurélien

Haesun Park 
Jul 06, 2017 
Aug 18, 2017 
Printed 
Page 107
1,3 
For consistency, the greek letters θ and Θ, since they are representing a vector and matrix quantity, respectively, should be boldface. Unless there is a literature or specified in the book convention which I am missing.
Note from the Author or Editor: Thanks for your feedback. You are right that these thetas should be in bold font since they represent vectors and matrices. I actually wrote the equations in the book using LatexMath, and I did write \mathbf{\theta} or \mathbf{\Theta} everywhere (except when they represent scalars, such as \theta_0, \theta_1, and so on), but it seems that the bold font did not always show up in the rendering phase, for some reason. Try rendering \theta \mathbf{\theta} \Theta \mathbf{\Theta} using latex2png.com, and you will see that the second theta is not rendered in bold font. I suspect that not all fonts support bold font thetas, and O'Reilly used a rendering tool based on such a font.
This was partly solved by converting equations to MathML, but it's a tedious manual process, and it seems we have missed a few. I will continue to try to fix all missing bold fonts. In the meantime I hope readers will not be too confused, hopefully the text makes it clear that we are talking about vectors and matrices.
Thanks again!

Panos Kourdis 
Oct 20, 2017 
Nov 03, 2017 
Printed 
Page 109
4th paragraph 
array([[4.21509616],[2.77011339]])
should be
array([[3.86501051],[3.13916179]])
Note from the Author or Editor: Thanks for your feedback. Yes, I tried to make the Jupyter Notebooks' output constant across multiple runs, but I forgot a few "np.random.seed(42)" and "random_state=42" and "tf.set_random_seed(42)" here and there, so unfortunately the outputs vary slightly across multiple runs. I'm fixing the notebooks now, so that they will actually be constant, but there's no way to make them output the same thing as the first edition of the book. So I'm fixing the book so that at least the next reprints will be consistent with the (stable) notebooks. Arrrrgh...
That said, the differences are quite small in general, so I believe it should be possible for readers to follow along despite the minor differences.
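As a small illustration of why the seed matters, the following sketch (the helper function name is hypothetical) generates the same kind of noisy linear data twice and checks that fixing NumPy's seed makes both runs identical:

```python
import numpy as np

def noisy_linear_data(seed):
    np.random.seed(seed)                      # fix NumPy's global random state
    X = 2 * np.random.rand(100, 1)
    y = 4 + 3 * X + np.random.randn(100, 1)   # y = 4 + 3x + Gaussian noise
    return X, y

X_a, y_a = noisy_linear_data(42)
X_b, y_b = noisy_linear_data(42)
same = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
print(same)
```

Without the seed call, two runs would produce slightly different data and therefore slightly different fitted parameters.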

Anonymous 
Jun 05, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 109
Below the first code block 
1st edition, 1st release
y = 4 + 3x_0 + Gaussian noise
Should be
y = 4 + 3x_1 + Gaussian noise
Note from the Author or Editor: Good catch, it should indeed be x_1 instead of x_0. Fixing the book now.
Thanks!
Aurélien

Haesun Park 
Jul 05, 2017 
Aug 18, 2017 
Printed 
Page 110
The first paragraph in Computational Complexity section 
The inverse of the dot product of X.transpose and X is said to be an n by n matrix. Shouldn't it be (n+1) x (n+1), because X is an m x (n+1) matrix?
Note from the Author or Editor: Good catch! Yes, X^T X is an (n+1) x (n+1) matrix, not n x n. Fortunately, it does not change the computational complexity of the normal equation, it's still between O(n^2.4) and O(n^3).
By the way, I rewrote part of this section for the next release because I oversimplified it: in particular, Scikit-Learn's LinearRegression class uses an algorithm based on SVD (matrix decomposition) rather than the Normal Equation: theta = np.linalg.pinv(X).dot(y) (this uses the Moore-Penrose pseudoinverse, which is based on SVD).
SVD has a computational complexity of O(m n^2), so it's significantly better than the Normal Equation (but it does not change the conclusions of this section: this class does not support out-of-core, training is linear with regard to the number of instances (m) but quadratic with regard to the number of features (n), so it's slow when there are very many features, e.g. for large images).
Thanks a lot for your feedback!
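Both points can be checked with a few lines of NumPy. This is only a sketch on made-up data: it confirms the (n+1) x (n+1) shape of X^T X once the bias column is added, and that the pseudoinverse gives the same parameters as the Normal Equation on well-conditioned data.

```python
import numpy as np

rng = np.random.RandomState(42)
m, n = 100, 1                               # m instances, n features
X = 2 * rng.rand(m, n)
y = 4 + 3 * X + rng.randn(m, 1)

X_b = np.c_[np.ones((m, 1)), X]             # add the bias column: shape (m, n + 1)
gram = X_b.T.dot(X_b)                       # X^T X has shape (n + 1, n + 1)

theta_normal = np.linalg.inv(gram).dot(X_b.T).dot(y)   # Normal Equation
theta_pinv = np.linalg.pinv(X_b).dot(y)                # SVD-based pseudoinverse
print(gram.shape)
```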

Anonymous 
Dec 21, 2017 
Oct 12, 2018 
Printed 
Page 110
Referring to the whole section 'The Normal Equation' 
The normal equation is stated to determine the parameters of the linear regression model:
\theta = (X^T X)^{-1} X^T y
Later, the computational complexity of calculating the inverse is mentioned. But, there is an alternative not mentioned in the book. One can determine \theta as the solution of the linear equation
(X^T X) \theta – X^T y = 0
This should be explained in the book, as well. Later, in the section “Linear regression with TensorFlow” the same mistake is made. In case this is on purpose, because the calculation shows well how to use TensorFlow, you should at least mention it.
Best regards and thank you for the great book,
Niclas
Note from the Author or Editor: Indeed, you are right, the Normal Equation is not the only way to determine the parameters of the Linear Regression model. I updated the book to also mention the alternative you propose, which leads to using the Moore-Penrose pseudoinverse of X. This in turn requires computing the SVD of X (see chapter 8 for the SVD). This solution is both faster to compute, and it supports collinear data (e.g., a dataset where one or more features are linear combinations of other features), which the normal equation does not support. This is what Scikit-Learn actually uses. Thanks for your suggestion!
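A hedged sketch of both ideas on made-up data: first solving the linear system (X^T X) \theta = X^T y without forming an inverse, then adding an exactly collinear feature, where X^T X becomes singular but the pseudoinverse still yields a least-squares fit. The explicit rcond cutoff is an assumption chosen here so that the near-zero singular value is reliably discarded.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = X.dot(np.array([1.5, -2.0])) + 0.01 * rng.randn(50)

# Solve (X^T X) theta = X^T y directly, without forming the inverse.
theta_solve = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# Add a third column that is an exact linear combination of the first two.
X_collinear = np.c_[X, X[:, 0] + X[:, 1]]
rank = np.linalg.matrix_rank(X_collinear)   # 3 columns, but X^T X is now singular

# The SVD-based pseudoinverse still returns a minimum-norm least-squares solution.
theta_pinv = np.linalg.pinv(X_collinear, rcond=1e-10).dot(y)
```

The two parameter vectors differ in length, but they produce the same fitted values, since the extra column adds no new information.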

Niclas von Caprivi 
Aug 24, 2018 
Dec 07, 2018 
Printed 
Page 118, 119
Last line page 118. Label for Figure 4-10 
The online code suggests that these are the first 20 steps of SGD, not the first 10 steps.
Note from the Author or Editor: You're right, it's the first 20 steps, not the first 10 steps. Fixed, thanks! :)

Calvin Huang 
Jan 05, 2018 
Oct 12, 2018 
Printed 
Page 118
First full paragraph 
You refer to the process of gradually reducing the learning rate as simulated annealing.
Other sources use this term to refer to an algorithm that occasionally makes "uphill" moves (with a probability decreasing over time).
I see the analogy, but I think you're using the terminology in a nonstandard way here.
Note from the Author or Editor: Thanks for your feedback. Indeed, it's an analogy, not an identity. I updated the sentence like so:
This process is akin to simulated annealing, an algorithm inspired by the process of annealing in metallurgy where molten metal is slowly cooled down.
For a more detailed explanation of the link between gradient descent using a learning schedule and simulated annealing, see:
http://leon.bottou.org/publications/pdf/nimes1991.pdf
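To make the idea of a learning schedule concrete, here is a minimal sketch of stochastic gradient descent for linear regression with a gradually decreasing learning rate. The data, the schedule form t0 / (t + t1) and the values of t0 and t1 are assumptions chosen for illustration, not a statement of the book's exact code.

```python
import numpy as np

rng = np.random.RandomState(42)
m = 100
X = 2 * rng.rand(m, 1)
y = 4 + 3 * X + rng.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]          # add the bias term x0 = 1

t0, t1 = 5, 50                           # hypothetical schedule hyperparameters

def learning_schedule(t):
    # The learning rate shrinks as the step count t grows.
    return t0 / (t + t1)

theta = rng.randn(2, 1)                  # random initialization
n_epochs = 50
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.randint(m)             # pick one instance at random
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```

Large early steps let the parameters move quickly, while the small late steps let them settle instead of bouncing around the minimum.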

Peter Drake 
Mar 09, 2018 
Oct 12, 2018 
Printed 
Page 125
Code 
When defining the polynomial_regression pipeline, there's a typo: it's currently Pipeline(( ....)) while I believe it should be Pipeline([ ...]).
Great book. Thank you!
Michael
Note from the Author or Editor: Thanks Michael, indeed Pipelines take lists of tuples, not tuples of tuples. Previous versions of Scikit-Learn would actually accept both, which is why I did not catch this error earlier, but version 0.19 has become strict.
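For reference, a pipeline written with a list of tuples might look like the sketch below. The quadratic toy data, the degree and the step names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
X = 6 * rng.rand(100, 1) - 3
y = 0.5 * X ** 2 + X + 2 + rng.randn(100, 1)   # hypothetical quadratic data

# Square brackets: the steps are a list of (name, estimator) tuples.
polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
polynomial_regression.fit(X, y)
y_pred = polynomial_regression.predict(X)
```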

Michael Padilla 
Oct 20, 2017 
Nov 03, 2017 
Printed 
Page 134
Code snippet at the top 
An omission: in this code you refer to an ndarray called X_train_poly_scaled that isn't defined anywhere (though it's easy to figure out what it should be). In your notebook it's defined naturally as
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler()),
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)
but this isn't given in the book although it's referred to. Not a biggie, but thought you should know. Thanks!
Note from the Author or Editor: Thanks for your feedback. Indeed, I often left some code details out of the book in order to keep it short and focused, but perhaps sometimes I went a bit too far. In this particular case, I think you are right that I should at least say that the data is extended with polynomial features and then scaled (or I should add the few lines of code that define X_train_poly_scaled and X_val_poly_scaled). Since the code example is meant to illustrate early stopping, I'd like to keep it focused so I think I'll go for the first option (a quick explanation in the text).
Thanks a lot!

Michael Padilla 
Oct 23, 2017 
Nov 03, 2017 
Mobi 
Page 137.1
second code block 
- >>> some_data_prepared = preparation_pipeline.transform(some_data)

+ >>> some_data_prepared = full_pipeline.transform(some_data)
Note from the Author or Editor: Thanks for your feedback. This error is now fixed.
Best regards,
Aurélien

Michael Ansel 
Jan 15, 2017 
Mar 10, 2017 
Printed 
Page 144
Fig 4-25 
The image is missing entirely. I have found the same problem in several places:
P 139 fig 4-22
P 149 fig 5-4
P 224 fig 8-12
P 296 fig 11-5
P 300 fig 11-6
First edition fourth release
Note from the Author or Editor: Thanks for your feedback. Yikes! This is bad, I'm really sorry about this. I have reported this problem to O'Reilly, I will get back to you ASAP.

Kelly McDonald 
Dec 25, 2017 
Jan 19, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 147
last sentence 
1st edition, 1st release
In a paragraph below Figure 5-4,
"...(using the LinearSVC class with C=0.1 and the hinge loss..."
should be
"...(using the LinearSVC class with C=1 and the hinge loss..."
Note from the Author or Editor: Good catch, it should indeed be C=1, not C=0.1. I fixed the book.
Thanks!
Aurélien

Haesun Park 
Jul 07, 2017 
Aug 18, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 148
last line of first code sample 
svm_clf.fit(X_scaled)
should be
svm_clf.fit(X)
The pipeline is performing the scaling and there is no "X_scaled" variable elsewhere in the sample.
Note from the Author or Editor: Good catch, thanks. Indeed, it should be:
svm_clf.fit(X)
rather than:
svm_clf.fit(X_scaled)
I tested every code example in the book, but it seems that a few times I updated the notebooks and forgot to update the book. I just wrote a script to compare the code in the notebooks with the code examples in the book, and I'm currently going through every chapter to fix the little differences. This is one of them.

Adam Chelminski 
May 24, 2017 
Jun 09, 2017 
Printed 
Page 148
first set of code 
This code returns an error:
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # only petal length and width
y = (iris["target"] == 2).astype(np.float64)  # only Iris-Virginica
svm_clf = Pipeline((
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
))
svm_clf.fit(X, y)
error:
~\Miniconda3\envs\MyEnv\lib\sitepackages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
224 # transformer. This is necessary when loading the transformer
225 # from the cache.
> 226 self.steps[step_idx] = (name, fitted_transformer)
227 if self._final_estimator is None:
228 return Xt, {}
TypeError: 'tuple' object does not support item assignment
Note from the Author or Editor: Thanks for your feedback.
The code actually works fine up to Scikit-Learn 0.18, but in Scikit-Learn 0.19 (which did not exist when I wrote the book), Pipelines must be created with a list of tuples instead of a tuple of tuples. I updated the Jupyter notebooks to ensure that the code now works with Scikit-Learn 0.19. Basically, use this code instead (note the square brackets):
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
Cheers,
Aurélien

Justin 
Sep 21, 2017 
Nov 03, 2017 
Printed 
Page 148
Figure 5-1 
The image for figure 5-1 is missing.
Also figure 14-8 (p. 397): the image is missing.
Can also confirm the missing figure images as reported by Kelly Dec. 25 2017.
First edition fourth release.
Note from the Author or Editor: Hi David,
Thanks a lot for your feedback, I am so sorry about these issues. I wish I had seen your message earlier, but I was on vacation, my apologies for the delay. I have reported the problem to O'Reilly, I will get back to you ASAP. I'm sure they will fix the problem very quickly.
Aurélien
Edit: the problem is now fixed. If you purchased the book via Amazon, O'Reilly told me that you can request a replacement copy, and it will be sent to you free of charge. Again, I'm really sorry about this problem, and I hope you are enjoying the book despite this issue.

David Thomas 
Jan 04, 2018 
Jan 19, 2018 
Printed 
Page 148
top of page 
Missing figures:
5-1
5-4
Also figures:
4-22
4-25
Note from the Author or Editor: Thanks for your feedback. This problem was due to a printer error in December 2017, and it was corrected in a reprint in January 2018; an O'Reilly representative will contact you for more details about the copy you have.

Edison de Queiroz Albuquerque 
Mar 26, 2018 
Apr 04, 2018 
Printed 
Page 151
Equation 5-1 
1st edition, 1st release
In equation 5-1,
I suggest
\phi_{\gamma} (x, l) = ... or \phi (x, l) = ...
is better than
\phi \gamma (x, l) = ...
Note from the Author or Editor: Good catch, thanks. I fixed this a few weeks ago, it should be okay in the next reprints.

Haesun Park 
Jul 10, 2017 
Aug 18, 2017 
PDF, Mobi 
Page 151
last paragraph 
In chapter 5, the moons toy dataset is not introduced. It first appears in the sentence
"Let’s test this on the moons dataset", yet there is no clarification before or after about what the make_moons call does.
I didn't check the code examples (maybe there is a docstring there), but if you are not at a computer it is difficult to follow.
As you enjoy the problems and the solutions, just adding something like
"The make_moon function creates a set of data points with the shape of two interleaving circles. Check sklearn documentation for more information."
could help a lot.
Cheers,
JJ.
Note from the Author or Editor: Thanks for your suggestion. Indeed, I just pointed to Figure 5-6 where the dataset is represented, but this was not enough. I added the following sentence:
The `make_moons()` function creates a toy dataset for binary classification: the data points are shaped as two interleaving half circles as you can see in Figure 5-6.
Thanks again!
Aurélien

Joaquin Bogado 
Feb 08, 2018 
Oct 12, 2018 
Printed 
Page 151
Code at bottom 
Code at the bottom of the page should include the call to make_moons(); otherwise X and y will be fit from the iris dataset when working through the chapter in order.
Note from the Author or Editor: Good catch, thanks!
Indeed, the following line was missing in the code:
X, y = make_moons(n_samples=100, noise=0.15)
I just fixed this. Thanks again.

Anonymous 
Jul 18, 2019 

Printed 
Page 160
5th line of that page 
The value of vector b is supposed to be -1 I think. Since you introduced -1*t to A, if the vector b is made of -1s then the whole formula after substitution will be
t(wx+b) >= 1, according to my calculation.
And I really think some part of the dot product and matrix-vector multiplication is messed up. Is it a convention in machine learning to use the dot product to represent matrix multiplication?
Note from the Author or Editor: Great catch! The vector b should be full of -1 instead of 1.
The constraints are defined as: p^T a^(i) <= b^(i), for i=1, 2, ..., m
If b^(i) = -1, we can rewrite the constraints as: p^T a^(i) <= -1
Since a^(i) = -t^(i) x^(i), the constraints are: -t^(i) p^T x^(i) <= -1
Which we can rewrite to: t^(i) p^T x^(i) >= 1
For positive instances, t^(i) = +1, and for negative instances t^(i) = -1.
So for positive instances: p^T x^(i) >= 1, which is what we want.
For negative instances: -p^T x^(i) >= 1, therefore: p^T x^(i) <= -1, which is also what we want.
Thanks a lot for your feedback, I fixed the error for the next release.
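The equivalence described in this note can be checked numerically. The following is only a minimal sketch: the parameter vector and the four sample points are hypothetical values chosen for illustration, and the helper names are not from the book.

```python
def satisfies_qp_constraint(p, x, t, b=-1.0):
    """Check the QP-form constraint p^T a <= b, with a = -t * x."""
    a = [-t * xi for xi in x]
    return sum(pi * ai for pi, ai in zip(p, a)) <= b


def satisfies_margin(p, x, t):
    """Check the margin constraint t * p^T x >= 1."""
    return t * sum(pi * xi for pi, xi in zip(p, x)) >= 1.0


# Hypothetical parameter vector (bias term first) and labeled instances.
p = [0.5, 2.0, -1.0]
samples = [
    ([1.0, 1.0, 0.0], 1),    # positive instance, outside the margin
    ([1.0, 0.2, 0.1], 1),    # positive instance, violates the margin
    ([1.0, -1.0, 1.0], -1),  # negative instance, outside the margin
    ([1.0, 0.3, 0.5], -1),   # negative instance, violates the margin
]

# Each pair holds (QP-form result, margin-form result) for one instance.
results = [
    (satisfies_qp_constraint(p, x, t), satisfies_margin(p, x, t))
    for x, t in samples
]
```

With b = -1, both checks agree on every instance, which is the point of the correction.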

Calvin Huang 
Jan 07, 2018 
Oct 12, 2018 
Printed 
Page 162
1st paragraph 
(1st edition, 5th release)
"The resulting vector p will contain the bias term b = p_0 and the feature weights w_i = p_i for i = 1, 2, ⋯, m"
But p is (n+1) dimensional vector not (m+1), so it should be "for i = 1, 2, ⋯, n"
Thanks.
Note from the Author or Editor: As always, you are right, Haesun, thanks a lot. Fixed (replaced m with n).

Haesun Park 
Jan 30, 2018 
Oct 12, 2018 
Printed 
Page 162
Equation 5-9 
I believe here the linear transformation and dot product is mistaken from a math perspective. It's supposed to be a dot b not a transpose dot b.
Note from the Author or Editor: Thanks for your feedback. Indeed, it's probably better to replace `a^T b` with `a.b` in this section. In many cases in Machine Learning, it's more convenient to represent vectors as column vectors (i.e., 2D arrays with a single column), so they can be transposed, used like matrices, and so on. Of course if `a` and `b` are column vectors, then `a^T b` is a 2D array containing a single cell whose value is equal to the dot product of the (1D) vectors corresponding to `a` and `b`. In other words, the result is identical, except for the dimensionality: if `a` and `b` are regular vectors, then `a.b` is a scalar, but if `a` and `b` are column vectors, then `a^T b` is a one-cell matrix. For example:
>>> import numpy as np
>>> np.array([2,3]).dot(np.array([5,7])) # a.b
31
>>> np.array([[2],[3]]).T.dot(np.array([[5],[7]])) # a^T b
array([[31]])
I plan to clean up the whole book regarding this issue, not just chapter 5 (but it may take a bit of time).
Thanks again!

Calvin Huang 
Jan 06, 2018 
Oct 12, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 164
Hinge Loss 
In the last sentence of Hinge Loss, "using any subderivative at t = 0" should be "using any subderivative at t = 1".
Note from the Author or Editor: Good catch! It should indeed be "any subderivative at t=1" instead of "any subderivative at t=0".
Thanks!

Hiroshi Arai 
May 30, 2017 
Jun 09, 2017 
Printed, Safari Books Online 
Page 166
Equation 5-12 
1st Edition 5th Release.
In eq. 5-12, "1-t^{(i)}\hat{w}^T" should be "t^{(i)}-\hat{w}^T" like eq. 5-7.
Thanks.
Note from the Author or Editor: Great catch Haesun, thanks a lot. Indeed, the equation should contain t^{(i)} - \hat{w}^T (three times). Below is the corrected MathML code:
<math xmlns="http://www.w3.org/1998/Math/MathML" mode="display">
<mtable displaystyle="true">
<mtr>
<mtd columnalign="right">
<mover accent="true"><mi>b</mi> <mo>^</mo></mover>
</mtd>
<mtd columnalign="left">
<mrow>
<mo>=</mo>
<mstyle scriptlevel="0" displaystyle="true">
<mfrac><mn>1</mn> <msub><mi>n</mi> <mi>s</mi> </msub></mfrac>
</mstyle>
<munderover><mo>∑</mo> <mfrac linethickness="0pt"><mstyle scriptlevel="1" displaystyle="false"><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow></mstyle> <mstyle scriptlevel="1" displaystyle="false"><mrow><msup><mrow><mover accent="true"><mi>α</mi> <mo>^</mo></mover></mrow> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup><mo>></mo><mn>0</mn></mrow></mstyle></mfrac> <mi>m</mi> </munderover>
<mfenced separators="" open="(" close=")">
<msup><mi>t</mi> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup>
<mo>-</mo>
<msup><mrow><mover accent="true"><mi mathvariant="bold">w</mi> <mo>^</mo></mover></mrow> <mi>T</mi> </msup>
<mo>·</mo>
<mi>ϕ</mi>
<mrow>
<mo>(</mo>
<msup><mi mathvariant="bold">x</mi> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup>
<mo>)</mo>
</mrow>
</mfenced>
<mo>=</mo>
<mstyle scriptlevel="0" displaystyle="true">
<mfrac><mn>1</mn> <msub><mi>n</mi> <mi>s</mi> </msub></mfrac>
</mstyle>
<munderover><mo>∑</mo> <mfrac linethickness="0pt"><mstyle scriptlevel="1" displaystyle="false"><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow></mstyle> <mstyle scriptlevel="1" displaystyle="false"><mrow><msup><mrow><mover accent="true"><mi>α</mi> <mo>^</mo></mover></mrow> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup><mo>></mo><mn>0</mn></mrow></mstyle></mfrac> <mi>m</mi> </munderover>
<mfenced separators="" open="(" close=")">
<msup><mi>t</mi> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup>
<mo>-</mo>
<msup><mrow><mfenced separators="" open="(" close=")"><munderover><mo>∑</mo> <mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow> <mi>m</mi> </munderover><msup><mrow><mover accent="true"><mi>α</mi> <mo>^</mo></mover></mrow> <mrow><mo>(</mo><mi>j</mi><mo>)</mo></mrow> </msup><msup><mi>t</mi> <mrow><mo>(</mo><mi>j</mi><mo>)</mo></mrow> </msup><mi>ϕ</mi><mrow><mo>(</mo><msup><mi mathvariant="bold">x</mi> <mrow><mo>(</mo><mi>j</mi><mo>)</mo></mrow> </msup><mo>)</mo></mrow></mfenced></mrow> <mi>T</mi> </msup>
<mo>·</mo>
<mi>ϕ</mi>
<mrow>
<mo>(</mo>
<msup><mi mathvariant="bold">x</mi> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup>
<mo>)</mo>
</mrow>
</mfenced>
</mrow>
</mtd>
</mtr>
<mtr>
<mtd/>
<mtd columnalign="left">
<mrow>
<mo>=</mo>
<mstyle scriptlevel="0" displaystyle="true">
<mfrac><mn>1</mn> <msub><mi>n</mi> <mi>s</mi> </msub></mfrac>
</mstyle>
<munderover><mo>∑</mo> <mfrac linethickness="0pt"><mstyle scriptlevel="1" displaystyle="false"><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow></mstyle> <mstyle scriptlevel="1" displaystyle="false"><mrow><msup><mrow><mover accent="true"><mi>α</mi> <mo>^</mo></mover></mrow> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup><mo>></mo><mn>0</mn></mrow></mstyle></mfrac> <mi>m</mi> </munderover>
<mfenced separators="" open="(" close=")">
<msup><mi>t</mi> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup>
<mo>-</mo>
<munderover><mo>∑</mo> <mfrac linethickness="0pt"><mstyle scriptlevel="1" displaystyle="false"><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow></mstyle> <mstyle scriptlevel="1" displaystyle="false"><mrow><msup><mrow><mover accent="true"><mi>α</mi> <mo>^</mo></mover></mrow> <mrow><mo>(</mo><mi>j</mi><mo>)</mo></mrow> </msup><mo>></mo><mn>0</mn></mrow></mstyle></mfrac> <mi>m</mi> </munderover>
<mrow>
<msup><mrow><mover accent="true"><mi>α</mi> <mo>^</mo></mover></mrow> <mrow><mo>(</mo><mi>j</mi><mo>)</mo></mrow> </msup>
<msup><mi>t</mi> <mrow><mo>(</mo><mi>j</mi><mo>)</mo></mrow> </msup>
<mi>K</mi>
<mrow>
<mo>(</mo>
<msup><mi mathvariant="bold">x</mi> <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow> </msup>
<mo>,</mo>
<msup><mi mathvariant="bold">x</mi> <mrow><mo>(</mo><mi>j</mi><mo>)</mo></mrow> </msup>
<mo>)</mo>
</mrow>
</mrow>
</mfenced>
</mrow>
</mtd>
</mtr>
</mtable>
</math>
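For readers who find MathML hard to scan, the markup above appears to correspond to the LaTeX below. This is a transcription sketch, assuming the operator between t^{(i)} and the \hat{w} term is a minus sign, as the note states:

```latex
\begin{split}
\hat{b} &= \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m}
  \left( t^{(i)} - \hat{\mathbf{w}}^T \cdot \phi(\mathbf{x}^{(i)}) \right)
 = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m}
  \left( t^{(i)} - \left( \sum_{j=1}^{m} \hat{\alpha}^{(j)} t^{(j)} \phi(\mathbf{x}^{(j)}) \right)^T
  \cdot \phi(\mathbf{x}^{(i)}) \right) \\
 &= \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m}
  \left( t^{(i)} - \sum_{\substack{j=1 \\ \hat{\alpha}^{(j)} > 0}}^{m}
  \hat{\alpha}^{(j)} t^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \right)
\end{split}
```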

Haesun Park 
Mar 01, 2018 
Oct 12, 2018 
Printed 
Page 172
Paragraph right before section Computational Complexity 
It seems that the following paragraph should be part of the caution (scorpion) section, which discusses greedy algorithms and their rationale, instead of part of the main text:
"Unfortunately, finding the optimal tree is known to be an NP-Complete problem:2 it requires O(exp(m)) time, making the problem intractable even for fairly small training sets. This is why we must settle for a “reasonably good” solution."
Note from the Author or Editor: That's a good point: I moved this sentence into the caution section.
Thanks for your feedback,
Aurélien

Jiaqi Liu 
Apr 15, 2018 
Oct 12, 2018 
PDF, Safari Books Online 
Page 175
Eq. 6-3 
In Eq. 6-3, the natural logarithm is used, but entropy for information gain uses the binary logarithm. scikit-learn does too (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_utils.pyx#L86).
So, I recommend changing eq. 6-3 to log_2(..) instead of log(..), and fig 6-1's entropy calculation to 0.445 instead of 0.31.
Thanks.
Note from the Author or Editor: Good point, I just fixed this mistake, thanks a lot!
Note that it does not change the resulting tree, since the value of x that maximizes a function f(x) also maximizes f(x)/log(2) (where "log" denotes the natural logarithm).
Entropy originated in thermodynamics, where the natural log is used. It later spread to other domains, including Shannon's information theory, where the binary log is used, and therefore the entropy can be expressed as a number of bits. In TensorFlow, the softmax_cross_entropy_with_logits() function uses the natural log rather than the binary log. Its value is just used for optimization (the optimizer tries to minimize it), so it does not matter whether they use the binary log or the natural log. If you wanted to get a number of bits, you would have to divide the result by log(2).
By the way, if you are interested, I did a video about entropy, cross-entropy and KL-divergence: https://youtu.be/ErfnhcEV1O8
Thanks again,
Aurélien
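The two values quoted in this entry (0.445 and 0.31) differ only by the factor log(2). A small sketch makes this concrete. The class counts are an assumption chosen for illustration because they reproduce both values; they are not quoted in this entry.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits (binary log) of a node's class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip empty classes
    return -sum(p * math.log2(p) for p in probs)


# Assumed class counts for one tree node (illustrative only).
node_counts = [0, 49, 5]

h_bits = entropy_bits(node_counts)  # binary log: about 0.445
h_nats = h_bits * math.log(2)       # natural log: about 0.31
```

Dividing the natural-log value by log(2) recovers the number of bits, as described above.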

Haesun Park 
Mar 25, 2018 
Oct 12, 2018 
Printed 
Page 188
the paragraph before the Random Patches and Random Subspaces 
Original phrase
Page 188: Chapter 7: Ensemble Learning and Random Forests
"has a 60.6% probability of belonging to the positive class (and 39.4% of belonging to the positive class):"
The words "positive class" appear twice. If 39.4% is the probability of being in the positive class, I think 100 - 39.4%, which is 60.6%, should be the probability of being in the negative class.
Which number is for negative and which one is for positive class, then? Please help, thank you.
Note from the Author or Editor: Good catch, thanks! Indeed, the sentence should be:
"""
For example, the oob evaluation estimates that the first training instance has a 68.25% probability of belonging to the positive class (and 31.75% of belonging to the negative class):
"""

Ekarit Panacharoensawad 
Jul 13, 2017 
Aug 18, 2017 
Printed 
Page 193
Figure 7-8 
1st edition,
In figure 7-8, the titles are learning_rate = 0 and learning_rate = -0.5
I think they should be learning_rate = 1 and learning_rate = 0.5
Why do you use learning_rate - 1 for the title?
Thanks
Note from the Author or Editor: Nice catch, that's indeed a mistake. I just fixed it, future reprints and digital editions will be better thanks to you! :)

Haesun Park 
Sep 01, 2017 
Nov 03, 2017 
Printed, ePub 
Page 193
Equation 7-1 
In the definition of r_j, the denominator is given as the sum of the weights, but this sum is always 1. The weights are initialized so they sum to one (just before equation 7-1), and then normalized again after any update (just below equation 7-3) so they again sum to one.
Note from the Author or Editor: Thanks for your feedback. Indeed, the denominator is always equal to 1, so I could remove it in Equation 7-1. I remember hesitating to do so, but I chose not to because I wanted to show that r_j represents the weighted error rate, and when people read "rate", I think they expect a numerator and a denominator. However, I think I will add a note saying that the denominator is always equal to 1.
Cheers,
Aurélien
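To illustrate why the denominator is redundant once the weights are normalized, here is a minimal sketch of a weighted error rate. The labels, predictions and function name are hypothetical, and the formula is written from the description in this entry rather than copied from the book.

```python
def weighted_error_rate(weights, y_true, y_pred):
    """Sum of the weights of misclassified instances over the sum of all weights."""
    misclassified = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    return misclassified / sum(weights)


m = 5
weights = [1.0 / m] * m   # initial weights, which sum to 1
y_true = [1, 0, 1, 1, 0]  # hypothetical labels
y_pred = [1, 1, 1, 0, 0]  # hypothetical predictions (two mistakes)

r = weighted_error_rate(weights, y_true, y_pred)
total_weight = sum(weights)
```

Because `total_weight` is 1, `r` equals the numerator alone, here the combined weight of the two misclassified instances.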

Glenn Bruns 
Sep 16, 2017 
Nov 03, 2017 
Printed 
Page 213
Equation 8-1 
(1st Edition)
In Equation 8-1, V^T should be V.
This is often confused, because the svd() function actually returns V^T, not V.
So, I suggest changing the code below Eq 8-1
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]
c2 = V.T[:, 1]
->
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]
c2 = Vt.T[:, 1]
On the next page (p. 214), in the first sentence,
"the matrix composed of the first d columns of V^T"
should be
"the matrix composed of the first d columns of V"
Thanks
Note from the Author or Editor: Good catch, thanks!
Here is the list of changes I just made, which includes your list plus a couple more fixes:
* Top of page 213, "where V^T contains all the principal components" was changed to "where V contains all the principal components".
* Equation 8-1: replaced V^T with V.
* In all code examples, replace V with Vt. This includes 3 replacements in the code on page 213, and 1 replacement in the first code example on page 214.
* Top of page 214: "the matrix composed of the first d columns of V^T" was changed to "the matrix composed of the first d columns of V".
I also updated the corresponding notebook, and added a comment to explain the issue.
Thanks again! :)
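The V versus V^T distinction is easy to verify with NumPy. The sketch below uses a small synthetic dataset (an assumption made for illustration; the book uses its own data) and the variable names from the corrected code.

```python
import numpy as np

# Synthetic 3D dataset with most of its variance along the first axis.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3)) @ np.diag([3.0, 1.0, 0.2])
X_centered = X - X.mean(axis=0)

# np.linalg.svd returns V transposed, hence the name Vt.
U, s, Vt = np.linalg.svd(X_centered)

# The principal components are the columns of V, i.e. the rows of Vt.
c1 = Vt.T[:, 0]
c2 = Vt.T[:, 1]

# Projection onto the first two principal components.
W2 = Vt.T[:, :2]
X2D = X_centered @ W2
```

Here `c1` is the same vector as `Vt[0]`, which is why naming the third return value `V` invites confusion.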

Haesun Park 
Sep 14, 2017 
Nov 03, 2017 
Printed 
Page 223
Multiple sentences 
(1st edition)
I think w_{i,j} is not a unit vector, so the \hat is not needed.
Also, the LLE equation can be written with the squared l2 norm, but a plain square is more common.
Thanks.
Note from the Author or Editor: Interesting question. The \hat in this context indicates that the weights are the result of a first optimization (that of Equation 8-4). It does not mean that we are talking about a unit vector. So I would rather leave them in place on this page because I think it helps understand which parts of Equation 8-5 are constant (i.e., the weights \hat{w}_{i,j}) and which parts are not (i.e., the positions of the instances in the low-dimensional space, z^(i)).
However, I agree that the l2 norm is unnecessary since we are computing the square, and of course ||v||^2 is the same as v^2. I'll replace the double vertical lines (||) with parentheses.
Thanks!

Haesun Park 
Oct 07, 2017 
Nov 03, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 235
2nd paragraph 
You have "y depends on w, which depends on x".
I believe y depends on x, which depends on w.
Note from the Author or Editor: Good catch, thanks again Peter. :)
Indeed the sentence should read:
TensorFlow automatically detects that y depends on x, which depends on w, so it first evaluates w, then x, then y, and returns the value of y.

Peter Drake 
May 25, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 236
2nd paragraph 
In the Normal Equation, the left parenthesis to the left of the theta should be moved to right of the equal sign.
Note from the Author or Editor: Thanks for your feedback!
Indeed, there's a problem with the parentheses in this sentence. However the problem is actually that there is an opening parenthesis missing on the right hand side of the = sign. The text should look like this (except with nice math formatting):
[...] corresponds to the Normal Equation (theta_hat = (X^T . X)^-1 . X^T . y; see Chapter 4).
I fixed the error and pushed it to production (the digital versions should be updated within a couple weeks).
Thanks again!
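For reference, the Normal Equation with its parentheses in place can be written in NumPy as follows. This is a sketch on synthetic, noise-free data (an assumption made so that the result is exactly recoverable), not the TensorFlow code the entry refers to.

```python
import numpy as np

# Synthetic linear data: a bias column of ones plus one feature.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 2, size=100)]
true_theta = np.array([4.0, 3.0])  # hypothetical parameters
y = X @ true_theta                 # no noise, for illustration

# theta_hat = (X^T . X)^-1 . X^T . y
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
```

The opening parenthesis right after the equals sign is what groups `X.T @ X` before the inversion.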

Peter Drake 
May 25, 2017 
Jun 09, 2017 
Printed 
Page 237
code block at bottom 
In the code block at the bottom of page 237, you use tf.reduce_mean without having explained what that function does. It was easy enough to look up in the TensorFlow documentation, but it would have been helpful to have the explanation in the text, and I assume you intended to explain reduce_mean in the list of brief explanations of newly introduced functions (e.g. tf.assign) just above the code block.
Thanks.
Note from the Author or Editor: Thanks for your feedback. I initially thought that this function would be self-explanatory, given its name and the fact that it is used to compute the mean of the squared error, but I agree that the name can actually be confusing: it is somewhat unfortunate that they didn't just name it "mean()" instead of "reduce_mean()", as it's really analogous to NumPy's mean() function. To clarify this, I added the following line:

* The `reduce_mean()` function creates a node in the graph that will compute the mean of its input tensor, just like NumPy's `mean()` function.

I hope this helps.

Jeff Lerman 
Jan 09, 2018 
Oct 12, 2018 
Printed 
Page 245
last sentence 
1st edition.
In last sentence, "inside the loss namespace, ..." should be "inside the loss namescope, ..."
Note from the Author or Editor: Thanks a lot, it's a typo. I fixed it now. :)

Haesun Park 
Oct 04, 2017 
Nov 03, 2017 
PDF 
Page 246
2nd paragraph and code example 
tf.global_variable_initializers()
should be
tf.global_variables_initializer()
Note from the Author or Editor: Great catch, thanks! This error is now fixed, it was a failed find&replace, when the method `initialize_all_variables()` got renamed to `global_variables_initializer()`.
Best regards,
Aurélien

ken bame 
Feb 26, 2017 
Mar 10, 2017 
Other Digital Version 
248
4 
"Zeta is the 8th letter of the Greek alphabet"
It is the 6th letter of the Greek alphabet.
Note from the Author or Editor: Indeed, Zeta is the 6th letter of the Greek alphabet, thanks!

Oliver Dozsa 
Oct 26, 2017 
Nov 03, 2017 
Printed 
Page 252
Exercise 12, fourth bullet 
(1st edition)
In chapter 9 ex. 12, fourth bullet says "... using nice scopes...".
I think it's a typo for "... using name scopes...".
Note from the Author or Editor: Indeed, this sentence should read "name scopes" instead of "nice scopes". Thanks!

Haesun Park 
Oct 07, 2017 
Nov 03, 2017 
Printed 
Page 257
The perceptron paragraph 
In neural network literature, the artificial neuron in perceptron model is usually called Threshold Logic Unit (TLU). TLU is more common than LTU.
Note from the Author or Editor: Thanks for your feedback, indeed it seems that TLU is more common than LTU.
I tried to use "googlefight.com" to settle the dispute, but it failed, so I did a manual check:
* Google search for "threshold logic unit": 21,400 results.
* Google search for "linear threshold unit": 7,890 results.
So TLU wins hands down! :)
I also searched on Google's Ngram Viewer, and a few references to the TLU have been seen in various books, while there was no reference to LTU.
So I updated chapter 10 and the index to use Threshold Logic Unit rather than Linear Threshold Unit.
Thanks again,
Aurélien

Anonymous 
Mar 09, 2018 
Oct 12, 2018 
Printed 
Page 260
graph 
Are all the weights shown on the graph? I can't understand what you mean by the graph.
Note from the Author or Editor: Thanks for your question. The numbers on Figure 10-6 represent the connection weights. For example, if the network gets (0,0) as input (so x1=0 and x2=0), then the neuron in the middle of the hidden layer will compute -1.5*1 + 1*x1 + 1*x2 = -1.5, which is negative so it will output 0. The neuron on the right of the hidden layer will compute -0.5*1 + 1*x1 + 1*x2 = -0.5, which is negative so it will also output 0. Finally, the output neuron at the top will compute -0.5*1 - 1*0 + 1*0 = -0.5, so the final output of the network will be 0. Indeed, 0 XOR 0 = 0, so far so good.
If we try again with inputs (1, 1), we get the following computations (considering the neurons in the same order):
-1.5*1 + 1*1 + 1*1 = 0.5 => output 1
-0.5*1 + 1*1 + 1*1 = 1.5 => output 1
-0.5*1 - 1*1 + 1*1 = -0.5 => final output 0
Again, this is good because 1 XOR 1 = 0.
If we try again with inputs (0, 1), we get the following computations (again, considering the neurons in the same order):
-1.5*1 + 1*0 + 1*1 = -0.5 => output 0
-0.5*1 + 1*0 + 1*1 = 0.5 => output 1
-0.5*1 - 1*0 + 1*1 = 0.5 => final output 1
Great, that's what we wanted: 0 XOR 1 = 1.
Lastly, we can try again with inputs (1, 0), and we get the following computations:
-1.5*1 + 1*1 + 1*0 = -0.5 => output 0
-0.5*1 + 1*1 + 1*0 = 0.5 => output 1
-0.5*1 - 1*0 + 1*1 = 0.5 => final output 1
Again, that's what we wanted: 1 XOR 0 = 1.
So this network does indeed solve the XOR problem, using the weights indicated on the diagram. I'll add a note to clarify the fact that the numbers on the diagram represent the connection weights.
I hope this is clearer.
Cheers,
Aurélien
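The four cases worked through in this note can be reproduced with a few lines of Python. This is a sketch, assuming the three bias weights are -1.5, -0.5 and -0.5, that the first hidden neuron feeds the output neuron with weight -1, and that a neuron outputs 1 when its weighted sum is non-negative. The function names are illustrative.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0


def xor_network(x1, x2):
    # Hidden layer: both neurons see the bias input (1) and both inputs.
    h_first = step(-1.5 * 1 + 1 * x1 + 1 * x2)   # fires only for (1, 1)
    h_second = step(-0.5 * 1 + 1 * x1 + 1 * x2)  # fires unless (0, 0)
    # Output neuron: bias, minus the first hidden output, plus the second.
    return step(-0.5 * 1 - 1 * h_first + 1 * h_second)


outputs = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

`outputs` maps each input pair to the final output, matching the XOR truth table derived step by step above.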

calvin huang 
Jan 13, 2018 
Oct 12, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 263
line 5 
The sentence "The softmax function was introduced in Chapter 3." is incorrect; the softmax function was introduced in Chapter 4 (p. 139 of the print edition).
Note from the Author or Editor: Good catch! Indeed, the softmax function was introduced in chapter 4, not 3 (in my first draft, it was introduced in chapter 3, hence the mistake).
Thanks a lot!
Aurélien

Glenn Bruns 
Jul 05, 2017 
Aug 18, 2017 
Printed 
Page 264
3rd code paragraph 
Code example:
>>>dnn_clf.evaluate(X_test,y_test)
isn't supported
should be
>>>dnn_clf.score(X_test,y_test)
instead.
Note from the Author or Editor: Thanks for your feedback. The code works fine in TensorFlow 1.0, but it breaks in TensorFlow 1.1, because TF.Learn's API was changed significantly. I noticed this a while ago and I updated the book accordingly (I removed the paragraph about evaluation because TF.Learn seems to be a moving target), so this problem only affects people who have the first revision of the book and are using TF 1.1+.
Cheers,
Aurélien

Yevgeniy Davletshin 
Jul 05, 2017 
Aug 18, 2017 
Printed 
Page 266
middle of the page, and the first line of the code 
In p.266, the std of 2/sqrt(n_input) is used to help the algorithm converge faster.
However, from the explanation in chapter 11 (p.278), it seems like it is only true when n_input and n_output are roughly same and the activation function is Hyperbolic tangent.
Note from the Author or Editor: Great catch, thanks. I should have written 2/sqrt(n_inputs+n_neurons) or sqrt(2/n_inputs). This is He Initialization, to be used with the ReLU activation function (the latter would be okay when n_inputs is equal or close to n_outputs). In practice, for shallow networks (such as the ones in chapter 10) it's not a big deal if initialization is not perfect. It's much more important for deep nets.
I'll fix chapter 10, thanks again for your contribution!
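The two standard deviations mentioned in this note coincide when a layer has as many inputs as neurons. A quick sketch shows this; the function names and layer sizes are hypothetical and only the two formulas come from the note.

```python
import math


def stddev_fan_avg(n_inputs, n_neurons):
    """2 / sqrt(n_inputs + n_neurons), the first formula in the note."""
    return 2.0 / math.sqrt(n_inputs + n_neurons)


def stddev_fan_in(n_inputs):
    """sqrt(2 / n_inputs), the second formula in the note."""
    return math.sqrt(2.0 / n_inputs)


# Same number of inputs and neurons: both formulas give the same value.
same_size = (stddev_fan_avg(300, 300), stddev_fan_in(300))

# Very different sizes (hypothetical 784 -> 300 layer): the values diverge.
different_size = (stddev_fan_avg(784, 300), stddev_fan_in(784))
```

This is why the second formula is described as acceptable only when n_inputs is equal or close to the number of outputs.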

Joshua Min 
Aug 16, 2017 
Nov 03, 2017 
Printed 
Page 268
Note 
(First Edition)
In Note, "... corner case like logits equal to 0."
I think that corner case softmax's output equal to 0 or logits far less than 0.
In cross entropy p*log(q), as you may know, q is softmax's output.
Note from the Author or Editor: Good catch! I replaced this sentence with this: "[...] and it properly takes care of corner cases: when logits are large, floating point rounding errors may cause the softmax output to be exactly equal to 0 or 1, and in this case the cross entropy equation would contain a log(0) term, equal to negative infinity. The `sparse_softmax_cross_entropy_with_logits()` function solves this problem by adding a tiny epsilon value to the softmax output.".
Thanks Haesun!

Haesun Park 
Oct 21, 2017 
Nov 03, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 269
2nd paragraph 
"one minibatches" should be "one minibatch"
Note from the Author or Editor: Good catch, thanks. I fixed the error.

Peter Drake 
May 25, 2017 
Jun 09, 2017 
Printed 
Page 269
Last paragraph 
"...the code evaluates the model on the last minibatch and on the full training set, and..."
should read
"...the code evaluates the model on the last minibatch and on the full test set, and..."

Adam Chelminski 
May 31, 2017 
Jun 09, 2017 
Printed 
Page 269
last code block 
(First Edition)
In execution phase, training loop uses mnist.test data.
As you may know, it's not good practice.
I suggest changing it to mnist.validation for most readers and evaluating on the test set after the for-epoch loop.
Best,
Haesun. :)
Note from the Author or Editor: Thanks for your feedback. For a second I thought you were saying that I trained the model on the test set! :)
The training loop uses mnist.train for training, and shows the progress by evaluating the model on the test set. I agree with you that it would be better to use the validation set for this purpose. I'm updating the notebook and the book.

Haesun Park 
Oct 26, 2017 
Nov 03, 2017 
Printed, PDF, ePub 
Page 278
Table 11-1 
You introduce the Xavier and He initialization schemes for both Uniform[-r, r] and Normal(0, sigma^2).
1. I think the order of listing is reversed between logistic and tanh.
2. This is a minor issue of typeset, but it keeps confusing me that,
the number '4' in front of the initialization factors of 'Hyperbolic Tangent' (current unfixed version) looks like the fourth root. Could you increase its size a little bit more in next revision?
3. This is a question as a new beginner in this field.
He, et al. commented in their paper [arXiv:1502.01852] that
"We note that it is sufficient to use either Eqn.(14) or Eqn.(10) alone. For example, if we use Eqn.(14), then the product in Eqn.(13), the product (...)=1, and in Eqn.(9) the product (...) =c2/dL , which is not a diminishing number in common network designs. This means that if the initialization properly scales the backward signal, then this is also the case for the forward signal; and vice versa."
My question here is on the compromise version in your Table 11-1 for ReLU. Instead of using just either n_in or n_out, what benefit does (n_in+n_out)/2 give me? Numbers in the original version exactly cancel along the whole tower of layers, and give an exactly stable (fixed) variance of gradients (or inputs, depending on the choice between n_in and n_out).
I think this makes a big difference when the size of layers changes a lot. I am just a beginner so I have no idea how many times I encounter such cases in real problems. How is the geometric mean as an alternative?
Cheers,
Note from the Author or Editor: Thanks for your feedback!
1. You are right, I inverted the equations for Logistic and Hyperbolic Tangent, I just fixed this. Great catch!
2. I'm not sure how to increase the size of the font of the number 4, but I added a small space between it and the square root, hopefully it will avoid confusion.
3. That's a good question, I'm not sure whether using (n_in+n_out)/2 or just n_in or n_out is preferable. My intuition is that the former is better, but I don't have data to back that up, it would be interesting to run some experiments. I might try that when I get the chance.

Doyoun Kim 
Nov 12, 2018 
Dec 07, 2018 
PDF 
Page 279
2nd Paragraph 
Minor grammar issue that you might want to fix in the 2nd paragraph ('Nonsaturating Activation Functions'.)
.., it will start outputting 0. When this happen, the neuron ...
should be
.., it will start outputting 0. When this happens, the neuron ...
Thanks for a thoroughly enjoyable and informative book!
Note from the Author or Editor: Nice catch, thanks! I just fixed this, future reprints and digital editions should be fine.

Vineet Bansal 
Aug 21, 2017 
Nov 03, 2017 
Printed 
Page 281
Book 2nd release, 3rd list bullet 
The assertion
"the function is smooth everywhere, including around z = 0"
is only true if alpha = 1.
Note from the Author or Editor: Good point, you are absolutely right. I corrected this sentence like this:
Third, if alpha is equal to 1 then the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.
Thanks for your feedback!
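The condition on alpha can be checked by comparing the one-sided slopes at z = 0. The sketch below assumes the function under discussion is the ELU, defined as alpha * (exp(z) - 1) for z < 0 and z otherwise; the entry itself does not name the function, so treat that definition as an assumption.

```python
import math


def elu(z, alpha=1.0):
    """Assumed ELU definition: alpha * (exp(z) - 1) if z < 0, else z."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)


def one_sided_slopes(alpha, h=1e-6):
    """Finite-difference slopes just left and just right of z = 0."""
    left = (elu(0.0, alpha) - elu(-h, alpha)) / h
    right = (elu(h, alpha) - elu(0.0, alpha)) / h
    return left, right


slopes_alpha_1 = one_sided_slopes(alpha=1.0)     # both close to 1: no kink
slopes_alpha_half = one_sided_slopes(alpha=0.5)  # about 0.5 and 1: kink at 0
```

The left slope approaches alpha while the right slope is 1, so the two match only when alpha is 1.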

Paolo Baronti 
Jan 10, 2018 
Oct 12, 2018 
Printed 
Page 288
the code for reusing variables 
In "reuse_vars_dict", var.name was repeated twice instead of (var.op.name, var) as was shown in the Jupyter notebook, but more importantly I think this line is redundant since feeding the saver with "reuse_vars" will lead to the same result: the new model will use the variables in hidden layers 1-3 under their old names.
Note from the Author or Editor: Thanks for your feedback! I actually fixed this error a few months ago, so the latest releases contain (var.op.name, var) instead of (var.name, var.name). However, I did not realize that I could just get rid of this line, that's nice! I just did, both in the book and in the Jupyter notebook.
Thanks again!
Aurélien

Anonymous 
Mar 21, 2018 
Oct 12, 2018 
Printed 
Page 295
Equation 115 
I think the description of the Nesterov Accelerated Gradient equation is not correct.
In short, the sign of \theta+\beta m inside the gradient in eq. 1 is wrong.
Long version:
Under a strong convexity assumption, Nesterov acceleration can be viewed as the incremental version of momentum acceleration. If you put equations 1 and 2 in the book together, you will get:
\theta = \theta - \beta m - \eta \nabla J (\theta + \beta m)
Notice the mismatch between (\theta - \beta m) and (\theta + \beta m) in the gradient.
Because according to the author's notation, m is the accumulated estimate of the gradient (not the NEGATIVE gradient), the gradient should be evaluated at \theta - \beta m. Thus, in my opinion, the correct equations should be:
1. m <- \beta m + \eta \nabla J(\theta - \beta m)
2. \theta <- \theta - m
Hope this is helpful.
Note from the Author or Editor: Good catch, thanks. Indeed, I flipped the signs, so the steps should be:
1. m := beta * m - eta * gradient_at(theta + beta * m)
2. theta := theta + m
Latexmath:
\begin{split}
1. \quad & \mathbf{m} \gets \beta \mathbf{m} - \eta \nabla_\mathbf{\theta}J(\mathbf{\theta} + \beta \mathbf{m}) \\
2. \quad & \mathbf{\theta} \gets \mathbf{\theta} + \mathbf{m}
\end{split}
I often see m interpreted as the negative gradient, in which case the equations would be the following (that's what I was aiming for):
1. m <- beta * m + eta * gradient_at(theta - beta * m)
2. theta <- theta - m
However, I double checked: the figures and the text do not assume that m is the negative momentum, so I fixed the book as you suggested (and I also flipped the signs in the momentum optimization equations for consistency).
Thanks again, I very much appreciate your help,
Aurélien
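As a sanity check on the signs, the first pair of update steps in this note can be run on a toy cost function. This is a sketch under stated assumptions: the minus sign in front of eta in step 1 is assumed, and the cost J(theta) = theta^2, the starting point and the hyperparameters are arbitrary illustrative choices.

```python
def nesterov_minimize(gradient_at, theta, eta=0.1, beta=0.9, n_steps=200):
    """Apply m := beta*m - eta*grad(theta + beta*m), then theta := theta + m."""
    m = 0.0
    for _ in range(n_steps):
        # The gradient is measured slightly ahead, at theta + beta * m.
        m = beta * m - eta * gradient_at(theta + beta * m)
        theta = theta + m
    return theta


# Toy cost J(theta) = theta**2, whose gradient is 2 * theta.
theta_final = nesterov_minimize(lambda th: 2.0 * th, theta=5.0)
```

With consistent signs the iterate settles at the minimum (theta = 0). Mixing the conventions, for example subtracting m while also looking ahead at theta + beta * m, is the mismatch described in the report.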

Bicheng Ying 
Jun 12, 2017 
Aug 18, 2017 
Printed 
Page 298
RMSProp section 
rmsprop optimizer has the momentum=0.9 argument, however a momentum term is not included in equations 11.7
Note from the Author or Editor: Thanks for your feedback. Indeed, the "raw" RMSProp algorithm, as presented on slide 29 of Geoffrey Hinton's 6th Coursera lecture (https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) does not use momentum, so that's what I presented, but indeed TensorFlow's implementation does add the option to combine it with momentum optimization (regular, not Nesterov). This was suggested by Hinton on slide 30 ("Further developments of rmsprop").
I will clarify this for the next releases, thanks again for your feedback.
Cheers,
Aurélien

Anonymous 
Mar 27, 2018 
Oct 12, 2018 
Printed 
Page 299
Equation 11-8 Adam algorithm 
In steps 3 and 4 of the Adam algorithm in the book, the terms 'm' and 's' are updated.
According to the original Adam paper (https://arxiv.org/abs/1412.6980), they should not be updated in place. The bias-corrected versions of 'm' and 's' should only be used to calculate the next value of theta.
Note from the Author or Editor: Great catch Zhao! Indeed, I forgot the hats in steps 3, 4 and 5:
3. \hat{m} <- m / (1 - {\beta_1}^t)
4. \hat{s} <- s / (1 - {\beta_2}^t)
5. \theta <- \theta + \eta \hat{m} \oslash \sqrt{\hat{s} + \epsilon}
Thanks again,
Aurélien
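A minimal scalar sketch of why the hats matter: the bias-corrected values are temporaries used only for the parameter update, while the raw m and s are carried over to the next iteration. Steps 1 and 2 are not quoted in this erratum, so their form here (m accumulating the negative gradient, consistent with the plus sign in step 5) is an assumption, as are the function name adam_step and the default hyperparameters.

```python
import math


def adam_step(theta, m, s, t, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a scalar parameter; t is the 1-based iteration number."""
    # Steps 1 and 2 (assumed form): update the running averages.
    m = beta1 * m - (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad
    # Steps 3 and 4: bias-corrected copies, NOT written back into m and s.
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Step 5: the update uses only the hatted values.
    theta = theta + eta * m_hat / math.sqrt(s_hat + eps)
    return theta, m, s


theta, m, s = adam_step(theta=1.0, m=0.0, s=0.0, t=1, grad=2.0)
print(theta, m, s)
```

Returning the uncorrected m and s is the point of the correction: overwriting them with the hatted values would apply the bias correction again at every later step.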

Zhao yuhang 
Feb 25, 2018 
Oct 12, 2018 
Printed 
Page 300
Figure 11-6 
Your example of how the Nesterov update converges faster is wrong, in my opinion. The example is a function of one variable (theta). Consequently, the gradient (and the momentum, too) is one-dimensional at each point. But you draw them as 2D tangential vectors, which leads to your wrong assumption that the Nesterov update gets closer to the optimum in this example.
Why your assumption is wrong:
You can clearly see in the graph that
- eta * gradient1 > - eta * gradient2 > 0
and
beta * m > 0. (looking at the x-component)
This leads to
beta * m - eta gradient1 > beta * m - eta gradient2 > 0
and
beta * m - eta gradient1 > beta * m - eta gradient2 > 0
which is a clear contradiction to your drawing.
There are real examples where the Nesterov update is better than the regular momentum update:
- It crosses a local minimum / stationary point faster.
- If the regular momentum update goes farther than the optimum, the Nesterov update does not go as far away from the optimum (in some situations).
#stilllovingyourbook
Note from the Author or Editor: Excellent catch, thanks! I tried to fix the figure while keeping the cost function 1D, but it looked bad, and it didn't make Nesterov Accelerated Gradient seem very useful at all, so I ended up changing the figure altogether to make the cost function 2D. Hopefully it should be in the tenth release of the 1st edition (which should come out very shortly, in December 2018).
Thanks again!

Niclas von Caprivi 
Sep 03, 2018 
Dec 07, 2018 
PDF 
Page 305
First line of last paragraph 
The second sentence of the last paragraph begins: Suppose p = 50, ....
Since p is a probability with a value from 0 to 1, it would be nice to state it explicitly as p = 50% or 0.5 to avoid ambiguity.
Note from the Author or Editor: You are right, there's a % sign missing, it should read "suppose p = 50%".
Thanks!

Denis Oyaro 
May 27, 2017 
Jun 09, 2017 
PDF 
Page 313
1st paragraph 
For further callback method names, looks like "on_epoch_begin()" is there twice, but no "..._end". Same for "on_batch_end()" where there's no "_begin". A copy/paste mixup?
Note from the Author or Editor: Great catch, thanks!
The sentence should be:
As you might expect, you can implement `on_train_begin()`, `on_train_end()`, `on_epoch_begin()`, `on_epoch_end()`, `on_batch_begin()` and `on_batch_end()`.

Gregory Deal 
Jun 12, 2019 

Printed 
Page 322
Figure 12-5 
(1st Edition)
In Figure 12-5, both the CPU and the GPU have inter-op and intra-op thread pools.
But AFAIK, inter-op and intra-op thread pools are for the CPU only.
Refer to https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu and https://stackoverflow.com/questions/41233635/tensorflowinterandintraopparallelismconfiguration
Please check this again.
Thank you.
Note from the Author or Editor: Thanks for your feedback. This is a great question! I wasn't quite sure about this when I was writing this chapter, so I asked the TensorFlow team, and here is the answer I got from one of the team leads:
"""
[...]in my experience parallelism isn't very significant to GPU ops, since most of the acceleration is achieved under the hood with libraries like cudnn that do intra-op parallelism automatically, and [...] tend to take over the whole GPU.
As far as your diagram goes, I believe that we might support running multiple GPU threads through separate executor streams via StreamExecutor, but it's generally not a good idea from a performance point of view.
"""
So, my understanding was that, on the GPU, the intra-op thread pool exists, although it is managed by libraries such as cuDNN rather than by TensorFlow itself. I decided that this was an implementation detail (after all, TensorFlow is based on cuDNN), so I included the intra-op thread pool on the diagram, but it's true that it is not a configurable thread pool, contrary to the CPU inter-op thread pool.
Since TensorFlow must run operations in the proper order when there are dependencies, and since it manages execution using an inter-op thread pool for the CPU, I assumed that it must be the case as well for GPUs.
However... reading your question got me thinking about this some more, and I realized I could simply run a test. The conclusion is that I was wrong: there does NOT seem to be an inter-op thread pool for GPUs. TensorFlow just decides on a particular order of execution (deterministically, based on the dependency graph), then it runs the operations sequentially (however, each operation may have a multithreaded implementation).
So I will update this diagram and the corresponding paragraph.
I don't think it's a severe error, because it won't change much for users in terms of code, but it's a very useful clarification to avoid confusion.
I published the code of my experiment in this gist, in case you are interested:
https://gist.github.com/ageron/b378479efdf7e501bd270d032000fcc1
Thanks a lot!
Cheers,
Aurélien

Haesun Park 
Dec 04, 2017 
Jan 19, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 328
Code section under Pinning Operations Across Tasks 
Missing colon in the with statement below
with tf.device("/job:ps/task:0/cpu:0")
    a = tf.constant(1.0)
with tf.device("/job:worker/task:0/gpu:1")
    b = a + 2
Note from the Author or Editor: Good catch! Thanks. Indeed, the code sample should look like this:
with tf.device("/job:ps/task:0/cpu:0"):
    a = tf.constant(1.0)
with tf.device("/job:worker/task:0/gpu:1"):
    b = a + 2
c = a + b
Thanks a lot,
Aurélien

Hei 
Jun 15, 2017 
Aug 18, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 333
Equation 10-2. Perceptron learning rule 
The page number of the error is not exact since I have the Kindle (azw) version.
The error is in Chapter 10, Equation 10-2 (Perceptron learning rule).
W(next step) = W + eta(y_hat - y)x # (estimation - true_label)
should be
W(next step) = W + eta(y - y_hat)x # (true_label - estimation)
Note from the Author or Editor: Good catch, indeed this is a mistake. Equation 10-2 should have target - estimation rather than estimation - target. In latex math, the equation should be:
{w_{i,j}}^{(\text{next step})} = w_{i,j} + \eta (y_j - \hat{y}_j) x_i
rather than:
{w_{i,j}}^{(\text{next step})} = w_{i,j} + \eta (\hat{y}_j - y_j) x_i
Thank you!
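To make the sign convention concrete, here is a small plain-Python sketch of the corrected rule for one training instance. The function name perceptron_update, the nested-list weight layout (w[i][j] connects input i to output j), and the example values are assumptions for illustration.

```python
def perceptron_update(w, x, y, y_hat, eta=0.1):
    """Apply w[i][j] += eta * (y[j] - y_hat[j]) * x[i] and return the new weights."""
    return [
        [w[i][j] + eta * (y[j] - y_hat[j]) * x[i] for j in range(len(y))]
        for i in range(len(x))
    ]


# The target is 1 but the perceptron predicted 0, so the weights of the
# active inputs must increase (target - estimation is positive).
w = [[0.0], [0.0]]
new_w = perceptron_update(w, x=[1.0, 2.0], y=[1], y_hat=[0])
print(new_w)
```

With the signs reversed (estimation - target), the same mistake would push the weights in the wrong direction and make the prediction worse.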

Lee, Hyun Bong 
Apr 30, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 333
Code section at the top 
In the second line of the code, it should call q.enqueue_many() instead of q.enqueue()
So, that line should be:
enqueue_many = q.enqueue_many([training_instances])
Note from the Author or Editor: Good catch! The text says you can use "enqueue_many" and then I use "enqueue" in the code example, I was probably out of coffee. ;) That line of code should be:
enqueue_many = q.enqueue_many([training_instances])
Thanks a lot,
Aurélien

Hei 
Jun 19, 2017 
Aug 18, 2017 
PDF 
Page 337
Above 'Closing a queue' 
1st edition, 5th release.
In code block above 'Closing a queue',
dequeue_a, dequeue_b should be dequeue_as, dequeue_bs.
Thanks
Note from the Author or Editor: Good catch! That's a typical copy/paste error, sorry about that. Indeed, it should be dequeue_as and dequeue_bs, instead of dequeue_a and dequeue_b. Thanks a lot.

Haesun Park 
Feb 08, 2018 
Oct 12, 2018 
PDF 
Page 342
2nd paragraph from bottom 
In the 2nd edition, p. 342, you mention "shear luck". I think this should be "sheer luck", unless sheep have some effect I never heard of!
Note from the Author or Editor: Haha, good catch! :)
It should indeed be "sheer luck".
Cheers!

Gregory Deal 
Jul 04, 2019 

Safari Books Online 
360
Exercises 82 
Thanks for this excellent book.
I am interested in particular in distributing TensorFlow. Unfortunately, there is no solution online for exercises 8-10 of Chapter 12.
Do you plan to complete the corresponding notebook?
Thanks,
Giovanni
Note from the Author or Editor: Thanks for your feedback. Yes, sorry about that, exercise solutions took me way more time than I initially planned, and this chapter was a bit tricky because it required getting the user to set up various infrastructures (TF Serving, GCP, TF cluster, and so on). I chose to focus on the other chapters first, and never reached this one.
However, I recently answered a question about this topic on github:
Please take a look at my TF2 course notebooks at https://github.com/ageron/tf2_course
In particular 03_loading_and_preprocessing_data.ipynb and 04_deploy_and_distribute_tf2.ipynb.
There are two main scenarios when you go to the cloud:
* Running: you have already trained the model locally and you just want to run a web service that executes it.
* Training: you want to train your model at a large scale on the Cloud.
Running a trained model on GCP is not too hard. First, learn to deploy on TF Serving (as shown in the notebook), then basically you can use GCP as a hosted TF Serving.
For training (e.g., on TPU), check out this Colab notebook:
https://colab.research.google.com/github/GoogleCloudPlatform/trainingdataanalyst/blob/master/courses/fastandleandatascience/01_MNIST_TPU_Keras.ipynb
Hope this helps,
Aurélien

Giovanni 
Feb 02, 2019 
Mar 08, 2019 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 361
Under # Create 2 filters comment 
He means to define the line filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32), but he calls the variable filters_test, like the two lines below it. The Jupyter notebook doesn't make that mistake, though.
Note from the Author or Editor: Good catch, thanks! I probably renamed the variable at one point and missed a few occurrences, sorry about that.
This is now fixed, but it may take some time to propagate to production.

Joseph Vero 
Apr 30, 2017 
Jun 09, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 361
Ch 13, Paragraph after Fig 13-6 
Text says

Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sw to i × sw + fw – 1 and columns j × sh to j × sh + fh – 1, across all feature maps (in layer l – 1).
Concern

The book defines sw as horizontal stride, and sh as vertical stride. Cool.
My intuition is that the horizontal stride changes the feature map's number of columns, and the vertical stride changes the feature map's number of rows.
Should it be:
a) the horizontal stride sw (not sh) should affect the column ranges?
b) the vertical stride sh (not sw) should affect the row ranges?
Correction

Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sh to i × sh + fw – 1 and columns j × sw to j × sw + fh – 1, across all feature maps (in layer l – 1).
Please forgive me if I'm wrong. Just plodding through the book and doing 'back of the envelope' calculations/exercises as I go.
Regards,
dre
Note from the Author or Editor: Good catch, this is indeed an error, my apologies. Moreover, it helped me find an error in Equation 13-1. I double-checked the rest of pages 357-361 and they seem fine to me.
The sentence at the bottom of page 361 should be:
Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l - 1, located in rows i x sh to i x sh + fh - 1 and columns j x sw to j x sw + fw - 1, across all feature maps (in layer l - 1).
Equation 13-1 should be (using latexmath):
z_{i,j,k} = b_k + \sum\limits_{u = 0}^{f_h - 1} \, \, \sum\limits_{v = 0}^{f_w - 1} \, \, \sum\limits_{k' = 0}^{f_{n'} - 1} \, \, x_{i', j', k'} . w_{u, v, k', k}
\quad \text{with }
\begin{cases}
i' = i \times s_h + u \\
j' = j \times s_w + v
\end{cases}
The difference is that u, v and k' must be zero-indexed, and i' = i x sh + u instead of i' = u x sh + fh - 1, and similarly j' = j x sw + v instead of j' = v x sw + fw - 1.
You can view the updated equation (and all equations in the book) at:
http://nbviewer.jupyter.org/github/ageron/handsonml/blob/master/book_equations.ipynb
Thank you very much for your help,
Aurélien Géron
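The corrected sentence can be checked with a few lines of Python. This is only a sketch: the function name receptive_field and the example strides and kernel sizes are assumptions, while the index formulas are the ones given in the note above (rows depend on the vertical stride sh and height fh, columns on the horizontal stride sw and width fw).

```python
def receptive_field(i, j, s_h, s_w, f_h, f_w):
    """Return the inclusive (first, last) row and column ranges in layer l - 1
    that feed the neuron at row i, column j of layer l."""
    rows = (i * s_h, i * s_h + f_h - 1)
    cols = (j * s_w, j * s_w + f_w - 1)
    return rows, cols


rows, cols = receptive_field(i=2, j=3, s_h=2, s_w=1, f_h=3, f_w=5)
print(rows, cols)  # (4, 6) (3, 7)
```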

andre trosky 
Jun 22, 2017 
Aug 18, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 362
1st para after Tensorflow Implementation title 
Text

The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn, fn′].
Concern

On the same page above (p263), fn' is defined as the number of feature maps in the previous (l - 1) convolutional layer. Let's assume then that fn is the number of feature maps in convolutional layer l.
The Tensorflow API for tf.nn.conv2d has the 'filter' parameter defined as
[filter_height, filter_width, in_channels, out_channels].
Which using your current nomenclature means the text should read:
Correction

The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn', fn].
Additional

The TF implementation code on p363 defines the variable named 'filters' as:
[...]
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
[...]
Meaning it does adhere to the TF API for tf.nn.conv2d.
Note from the Author or Editor: Good catch! Yes indeed, it should read:
The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn', fn].
Thank you!
Aurélien

andre trosky 
Jun 23, 2017 
Aug 18, 2017 
Printed, Safari Books Online 
Page 365
TIP below Figure 13-9 
In the TIP box below Figure 13-9,
I think that stacking two 3 x 3 kernels has the same effect as a 5 x 5 kernel, not a 9 x 9 kernel.
Two 3 x 3 convolutions have a 5 x 5 effective receptive field.
Thanks.
Note from the Author or Editor: Great catch, thanks! I fixed the tip like so:
A common mistake is to use convolution kernels that are too large. For example, instead of using a convolutional layer with a 5 × 5 kernel, it is generally preferable to stack two layers with 3 × 3 kernels: it will use less parameters and compute, and usually perform better.
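The arithmetic behind this tip can be sketched as follows, assuming stride 1, no dilation, and ignoring bias terms. The helper names and the choice of 64 input and output channels are illustrative assumptions.

```python
def stacked_receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 convolutions:
    each k x k layer widens the field by k - 1."""
    size = 1
    for k in kernel_sizes:
        size += k - 1
    return size


def conv_weights(kernel_size, in_channels, out_channels):
    """Number of kernel weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


print(stacked_receptive_field([3, 3]))  # 5
print(conv_weights(5, 64, 64))          # 102400 weights for one 5 x 5 layer
print(2 * conv_weights(3, 64, 64))      # 73728 weights for two 3 x 3 layers
```

Under these assumptions the two stacked 3 x 3 layers cover the same 5 x 5 region with 18 weights per channel pair instead of 25.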

Haesun Park 
Aug 30, 2018 
Dec 07, 2018 
Printed 
Page 366
1st paragraph 
On the fourth line, the sentence says "it also create the bias variable (named bias) and initializes it with zeros". I believe the word "create" should be changed to "creates" adding an "s".
Note from the Author or Editor: Nice catch, thanks! It should indeed say "It also creates the bias variable" rather than "It also create the bias variable".
I just fixed the error.
Cheers,
Aurélien

Zoe Wexler 
Apr 25, 2018 
Oct 12, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 383
Ch 14, Equation 14-1. Output of a single recurrent neuron for a single instance 
Eq 14-1 implies:

The value y(t) is a vector quantity.
Concern

The way I understand Figure 14-2 is that each neuron in the recurrent layer outputs a single scalar value per timestep, and these scalars make up the vector quantity y.
Specifically, each element of the vector y comes from the output of exactly one neuron in the recurrent layer.
But the single-neuron Equation 14-1 implies that y(t) is a vector quantity.
Dimensional analysis of Eq 14-1 requires the value of y(t) to be a scalar if:
1. bias b is a scalar and
2. x(t) and w_x and y_(t-1) and w_y are vectors
Eq 14-1 Correction

y(t) should not be bold face, implying that it's a scalar quantity specific to one neuron.
Note from the Author or Editor: Once again, good catch! My intention was actually to show the equation for a whole recurrent layer on a single instance (i.e., on one input sequence), not for a single neuron. So the equation is correct but the title is wrong. It should have been:
Equation 14-1. Output of a recurrent layer for a single instance
I will also fix the sentence introducing this equation, replacing "single recurrent neuron" with "recurrent layer":
The output of a recurrent layer can be computed pretty much as you might expect, as shown in Equation 14-1 [...]
Thanks for your very helpful feedback,
Aurélien Géron

andre trosky 
Jun 25, 2017 
Aug 18, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 383
Ch 14, explanation of terms in Eq 14-2 
Text says

b is a vector of size n_neurons containing each neuron's bias term.
Concern

Using this definition of b, the only way to properly add all the terms inside Eq 14-2 is to broadcast the bias term. Otherwise we're adding terms of different shapes.
The text does not mention this explicitly, which can be confusing if you don't know what's 'going on under the hood', i.e., the broadcasting of b.
Let's assume the shape of the bias term is (1, n_neurons), therefore having a size of n_neurons. In Eq 14-2 (the first line), the other two product terms inside the activation function result in a shape of:
1. Shape of X_(t) . W_x is (m, n_neurons)
2. Shape of Y_(t-1) . W_y is (m, n_neurons)
This requires the bias term to also be of shape (m, n_neurons), so we broadcast m times along b's first dimension.
(This broadcasted shape of b also works in the second line of Eq 14-2.)
Correction

Maybe mention that the bias is being broadcasted (for those of us who are unfamiliar with it), or otherwise change the definition of its shape to be (m, n_neurons)?
Note from the Author or Editor: That's a great point. In fact, I should have mentioned this earlier, in chapter 10, the first time we use broadcasting when adding a bias vector. I just added the following sentence at the end of point 5 at the bottom of page 266:
Note that adding a 1D array (*b*) to a 2D matrix with the same number of columns (*X* . *W*) results in adding the 1D array to every row in the matrix: this is called _broadcasting_.
Thanks a lot,
Aurélien Géron
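For readers unfamiliar with broadcasting, the sentence added to the book can be spelled out in plain Python. NumPy and TensorFlow do this automatically when a 1D array is added to a 2D matrix; the explicit helper below (add_bias is a hypothetical name) only shows what the result looks like.

```python
def add_bias(XW, b):
    """Add the 1D bias b (one value per neuron) to every row of the 2D
    matrix XW (one row per instance), as broadcasting would."""
    return [[value + b[col] for col, value in enumerate(row)] for row in XW]


XW = [[1, 2, 3],   # instance 0, n_neurons = 3
      [4, 5, 6]]   # instance 1
b = [10, 20, 30]   # one bias term per neuron
print(add_bias(XW, b))  # [[11, 22, 33], [14, 25, 36]]
```

So b keeps its size of n_neurons, and the (m, n_neurons) shape discussed above is only the shape it behaves as during the addition.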

andre trosky 
Jun 25, 2017 
Aug 18, 2017 
Other Digital Version 
386
Jupyter notebook 
The problem I observed was actually with the Jupyter notebook "14_recurrent_neural_networks.ipynb" currently (2017 July 13) on GitHub, but the code in question corresponds approximately to the text on page 386 of the printed book (illustrating the "static_rnn()" function).
Specifically, the output of "In [14]:" (show_graph(tf.get_default_graph())), which is supposed to be a graph of some kind, is instead a big empty space (1200 px X 620 px).
Similarly, the output of "In [26]:", in code demonstrating the result of "dynamic_rnn()", is also a big blank space.
Looking at the Firefox webdeveloper "Console" window, I see two JavaScript logging items which seem to say that "HTML Sanitizer" has changed the "iframe.srcdoc" value from what appears to be meaningful data to "null". Specifically, code in "/notebook/js/main.min.js" seems to be the place doing the sanitizing.
Configuration: Windows 7 64-bit, Firefox 48.0, Anaconda3 version 4.4.0 (20170511), Python 3.6.1, Jupyter 5.0.0, TensorFlow 1.2.1. So, some of the package versions are later than the book, but I think the issue here is worth investigating.
Aside: This particular notebook ("14_recurrent_neural_networks.ipynb") currently contains a few more minor problems: "In [69]:", "In [77]:", and "In [103]" all call functions which begin with "rnd". However, while it seems previous versions of the notebook included a statement "import numpy.random as rnd", the code has evidently been changed so that "rnd" is no longer defined. Changing the three instances of "rnd" to "numpy.random" fixes all three problems, enabling the entire notebook to be executed in Jupyter. The problem mentioned at the top of this note, namely the blank graph areas, remains, but it does not cause the notebook to stall execution midway, perhaps because the operation succeeded but was "sanitized" away.
Note from the Author or Editor: Thanks for your feedback. I just fixed the `rnd` issue in the Jupyter notebook, and I pushed the updated notebook to github (FYI, I use these imports so often that I added them to my python startup script, which is why I was not getting any error).
Regarding the `show_graph()` function, it does not seem to work across all browsers, unfortunately. I use Chrome, and the graph is displayed just fine, but some people have reported that it fails on Firefox, indeed. I'll try to find a way to make it work in Firefox, but in the meantime, the official way to visualize a TensorFlow graph is to use TensorBoard (see chapter 9).

Colin Fahey 
Jul 13, 2017 
Aug 18, 2017 
Printed 
Page 386
Last paragraph 
"... each with an input sequence composed of exactly two inputs..."
Shouldn't it be three inputs? The minibatches are 4 by 3. If it is two and I got it wrong, then perhaps it should be clarified.
Note from the Author or Editor: Thanks for your question. The text is correct, it is exactly two inputs, but I changed the wording to clarify:
BEFORE:
This minibatch contains four instances, each with an input sequence composed of exactly two inputs.
AFTER:
This minibatch contains four instances, where each instance is a sequence composed of exactly two 3D inputs. For example, the first instance is the sequence [0, 1, 2], [9, 8, 7].
I hope this is clearer.

Juan Manuel Parrilla Gutierrez 
Nov 05, 2018 
Dec 07, 2018 
Printed 
Page 395
Figure 14-8 
OutputConnectionWrapper should be OutputProjectionWrapper.
Note from the Author or Editor: Good catch, indeed this was a typo: it's not OutputConnectionWrapper but OutputProjectionWrapper. The notebook was okay though. I fixed the book.
Thanks!

Anonymous 
Sep 29, 2017 
Nov 03, 2017 
PDF 
Page 405
6th line of the code 
reuse_vars_dict = dict([(var.name, var.name) for var in reuse_vars])
should be:
reuse_vars_dict = dict([(var.name, var) for var in reuse_vars])
Note from the Author or Editor: Good catch, thanks! Indeed, it should read:
reuse_vars_dict = dict([(var.name, var) for var in reuse_vars])
I've updated the book, it should be live within a few weeks for the digital versions.

James Wong 
May 17, 2017 
Jun 09, 2017 
Printed 
Page 407
Equation 14-4 
The equation for h_t appears to be incorrect. Instead of
h_t = (1 - z_t) * h_(t - 1) + z_t * g_t
the Cho et al. (2014) paper has in equation 7
h_t = z_t * h_(t - 1) + (1 - z_t) * g_t
Accordingly, the “1-” unit in Figure 14-14 on p. 406 should be moved right, to the path leading from z_t to the multiplication with the output of g_t. (And the label for g_t is missing.)
Your visualizations of the RNN cells are a great help, and are much appreciated!
Note from the Author or Editor: Thanks for your feedback. You are right that my graph & equations inverted z_t and 1 - z_t. Fortunately, the GRU cell works fine either way. Indeed, the z gate is trying to learn the right balance between forgetting old memories (let's call this f) and storing new ones (let's call this i). In a GRU cell, f = 1 - i. In the paper, the z gate outputs f, while in my book, it outputs i. Either way, the right balance will be found just as well.
If you want an analogy, it's as if you were learning how empty a glass should be, while I was learning how full it should be. The net result is the same, but somehow I find the latter a bit more natural. ;) That said, even though "my" equations will work fine, I will fix them so that people don't get confused when they see other implementations or read the paper.

Nick Pogrebnyakov 
Oct 10, 2017 
Nov 03, 2017 
Printed 
Page 420
1st sentence 
(1st edition)
In first bullet above 'Training One Autoencoder at a Time',
"First, weight3 and ..."
should be
"First, weights3 and ..."
Thanks.
Note from the Author or Editor: Yet another good catch, thanks Haesun. Fixed to weights3.

Haesun Park 
Dec 26, 2017 
Oct 12, 2018 
Other Digital Version 
422
last paragraph 
Jupyter Notebook: 15_autoencoders
cell: Unsupervised pretraining, In [30]
Change
weights3_init = initializer([n_hidden2, n_hidden3])
to
weights3_init = initializer([n_hidden2, n_outputs])
and
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
to
biases3 = tf.Variable(tf.zeros(n_outputs), name="biases3")
I am also wondering why
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
is not causing an error.
Note from the Author or Editor: Nice catch! Indeed, this is a typo. It does not explode because n_hidden3 is defined in [23], and it is equal to 300. So the network has 300 outputs instead of 10. The function sparse_softmax_cross_entropy_with_logits() does not explode because it expects the target labels to be between 0 and 299, which is the case (since the labels are between 0 and 9). So the network simply learns to ignore classes 10 to 299.
I'll fix this today, thanks a lot for your feedback, this is very helpful. :)

Anonymous 
Sep 08, 2017 
Nov 03, 2017 
Printed 
Page 425
Under a note 
(1st edition)
A paper link (https://goo.gl/R5L7HJ) is broken.
Please refer to this one (http://www.iro.umontreal.ca/~lisa/pointeurs/BengioNips2006All.pdf).
Thanks.
Note from the Author or Editor: Thanks Haesun. I updated the short link to: https://goo.gl/smywDc
It points to: https://papers.nips.cc/paper/3048greedylayerwisetrainingofdeepnetworks.pdf
which seems more likely to be stable, given that it's hosted by nips.cc instead of a user folder.
Thanks again!

Haesun Park 
Dec 27, 2017 
Oct 12, 2018 
Printed 
Page 436
Exercises 8 
In first bullet of Ex 8,
A short url for "download_and_convert_data.py" is broken.
It should be linked to "https://github.com/tensorflow/models/blob/master/research/slim/download_and_convert_data.py"
Thanks.
Note from the Author or Editor: Thanks Haesun. Yikes, it's the second time this link breaks, they keep moving folders around. Perhaps I should point to a search query instead. ;)
For now, I've updated the link to this short link: https://goo.gl/fmbnyg

Haesun Park 
Dec 29, 2017 
Oct 12, 2018 
Printed 
Page 437
Exercises 9 
(1st edition) In last bullet of Ex 9,
"Jinma Gua" should be "Jinma Guo"
Thanks :)
Note from the Author or Editor: Good catch, thanks Haesun. Fixed to Guo.

Haesun Park 
Dec 29, 2017 
Oct 12, 2018 
Printed 
Page 439
footnote 1. 
(1st edition)
In footnote 1, the RL book link (https://goo.gl/7utZaz) is broken.
Please refer to this one (http://www.incompleteideas.net/book/thebook2nd.html).
Thanks.
Note from the Author or Editor: Thanks Haesun. I actually fixed this link already in the latest release:
https://goo.gl/K1Gibs
But your link is better, as it points to the latest edition, so I'm updating it to:
https://goo.gl/AZzunZ
Thanks again!

Haesun Park 
Jan 02, 2018 
Oct 12, 2018 
Printed 
Page 441
bottom 
More information on RL: see footnote 1:
https://goo.gl/7utZaz
This link cannot be opened and returns a "404 Not Found" error.
Note from the Author or Editor: Thanks for your feedback. Indeed, this URL was broken, I fixed it in the latest release. The new URL for this book is: https://goo.gl/AZzunZ
Thanks again,
Aurélien

Anonymous 
Apr 01, 2018 
Oct 12, 2018 
Printed 
Page 446
2nd code block 
The code shows the creation of "your first environment" and reads:
>>> import gym
>>> env = gym.make("CartPole-v0")
[2016-10-14 16:03:23,199] Making new env: MsPacman-v0
[...]
The output was probably copied from further on in the chapter, since it should be (and I quote my own output)
[2017-09-13 10:48:27,402] Making new env: CartPole-v0
Note from the Author or Editor: Nice catch, thanks! I fixed this, the next digital and paper editions should be good.

Francesco Siani 
Sep 13, 2017 
Nov 03, 2017 
Printed, Safari Books Online 
Page 449
Last paragraph 
In Chapter 16,
The discount rate (r) is different from the discount factor (\gamma).
Discount factor \gamma = 1/(1+r).
So I recommend:
'discount rate' in text should be 'discount factor'.
'discount_rate' in code should be 'discount_factor'.
Thanks.
Note from the Author or Editor: Thanks Haesun, that's a good point. I used "discount rate" to mean "discount factor", and I have seen several people do the same, but you are right that it's clearer to replace "discount rate" with "discount factor" everywhere in chapter 16. I just did this. In the code examples, I have a constraint of using 80 characters max per line, so I cannot easily replace discount_rate with discount_factor, so instead I replaced discount_rate with gamma, with a comment in the code every time I define gamma, for example:
gamma = 0.95 # the discount factor
Also, the first time that the discount factor is introduced (just before Figure 16-6), instead of naming it "r", I named it gamma. This avoids possible confusion with rewards (which are named "r") later in the chapter, and it also makes the chapter more consistent.
Thanks for your suggestion!
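The distinction between the two terms can be shown with a short sketch. The relation gamma = 1/(1+r) is the one given in the submission above; the helper names and the example reward sequence are assumptions for illustration.

```python
def discount_factor(discount_rate):
    """Convert a discount rate r into a discount factor gamma = 1 / (1 + r)."""
    return 1.0 / (1.0 + discount_rate)


def discounted_return(rewards, gamma):
    """Sum of rewards where the reward received k steps later is weighted by gamma**k."""
    return sum(reward * gamma ** k for k, reward in enumerate(rewards))


gamma = discount_factor(0.25)  # the discount factor for a 25% discount rate
print(gamma)
print(discounted_return([10, 0, -50], gamma))
```

A discount rate of 0.25 therefore corresponds to a discount factor of 0.8, which is why using one name for the other can mislead readers.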

Haesun Park 
Jan 11, 2018 
Oct 12, 2018 
Printed 
Page 456
2nd paragraph 
Labels (a1, a2, s2, s3) in the text for Figure 16-8 are incorrectly printed.
Note from the Author or Editor: Thanks for your feedback. I'm not sure exactly what you mean by "printed incorrectly". Are you referring to the text font (I am not seeing a problem)? Or to the fact that the text contained a couple of errors (e.g., inverted a1 and a2, and s2 and s3)? I assume it's the latter. I fixed these errors:
BEFORE: In state _s_~1~ it has only two possible actions: _a_~0~ or _a_~1~. It can choose to stay put by repeatedly choosing action _a_~1~, or it can choose to move on to state _s_~2~ and get a negative reward of -50 (ouch). In state _s_~3~ it has no other choice [...] and in state _s_~3~ the agent has no choice but to take action [...].
AFTER: In state _s_~1~ it has only two possible actions: _a_~0~ or _a_~2~. It can choose to stay put by repeatedly choosing action _a_~0~, or it can choose to move on to state _s_~2~ and get a negative reward of -50 (ouch). In state _s_~2~ it has no other choice [...] and in state _s_~2~ the agent has no choice but to take action [...]
Thanks again!
Aurélien

Yevgeniy Davletshin 
Jun 15, 2017 
Aug 18, 2017 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 459
Code section under "Now let’s run the Q-Value Iteration algorithm" 
The learning_rate is defined as 0.01 but it is never used. It is not a real problem as the algorithm doesn't take a learning rate. But it is confusing when you read the code.
Note from the Author or Editor: Indeed, the learning_rate is unused in this code, I just removed it.
Thanks!
Aurélien Géron

Hei 
Jun 23, 2017 
Aug 18, 2017 
Printed 
Page 459
Code at top of page 
Comparing the contents of the array R in the Python code to Figure 16-8, I believe the second occurrence of the value 10.0 in the definition of R should be 0.0, in this code:
R = np.array([ # shape=[s, a, s']
    :
    [[10., 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.]],
    :
])
There is no transition from state S1 through action a0 back to state S0, so a reward of +10 does not do anything here.
It also does no harm, but it may reduce some confusion :)
Note from the Author or Editor: Ha! Good catch! :) Indeed, that line should read:
[[0.0, 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.]],
As you point out, this is a reward for a transition that has 0% probability, so it doesn't change the result, but I agree that it's potentially confusing. I've fixed it in the book (the notebook was already okay, somehow, I must have noticed the issue in the notebook at some point, but forgot to fix it in the book).
Thanks for your help! :)

Wouter Hobers 
Sep 28, 2017 
Nov 03, 2017 
PDF 
Page 460
Equation 16-6 
\max_{\alpha'} should say \max_{a'}
Note from the Author or Editor: Great catch, thanks! Alpha looks so much like "a", especially in small font like this, I would never have noticed. The error is fixed now.

joseluisfb 
Apr 18, 2018 
Oct 12, 2018 
PDF 
Page 461
Q-Learning code example 
I've downloaded the latest version of the PDF and ePub from my O'Reilly account.
The PDF is version 2017-06-09.
The ePub is version 2017-06-09.
Concern

The code for Q-Learning does not seem to match Equation 16-5, in particular in how the learning rate (aka alpha) is used.
Currently the code reads:
[...]
Q[s, a] = learning_rate * Q[s, a] + (1 - learning_rate) * (
reward + discount_rate * np.max(Q[sp])
)
[...]
To agree with Equation 16-5 it should be:
[...]
Q[s, a] = (1 - learning_rate) * Q[s, a] + learning_rate * (
reward + discount_rate * np.max(Q[sp])
)
[...]
I've checked through the Jupyter notebook for the Reinforcement Learning chapter and it appears to agree with Equation 16-5, although the code is set up a little differently.
Aside

It looks like this latest version of the PDF and ePub doesn't include some of the corrections that were previously fixed.
e.g. The description of the states in Figure 16-7 (p. 456 of the PDF) still refers to the nonexistent state s3. I haven't checked any other fixed errata, but could it be that O'Reilly has not correctly set up or linked to the newest, up-to-date version? Just adding another 'data point' to hopefully help if it's confusing others.
Or I'm doing something wrong. I don't know.
Almost at the end :) Thanks again for the book, stuff is finally 'clicking'.
Note from the Author or Editor: Good catch, thanks! Indeed, the code was wrong: it should have been as you said, swapping (1 - learning_rate) and learning_rate, just like in Equation 16-5. I just pushed the fix to O'Reilly's git repo, so both the digital editions and new printed books should be fixed within the next couple of weeks.
Regarding the description of Figure 16-7, it is normal that it mentions state s3 since that state exists on the figure. However, if you see state s3 still mentioned in the description of Figure 16-8, then there's a problem. I'll contact O'Reilly to make 100% sure that all the digital editions are up to date.
Note: I have sync'ed the code examples from all chapters with the code in the Jupyter notebooks, except for chapters 15 and 16, which are not 100% synchronized yet.
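For readers who want to check the corrected update rule numerically, here is a minimal sketch. The function name `q_learning_update`, the table size, and all numeric values are illustrative assumptions, not code from the book or its notebook.

```python
import numpy as np

def q_learning_update(Q, s, a, reward, sp, learning_rate, discount_rate):
    """Blend the old estimate with the new target, weighting the target by learning_rate."""
    target = reward + discount_rate * np.max(Q[sp])
    Q[s, a] = (1 - learning_rate) * Q[s, a] + learning_rate * target
    return Q

# Hypothetical 3-state, 3-action table with one known Q-Value
Q = np.zeros((3, 3))
Q[2, 1] = 40.0

Q = q_learning_update(Q, s=1, a=2, reward=-50.0, sp=2,
                      learning_rate=0.1, discount_rate=0.5)
print(Q[1, 2])
```

With a learning rate of 0.1, the entry moves one tenth of the way from its old value (0) toward the target (-30). With the two weights swapped, as in the erroneous version, it would move nine tenths of the way instead.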

andre trosky 
Jul 20, 2017 
Aug 18, 2017 
Printed, Safari Books Online 
Page 461
code block 
(In revised Printed Version and Safari Online)
Above 'Exploration Policies', the Q[s, a] assignment needs a closing parenthesis.
Q[s, a] = ((1 - learning_rate) * Q[s, a] +
learning_rate * (reward + discount_rate * np.max(Q[sp]))
should be
Q[s, a] = ((1 - learning_rate) * Q[s, a] +
learning_rate * (reward + discount_rate * np.max(Q[sp])))
Also, in the small code block for the X_action placeholder and q_value on page 468,
the loss should be calculated from online_q_values, not target_q_values.
q_value = tf.reduce_sum(target_q_values * tf.one_hot(X_action, n_outputs),
axis=1, keep_dims=True)
should be
q_value = tf.reduce_sum(online_q_values * tf.one_hot(X_action, n_outputs),
axis=1, keep_dims=True)
Thanks.
Note from the Author or Editor: Thanks Haesun, indeed there was a missing closing parenthesis. I just fixed this.

Haesun Park 
Jan 10, 2018 
Oct 12, 2018 
Printed 
Page 469
the main loop 
Dear Mr. Géron,
First thank you very much for the wonderful book!
I am a bit confused when comparing the book with the Nature paper "Human-level control through deep reinforcement learning". Please see Algorithm 1 in Methods.
Is there an exact correspondence between actor/critic in your book, and theta/theta^ in the paper? In the paper, theta plays AND learns, whereas in the book the actor plays and the critic learns.
Thank you again for the book and for your precious time!
All the best,
Yehua
Note from the Author or Editor: Thanks a lot for your question, you helped me find the worst errors so far in the book. I fixed the Jupyter notebook for chapter 16 and I added a message at the beginning of the "Learning to play MsPacman with the DQN algorithm" section with the details of the errors:
1. The actor DQN and critic DQN should have been named "online DQN" and "target DQN" respectively. Actor-critic algorithms are a distinct class of algorithms.
2. The online DQN is the one that learns and is copied to the target DQN at regular intervals. The target DQN's only role is to estimate the next state's Q-Values for each possible action. This is needed to compute the target Q-Values for training the online DQN, as shown in this equation:
y(s,a) = r + g * max_a' Q_target(s′,a′)
* y(s,a) is the target Q-Value to train the online DQN for the state-action pair (s,a).
* r is the reward actually collected after playing action a in state s.
* g is the discount rate.
* s′ is the state actually reached after playing action a in state s.
* a′ is one of the possible actions in state s′.
* max_a' means "max over all possible actions a′"
* Q_target(s′,a′) is the target DQN's estimate of the Q-Value of playing action a′ while in state s′.
In regular approximate Q-Learning, there would be a single model Q(s,a), which would be used both for predicting Q(s,a) and for computing the target using the equation above (which involves Q(s′,a′)). That's a bit like a dog chasing its tail: the model builds its own target, so there can be feedback loops, which can result in instabilities (oscillations, divergence, freezing, and so on). By having a separate model for building the targets, and by not updating it too often, feedback loops are much less likely to affect training.
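As an illustration of the target equation above (this sketch is not from the book or the notebook), the targets for a small batch can be computed as follows. The function name `compute_targets` and the numeric values are assumptions, and terminal states are deliberately ignored to keep the example short.

```python
import numpy as np

def compute_targets(rewards, next_q_values_target, discount_rate):
    # next_q_values_target holds Q_target(s', a') for each sample,
    # with shape [batch_size, n_actions]
    max_next_q = np.max(next_q_values_target, axis=1)  # max over a'
    return rewards + discount_rate * max_next_q        # y = r + g * max_a' Q_target(s', a')

# Hypothetical batch of two transitions with three possible actions
rewards = np.array([1.0, 0.0])
next_q = np.array([[0.5, 2.0, 1.0],
                   [3.0, 0.0, -1.0]])

y = compute_targets(rewards, next_q, discount_rate=0.9)
print(y)
```

The key point is that `next_q` would come from the target DQN, while the values being trained toward `y` come from the online DQN.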
Apart from that I tweaked a few hyperparameters and I updated the cost function, but those are minor details in comparison.
I hope these errors did not affect you too much, and if they did, I sincerely apologize.
Postmortem, lessons I learned:
1. Spend more time reading the original papers and less time (mis)interpreting people's various implementations.
2. Use proper metrics to observe progress (e.g., track the max Q-Value or the total rewards per game), instead of falling into the confirmation bias trap of thinking that the agent is making progress when it is not. Testing on a simpler problem first would also have been a good idea.
3. Be extra careful when you reach the final section of the final chapter: that's when you're most tempted to rush and make mistakes.
Again, I would like to thank you for bringing this issue to my attention, it's great to get such constructive feedback.
Cheers,
Aurélien Géron

Yehua Liu 
Aug 10, 2017 
Nov 03, 2017 
Printed 
Page 474
2nd line in 1st code segment 
I think q_value should be calculated using online_q_values instead of target_q_values.
Great book, super useful and clear, thanks!
Note from the Author or Editor: Great catch! Indeed, it should be online_q_values instead of target_q_values, thanks a lot!
(I just checked, the Jupyter notebook was okay, so I guess I fixed the notebook some time ago, and I forgot to fix the text, sorry about that).
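To illustrate how a one-hot mask selects the Q-Value of the action that was actually played, here is a NumPy sketch. It uses `np.eye` indexing as a stand-in for `tf.one_hot` and `np.sum(..., keepdims=True)` as a stand-in for `tf.reduce_sum`; the array contents are made-up values.

```python
import numpy as np

n_outputs = 3
# Hypothetical Q-Values predicted by the online DQN, shape [batch_size, n_outputs]
online_q_values = np.array([[0.1, 0.7, 0.2],
                            [1.5, -0.3, 0.4]])
X_action = np.array([1, 0])  # index of the action played in each sample

one_hot = np.eye(n_outputs)[X_action]  # one row per sample, a single 1 at the played action
q_value = np.sum(online_q_values * one_hot, axis=1, keepdims=True)
print(q_value)
```

Since the loss compares `q_value` with the training target, it has to be built from the network being trained (the online DQN). Using the target DQN's outputs here would leave nothing for the gradients to update.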

Sebastian Lehner 
Dec 02, 2018 
Mar 08, 2019 
Printed, Safari Books Online 
Page 479
Chapter 6's ex. 2 
Chapter 6's ex. 2
Gini impurity calculation looks like 1 - 1^2/5 - 4^2/5 = 0.32 and 1 - 1^2/2 - 1^2/2 = 0.5
Adding parentheses is better, e.g. 1 - (1/5)^2 - (4/5)^2 = 0.32
Thanks.
Note from the Author or Editor: Thanks Haesun, indeed this notation can be confusing. I added parentheses.
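The two values quoted in this erratum can be checked with a short sketch. The Gini impurity of a node is 1 minus the sum of the squared class ratios; the helper name `gini_impurity` is an assumption chosen for illustration.

```python
def gini_impurity(class_counts):
    """Return 1 minus the sum of squared class ratios for a node."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

# Node with 1 instance of one class and 4 of another: 1 - (1/5)^2 - (4/5)^2
g_five = gini_impurity([1, 4])
# Node with 1 instance of each class: 1 - (1/2)^2 - (1/2)^2
g_two = gini_impurity([1, 1])
print(round(g_five, 2), round(g_two, 2))
```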

Haesun Park 
Jan 20, 2018 
Oct 12, 2018 
Printed, PDF, ePub, Mobi, Safari Books Online, Other Digital Version 
Page 486
Last bullet point 
In the last subquestion of the chapter 10 exercises you ask us to write the equation that computes the network output matrix Y as a function of X, W_h, b_h, W_o, and b_o.
You give the solution as follows.
Y = (X \cdot W_h + b_h) \cdot W_o + b_o
I understand why this equation for Y could be correct, but only if we ignore the ReLU activation functions of all the artificial neurons.
It seems the solution would change when considering the activation functions of the 50 artificial neurons in the hidden layer and the 3 artificial neurons in the output layer, which all have ReLU activation.
When considering the ReLU activation of the 53 total artificial neurons, would this be a correct equation?
Y = max(max(X \cdot W_h + b_h, 0) \cdot W_o + b_o, 0)
Regardless of whether my equation is correct, I think this would be a more complete and informative exercise if you provided how the equation provided as the solution in the appendix would change (or not) when we consider the ReLU activation functions that you posed in the original question.
Otherwise, this is a very good and helpful exercise!
Note from the Author or Editor: Good catch, you are right, I forgot the ReLU activations! :( The answer should indeed be:
Y = max(max(X \cdot W_h + b_h, 0) \cdot W_o + b_o, 0)
It's also fine to write ReLU(z) instead of max(z, 0):
Y = ReLU(ReLU(X \cdot W_h + b_h) \cdot W_o + b_o)
I just updated the book; the digital versions will be updated within a couple of weeks.
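The corrected equation translates directly into NumPy. In the sketch below, the 50 hidden neurons and 3 output neurons come from the exercise as described above; the number of input features (10), the batch size (4), and the random weights are assumptions made only so the example can run.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def mlp_output(X, W_h, b_h, W_o, b_o):
    # Y = ReLU(ReLU(X . W_h + b_h) . W_o + b_o)
    return relu(relu(X @ W_h + b_h) @ W_o + b_o)

rng = np.random.default_rng(0)
m, n_inputs, n_hidden, n_outputs = 4, 10, 50, 3  # m and n_inputs are assumed values

X = rng.normal(size=(m, n_inputs))
W_h = rng.normal(size=(n_inputs, n_hidden))
b_h = rng.normal(size=n_hidden)
W_o = rng.normal(size=(n_hidden, n_outputs))
b_o = rng.normal(size=n_outputs)

Y = mlp_output(X, W_h, b_h, W_o, b_o)
print(Y.shape)
```

The bias vectors are broadcast across the batch dimension, so `Y` has one row per instance and one column per output neuron, and every entry is nonnegative because of the final ReLU.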

Shane 
May 27, 2017 
Jun 09, 2017 
Printed 
Page 491
3rd Paragraph 
The answer to the second part of question 2 in Chapter 13: Convolutional Neural Networks reads:
"...this first layer takes up 4 x 100 x 150 x 100 = 6 million bytes (about 5.7 MB)...The second layer takes up 4 x 50 x 75 x 200 = 3 million bytes (about 2.9 MB). Finally, the third layer takes up 4 x 25 x 38 x 400 = 1,520,000 bytes (about 1.4 MB). However, once a layer has been computed, the memory occupied by the previous layer can be released, so if everything is well optimized, only 6 + 9 = 15 billion bytes (about 14.3 MB) of RAM will be required (when the second layer has just been computed, but the memory occupied by the first layer is not released yet)."
For the situation described, if both the first and second layers are in memory, would that not be 3 + 6 = 9 million bytes (8.58 MB) of RAM required? When you add the amount occupied by the CNN's parameters (3,613,600 bytes) that would be a total of about 12 MB for predicting a single instance.
I could also be missing something really obvious so sorry if that is the case. Either way, thanks for the great, enjoyable book!
Note from the Author or Editor: You are correct, I have no idea why I wrote 6+9 instead of 6+3. Thanks a lot!
I just fixed the paragraph like this:
"""
However, once a layer has been computed, the memory occupied by the previous layer can be released, so if everything is well optimized, only 6 + 3 = 9 million bytes (about 8.6 MB) of RAM will be required (when the second layer has just been computed, but the memory occupied by the first layer is not released yet). But wait, you also need to add the memory occupied by the CNN's parameters. We computed earlier that it has 903,400 parameters, each using up 4 bytes, so this adds 3,613,600 bytes (about 3.4 MB). The total RAM required is (at least) 12,613,600 bytes (about 12.0 MB).
"""
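The arithmetic in the corrected paragraph can be reproduced with a few lines of Python. The layer shapes, the parameter count, and the 4 bytes per value are the figures quoted above; the variable names and the `to_mb` helper are illustrative assumptions.

```python
BYTES_PER_VALUE = 4  # 32-bit floats, as in the quoted answer

layer1 = BYTES_PER_VALUE * 100 * 150 * 100   # first layer's feature maps
layer2 = BYTES_PER_VALUE * 50 * 75 * 200     # second layer's feature maps
layer3 = BYTES_PER_VALUE * 25 * 38 * 400     # third layer's feature maps
params = BYTES_PER_VALUE * 903_400           # the CNN's parameters

# Peak activation memory: layers 1 and 2 are both held in memory at the same time
peak_activations = layer1 + layer2
total = peak_activations + params

def to_mb(n_bytes):
    return n_bytes / 1024 ** 2

print(peak_activations, round(to_mb(peak_activations), 1))
print(total, round(to_mb(total), 1))
```

The peak occurs with layers 1 and 2 in memory because that pair is larger than layers 2 and 3 together (3,000,000 + 1,520,000 bytes).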

Will Koehrsen 
Jul 21, 2017 
Aug 18, 2017 
PDF, Safari Books Online 
Page 523
Last paragraph 
1st edition 5th release.
If a Hopfield net contains 36 neurons, the total number of connections is 630 (= 36*35/2), not 648. :)
Thanks.
Note from the Author or Editor: Good catch! Of course if there are n neurons, then there are 1+2+3+...+(n-1) = (n - 1) * n / 2 connections. It seems that I computed 36*36/2 instead of 35*36/2, probably a typo on my calculator. :/
Fixed, thanks once more!
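A quick check of the connection count, assuming one connection per unordered pair of distinct neurons (the function name `n_connections` is hypothetical):

```python
from itertools import combinations

def n_connections(n_neurons):
    # One connection per unordered pair of distinct neurons: n * (n - 1) / 2
    return n_neurons * (n_neurons - 1) // 2

count = n_connections(36)
# Cross-check by enumerating every pair explicitly
pairs = len(list(combinations(range(36), 2)))
print(count, pairs)
```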

Haesun Park 
Mar 08, 2018 
Oct 12, 2018 
Printed 
Page 550
Colophon 
My friend showed me his favourite book at the time, and I wondered about the salamander mascot on the cover. I read the explanation on the last page. Sorry, but the salamander shown is definitely not Salamandra infraimmaculata, but our native species Salamandra salamandra (I am a German biologist). S. infraimmaculata would have a more rounded head, a slightly different pattern, and, very importantly, NO black-pigmented ends of the parotoid excretory ducts, among other things.
So I checked the original source, "The Illustrated Natural History". There, the amphibian in your picture is specified as S. maculata. This epithet was used for S. salamandra until 1955 and is now synonymous with it.
Moreover, you described the Near Eastern fire salamander (Salamandra infraimmaculata) as a "Far Eastern fire salamander found in the Middle East". This is very confusing and incorrect. Furthermore, no species of fire salamander "lays their eggs in the water". In contrast to common frogs, fire salamanders are ovoviviparous: they deposit living larvae into the water.
Note from the Author or Editor: Thanks a lot for your very interesting feedback. I will forward your message to O'Reilly: they are the ones who select the animals on the book covers, and who write the corresponding text. Hopefully, they will fix this by the next release of the book.

Dr. Verena Wilhelmi 
Oct 23, 2018 
Dec 07, 2018 
