The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
Version |
Location |
Description |
Submitted By |
Date submitted |
Date corrected |
PDF, ePub, Mobi |
In 'Execution Phase' |
In Chapter Ten, 'Execution Phase'
Text currently says
"Next, at the end of each epoch, the code evaluates the model on the last mini-batch and on the full training set, and it prints out the result."
I believe it should read
"Next, at the end of each epoch, the code evaluates the model on the last mini-batch and on the full test set, and it prints out the result."
Test not training.
Note from the Author or Editor: Indeed, it should be "test" instead of "training", good catch.
|
Kendra Vant |
Apr 08, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page Online
Chapter 11, Reusing Pretrained Layers |
Just a word order switch typo in the Note:
"More generally, transfer learning will work only well if the inputs have similar low-level features."
should rather be
"More generally, transfer learning will only work well if the inputs have similar low-level features."
'work' and 'only' reversed order.
Note from the Author or Editor: Thank you, indeed this is my French brain interfering with my writing! ;)
|
Kendra Vant |
Apr 08, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Chapter 16, Policy Gradients section, below "On to the execution phase!", about 40% down the online page |
The code for the execution phase is missing a parameter in the 'discount_and_normalize_rewards' function: this function calls the discount_rewards function and assigns the return value to 'all_discounted_rewards' but only passes one parameter, where discount_rewards expects two parameters. The github code for discount_and_normalize_rewards is correct; the online/Safari book code is incorrect.
def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]  # <<<ISSUE IS HERE, MISSING PARAM>>>
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std for discounted_rewards in all_discounted_rewards]
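For reference, a minimal sketch of the discount_rewards() helper this function relies on, following the chapter's notebook (treat the exact implementation as illustrative):
import numpy as np

def discount_rewards(rewards, discount_rate):
    # walk backwards through one episode's rewards, accumulating discounted returns
    discounted_rewards = np.empty(len(rewards))
    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards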
Note from the Author or Editor: Good catch, thank you! I tested every single code example before adding it to the book, but it seems that I made a modification to the notebook and forgot to update the book. I fixed the error, it will be reflected in the digital versions within the next few weeks.
|
Steve Dotson |
Apr 16, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page ch 14
Equation 14-4. GRU computations |
Something is wrong with the last equation, h(t) = (1 - z(t)) ⊗ tanh(W_xg^T · h(t-1) + z(t) ⊗ g(t))
I think it should be: h(t) = (1 - z(t)) ⊗ h(t-1) + z(t) ⊗ g(t)
Note from the Author or Editor: Yes indeed, you are absolutely right, I don't know what I was thinking when I wrote this equation, I apologize. The correct equation to compute h(t) is, as you wrote:
h(t) = (1 - z(t)) ⊗ h(t-1) + z(t) ⊗ g(t)
The equation in latexmath format is:
\mathbf{h}_{(t)}&=(1-\mathbf{z}_{(t)}) \otimes \mathbf{h}_{(t-1)} + \mathbf{z}_{(t)} \otimes \mathbf{g}_{(t)}
Thanks again for contributing to improving this book, I hope you are enjoying it.
|
Per Thorell |
May 06, 2017 |
Jun 09, 2017 |
|
Ch. 2, Select a Performance Measure, 3rd bullet point |
Regarding the L0 norm, the text says "L0 just gives the cardinality of the vector (i.e., the number of elements)...". It may be clearer if the text says: "the number of non-zero elements"
Note from the Author or Editor: Good point, thanks. I actually fixed this a few weeks ago. The online version and the latest printed copies should be fixed by now. The sentence is now:
"ℓ0 just gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector."
|
Eric T. |
Jun 09, 2017 |
Aug 18, 2017 |
Other Digital Version |
kindle 1127
Above Figure 2-13 |
On Figure 2-13 the axis values and legend are not shown. This is due to a bug with %matplotlib inline on "scatter" plots: the attributes are hidden. Below you can find a temporary solution using sharex=False to restore visibility. The comment line cites the source for the solution.
housing2.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
              s=housing2["population"]/100, label="population",
              c="median_house_value", cmap=plt.get_cmap("jet"), sharex=False)
# sharex=False fixes a bug. Temporary solution. See: https://github.com/pandas-dev/pandas/issues/10611
Note from the Author or Editor: I just love it when people come with both the problem and the solution! :)
I just tried your bug fix and it works fine, thanks a lot.
|
Wilmer Arellano |
Jun 05, 2017 |
Jun 09, 2017 |
Other Digital Version |
Kindle Loc 1141
After Figure 2-13 |
Values obtained from running the code are different from what is printed in the book:
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
Name: median_house_value, dtype: float64
A previous table seems to indicate that the csv file is fine:
housing["income_cat"].value_counts() / len(housing)
3.0 0.350581
2.0 0.318847
4.0 0.176308
5.0 0.114438
1.0 0.039826
Name: income_cat, dtype: float64
Here, running the code produces the same results as the book.
Why the difference in the first table?
Thank you.
Note from the Author or Editor: Thanks for your feedback.
I am adding the following note to the Jupyter notebooks:
"You may find little differences between the code outputs in the book and in the Jupyter notebooks: these slight differences are mostly due to the random nature of many training algorithms: although I have tried to make these notebooks' outputs as constant as possible, it is impossible to guarantee that they will produce the exact same output on every platform. Also, some data structures (such as dictionaries) do not preserve the item order. Finally, I fixed a few minor bugs (I am currently adding notes next to the concerned cells) which lead to slightly different results, without changing the ideas presented in the book."
In this particular case, I think the difference is probably due to the fact that the training set was initially sampled differently (in fact, it had one more item). When I tweaked the notebook and ran it again, I updated the code and code outputs in the book, but I forgot to update a few outputs (probably because they look so similar). You may find a few other differences, but as I mentioned they really don't change the ideas discussed in the book. I recently fixed them, so the online and future paper reprints will be more consistent with the notebooks.
Thanks again!
|
Wilmer Arellano |
Jun 07, 2017 |
Aug 18, 2017 |
|
Batch Gradient Descent
Bottom Box |
I find the discussion of convergence rate for Batch Gradient Descent a bit hard to follow. First of all, the relation between epsilon and convergence rate is never formally defined, so the simple math example you give does not immediately follow for me. I think the discussion would make more sense if it were written that the number of needed iterations is of order O(1/epsilon), which I'm pretty sure is correct.
Note from the Author or Editor: Good point, thanks for your feedback. This paragraph does need some clarification. I meant to say that the distance between the current point and the optimal point shrinks at the same rate as 1/iteration. However this depends on the cost function's shape (convergence is much faster if the cost function is strongly convex). I propose to replace the paragraph with this one:
When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), Batch Gradient Descent with a fixed learning rate will eventually converge to the optimal solution, but you may have to wait a while: it can take O(1/epsilon) iterations to reach the optimum with a tolerance of epsilon, depending on the shape of the cost function. If you divide the tolerance by 10 to have a more precise solution, then the algorithm will have to run about 10 times longer.
If you are interested, this post by RadhaKrishna Ganti goes into much more depth:
https://rkganti.wordpress.com/2015/08/21/convergence-rate-of-gradient-descent-algorithm/
Or this post by Sebastien Bubeck:
https://blogs.princeton.edu/imabandit/2013/04/04/orf523-strong-convexity/
Or there is the "Convex Optimization" book by Stephen Boyd and Lieven Vandenberghe:
https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
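As a rough, self-contained illustration (not from the book) of batch gradient descent with a fixed learning rate and a tolerance-based stopping rule; as the note says, the exact iteration count depends on the shape of the cost function:
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]              # add the bias input x0 = 1

eta = 0.1                                      # fixed learning rate
epsilon = 1e-3                                 # tolerance on the gradient norm
theta = np.random.randn(2, 1)
iterations = 0
while True:
    gradients = 2 / len(X_b) * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < epsilon:    # stop once the gradient is "small enough"
        break
    theta = theta - eta * gradients
    iterations += 1
print(iterations, theta.ravel())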
|
Anonymous |
Jul 09, 2017 |
Aug 18, 2017 |
Printed |
Page xiii
2nd paragraph |
The word "and" is misplaced in the second paragraph, first sentence. It currently reads:
"...and recommending videos, beating the world champion at the game of Go."
It should read:
"...recommending videos, and beating the world champion at the game of Go."
Note from the Author or Editor: Indeed, good catch, thanks! Fixed.
|
Daniel J Barrett |
Jan 30, 2018 |
Oct 12, 2018 |
Printed |
Cover spine |
The title of book is written on the spine as follows:
"Hands-On Machine Learning
with Scikitt-Learn & TensorFlow"
Scikit is mistakenly spelled with an extra "t".
|
Jeremy Joseph |
Feb 22, 2018 |
Oct 12, 2018 |
|
chapter 5
sentence immediately before "Online SVMs" heading |
From book: "it’s an unfortunate side effects of the kernel trick."
Problem: "an" requires a singular noun, but "effects" is a plural noun.
Note from the Author or Editor: Indeed, thanks! I just fixed the mistake (an unfortunate side effects=>an unfortunate side effect).
|
Anonymous |
Mar 21, 2018 |
Oct 12, 2018 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 19
Equation 1-1 |
"life_satisfaction" has been formatted like a formula definition, with extra space around each "f".
Note from the Author or Editor: Thanks. I updated the latex code:
Before:
life\_satisfaction = \theta_0 + \theta_1 \times GDP\_per\_capita
After:
\text{life_satisfaction} = \theta_0 + \theta_1 \times \text{GDP_per_capita}
|
anthonyelizondo |
Apr 26, 2017 |
Jun 09, 2017 |
PDF |
Page 19
Equation 1-1 |
1st Edition 2nd Release,
\theta_0 is missing in Equation 1-1. :)
|
Haesun Park |
Jun 11, 2017 |
Jun 12, 2017 |
PDF |
Page 26
Last word in first line |
"loo-sing" (hyphenated across lines) should be losing.
Note from the Author or Editor: Thanks, this is fixed now.
Aurélien
|
C.R. Myers |
Sep 02, 2016 |
Mar 10, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 30
Last paragraph |
In "No Free Lunch Theorem" side note,
"http://goo.gl/3zaHIZ" is broken.
I found another one,
https://www.researchgate.net/profile/David_Wolpert/publication/2755783_The_Lack_of_A_Priori_Distinctions_Between_Learning_Algorithms/links/54242c890cf238c6ea6e973c/The-Lack-of-A-Priori-Distinctions-Between-Learning-Algorithms.pdf
Note from the Author or Editor: Thanks, indeed the page seems to have been removed. Perhaps linking to a Google Scholar search will be more stable: https://goo.gl/dzp946
|
Haesun Park |
May 12, 2017 |
Jun 09, 2017 |
Printed |
Page 30
Footnote |
The reference for the "no free lunch" paper has the author name spelled as Wolperts but should be Wolpert (no final "s").
Note from the Author or Editor: Good catch, thanks! Error fixed.
|
Marco Cova |
Apr 07, 2018 |
Oct 12, 2018 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 37
3rd paragraph (1st paragraph under "Select a Performance Measure") |
It is stated that "It [RMSE] measures the standard deviation of the errors the system makes in its predictions". This is incorrect; the standard deviation is the square root of the variance (as noted by the author in a footnote), and though similar to RMSE, is not quite the same as it. Note that standard deviation is an "averaged" measure of deviation from the mean of the values, while RMSE is an "averaged" measure of deviation between the values themselves. Standard deviation measures the "spread" of the data from the mean, while RMSE measures the "distance" between the values.
This makes the subsequent statement "For example, an RMSE equal to...of the actual value." incorrect as well.
Please view the answer here for a very clear explanation of this:
https://stats.stackexchange.com/questions/242787/how-to-interpret-root-mean-squared-error-rmse-vs-standard-deviation
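To illustrate the distinction with toy numbers (not from the book): the RMSE and the standard deviation of the errors only coincide when the errors average out to zero.
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 9.0, 10.5])
errors = y_pred - y_true

rmse = np.sqrt(np.mean(errors ** 2))    # "typical" prediction error, large errors weighted more
std = errors.std()                      # spread of the errors around their own mean
print(rmse, std)                        # 0.866... vs 0.707...: they differ because the errors are biased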
Note from the Author or Editor: You are absolutely correct, thanks for your feedback. I am currently working on the French translation of this book, and I actually stumbled across this sentence just last week: my heart almost stopped! It was a great disappointment to find such an error despite all my efforts to check and double-check everything. So far, the other errors had mostly been typos, but this one is serious. :(
I rewrote the paragraph like so: "It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. Equation 2-1 shows the mathematical formula to compute the RMSE."
The digital versions will be updated within a few weeks.
My sincere apologies,
Aurélien
|
Jobin Idiculla |
May 20, 2017 |
Jun 09, 2017 |
Other Digital Version |
37, 38, ...
All display equations |
I bought the book from Amazon, in Kindle format.
I'm not sure if this is an O'Reilly problem or a Kindle problem, but most of the display equations look terrible: The math font is about five times larger than the text font, the symbols overlap, and some equations are clipped so that they are illegible.
(Other than that, I'm very happy with the actual content of the book.)
Note from the Author or Editor: Thanks for your feedback. I'm really sorry about this issue, I just forwarded your message to the production team at O'Reilly, I'll get back to you as soon as they answer (they are usually very responsive). Could you please specify which Kindle model you have exactly, it might be specific to a particular model, I'm not sure (I don't have a Kindle, so I can't reproduce the issue).
In the meantime I'll extract all the math equations from the book and post them to the github project (https://github.com/ageron/handson-ml).
Hope this helps,
Aurélien
|
Anonymous |
Jun 21, 2017 |
Jul 07, 2017 |
Other Digital Version |
39
Towards end (third of a series of bullet points about norm definitions) |
In Kindle edition, the inline formula for the l_k norm is unreadably small. (Earlier on the page, the formula for the Mean Absolute Error is enormous, but this is not a problem, just slightly poor formatting).
Note from the Author or Editor: Thanks for your feedback, and I'm very sorry for the problem you are experiencing.
We had this problem before, but I thought it was fixed around September. If you bought the book before that, could you please try updating it, hopefully this should fix the issue.
I will report this issue nonetheless to O'Reilly, just in case the problem came back for some reason. If this is so, then I will update this message.
When we had equation formatting problems last summer, I created a Jupyter notebook containing all the book's equations. You can get it here:
https://github.com/ageron/handson-ml/blob/master/book_equations.ipynb
Note that github's renderer does not display some of the equations properly, unfortunately, but if you download the notebook and run it in Jupyter, it will display the equations perfectly.
Thanks again for your feedback, and I hope you are enjoying the book despite this formatting issue.
Aurélien
|
Liam Roche |
Nov 23, 2017 |
Oct 12, 2018 |
PDF |
Page 39
The second paragraph below Equation 2-2 |
The "Euclidean norm" is misspelled as "Euclidian norm".
|
Anonymous |
Oct 04, 2018 |
Oct 12, 2018 |
Printed |
Page 41
second chunk of code in the box |
At least on my system (Ubuntu/Kubuntu), pip3 install --user installs virtualenv command in ~/.local/bin, which is not in my PATH. Calling virtualenv provokes a response that the user should install it using sudo apt-get install virtualenv. Doing that leads to problems with mixing versions. So a note on adding ~/.local/bin to the PATH could be useful for inexperienced python programmers like myself --- both in the book and on the github page. BTW, you complained in the errata that \mathbf{\theta} did not work. It should be \bm{\theta}.
Note from the Author or Editor: Hi Jan,
Thanks for your feedback. I'm sorry you had trouble with the installation instructions: I actually hesitated to add any installation instructions to my book, because it's really the sort of things that varies a lot across systems, and changes over time. I'll add a footnote as you suggest, it's a great idea.
Regarding the bold font theta, someone suggested using \bm instead of \mathbf a while ago, and I tried, but it did not work. For example, go to latex2png.com and try running "x \mathbf{x} \bf{x} \theta \mathbf{\theta} \bf {\theta}". I see a normal x, then 2 identical bold x, then 3 identical normal thetas. O'Reilly ended up converting many of the equations to MathML, and then it worked fine.
Cheers,
Aurélien
|
Jan Daciuk |
Nov 15, 2017 |
Oct 12, 2018 |
Printed |
Page 45
Figure 2-6 |
(First Edition)
In a screenshot of figure 2-6,
housing.info should be housing.info() as it is in the notebook on github
Note from the Author or Editor: Thanks for your feedback. The parentheses are actually in very light green in the original image, and when converted to black & white for the printed version, they almost disappear (if you look closely, you can barely see them in very light gray).
I've updated the image and contacted the production team to make sure they'll include the new image in future printed editions.
|
Haesun Park |
May 16, 2017 |
Jun 09, 2017 |
Printed |
Page 45
Figure 2-5 |
When I run the code in Figure 2-5 I get a FileNotFoundError:
file b'datasets\\housing\housing.csv' does not exist.
The code calls load_housing_data but I don't see where fetch_housing_data is called. You have to fetch the data to create the datasets/housing directory. What might I be missing?
Note from the Author or Editor: Thanks for your question. Indeed, you need to call the fetch_housing_data() function, or else the load_housing_data() function will not find the data file (housing.csv). Just below the definition of the fetch_housing_data() function, I wrote "Now when you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this directory". Perhaps I should have been more explicit: "You should now call the fetch_housing_data() function: it will create a datasets/housing directory in your workspace...", etc.
Hope this helps.
|
Iain Watson |
Dec 01, 2018 |
Dec 07, 2018 |
Printed |
Page 46
after 2nd paragraph |
The call to `value_counts()` is displayed as being executed in a standard Python REPL (e.g. with a `>>>` prompt) without any explanation. The use of the Python REPL continues on pages 49, 52, 56, etc.
Perhaps it is worth clarifying that `>>>` implies you can enter the code in the Jupyter notebook or in the REPL.
Note from the Author or Editor: Thanks for your feedback.
Regarding the usage of >>>, I use it for better readability when there's a mix of code and outputs.
For example, consider the following code:
a = 1
b = a + 3
c = a * b
Say I want to show the value of b, I could write:
a = 1
b = a + 3
print(b) # => 4
c = a * b
But that's a bit ugly, especially if the value of b is long or spans multiple lines. So instead I could do something like this:
Code:
a = 1
b = a + 3
print(b)
c = a * b
Output:
4
But then the reader has to go back and forth between the code and the output to understand everything. So perhaps this instead?
a = 1
b = a + 3
print(b)
# 4
c = a * b
That's not bad, actually, but I prefer the >>> notation, because it's more common for python code, it looks like I copy/pasted a piece of python console:
>>> a = 1
>>> b = a + 3
>>> b
4
>>> c = a * b
Now it looks exactly like what you would get in the interpreter, so hopefully it's both clear and natural.
But when there's nothing particular to display, I don't use >>>, I simply write the code. Perhaps this is what confused you? Why do I use this notation sometimes and not other times? I guess I could add a footnote for the first code example that uses this notation, something like this:
When a code example contains a mix of code and outputs, I will use the same format as in the python interpreter, for better readability: the code is prefixed with >>> (or ... for indented blocks), and the outputs have no prefix.
Thanks for the suggestion,
Cheers,
Aurélien
|
Anonymous |
Dec 19, 2017 |
Oct 12, 2018 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 47
2nd paragraph |
The text states "slightly over 800 districts have a median_house_value equal to about $500,000.
I suppose you meant "slightly over 1,000 districts", looking at the peak in the relevant histogram (the x-axis numbers overlap in the book, but it's the lonely peak at the right).
Unless you consider 1,000+ to also be "slightly over 800" :)
Note from the Author or Editor: Good catch, thanks! I actually meant to write "equal to about $100,000".
|
Wouter Hobers |
May 11, 2017 |
Jun 09, 2017 |
Printed |
Page 50
4 |
there's a minor error on page 50 that produces a bug for python version < 3
This line has the hash() function that returns the ascii code number in python 3:
return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
in python version 2 this returns a character which breaks the entire function.
Replacing the line with this fixes it:
if sys.version[0] == '3':
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
else:
    return ord(hash(np.int64(identifier)).digest()[-1]) < 256 * test_ratio
Cheers
Note from the Author or Editor: Thanks for your feedback. Indeed, this function only works with Python 3.
In the notebook, I proposed a version that supports both Python 2 and 3:
def test_set_check(identifier, test_ratio, hash):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio
It's kind of ugly, so I decided to just present the Python 3 version, but I should have added a comment to make it clear.
Side note: most scientific python libraries have announced that they will stop supporting Python 2 very shortly (e.g., NumPy will stop releasing new features in Python 2 at the end of this year, see https://python3statement.org/ for more details).
One problem with the implementation above is that it uses the MD5 hash and only looks at a single byte, so the cut between train and test is rather coarse. Since then, I found a better option using CRC32 (much faster and returning 4 bytes, so it's much more fine grained), which I will be proposing in future releases:
from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
This works just as well on Python 2 and Python 3 (in Python 3, you could remove "& 0xffffffff", which is only needed because crc32() returns a signed int32 in Python 2, while it is unsigned int32 in Python 3).
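For context, a typical way to call this splitter might look like the following (a hedged sketch; it assumes the chapter's housing DataFrame and uses the row index as a stable identifier):
housing_with_id = housing.reset_index()   # adds an `index` column to use as the identifier
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")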
Hope this helps!
Aurélien
|
Anonymous |
Apr 01, 2018 |
Oct 12, 2018 |
Printed |
Page 51
first full paragraph, 5th line |
Hi,
The third sentence of the first full paragraph on page 51 ends with "...they don't just pick 1,000 people randomly in a phone booth."
That's true, but I suspect you intended to write "phone book." (It's hard to fit 1K people into a phone booth.)
Thanks so much for writing this book!
Best,
--Jeff
Note from the Author or Editor: Hi Jeff,
Ha ha, very funny! :) I fixed the error, thanks a lot for your feedback and your sense of humor.
Cheers,
Aurélien
|
Jeff Lerman |
Nov 28, 2017 |
Jan 19, 2018 |
Printed |
Page 51
3rd paragraph |
You say "most median income values are clustered around $20,000-$50,000, but some media incomes go far beyond $60,000." However, as you mention on page 48, median income in not expressed in US dollars, e.g. "it has been scaled and capped at 15."
It would be clearer to refer to the scaled values since we don't know how they map to US dollars.
Note from the Author or Editor: Thanks for your feedback. Indeed, I forgot to mention that the median income values represent roughly tens of thousands of dollars (from 1990), so for example 3 actually represents roughly $30,000. My apologies! I updated the book to make this clear.
Hope this helps,
Aurélien
|
Anonymous |
Dec 19, 2017 |
Oct 12, 2018 |
Printed |
Page 51
2nd paragraph |
This is an erratum about the errata!
Many of the pages listed for errors in the printed version are incorrect. For example, the error that is reported as being on p. 73 (about Figure 2-8) is actually on p. 51.
Note from the Author or Editor: Thanks for your feedback. I fixed all the early errata, and sometimes this resulted in slightly longer or shorter paragraphs, so the text layout had to be adjusted. As a result, the pages mentioned in the earlier errata are slightly off (usually by a couple pages) in the latest releases. Since these errors concern only the earlier releases, we should probably keep the page numbers from these releases, don't you think? I'll talk to O'Reilly about this to see what we can do.
|
Peter Drake |
Mar 01, 2018 |
Oct 12, 2018 |
Printed |
Page 52
Figure 2-9 |
It would be nice to show the command used to generate "Figure 2-9 Histogram of income categories", perhaps in a footnote.
Note from the Author or Editor: Good suggestion, thanks. I added the line of code that plots this histogram:
housing["income_cat"].hist()
|
Anonymous |
Dec 19, 2017 |
Oct 12, 2018 |
Printed |
Page 52
2nd & 3rd Paragraph |
The 3rd paragraph on page 52, i.e.:
"Let's see...
...
...
... float64"
should be placed before the second paragraph, i.e.:
"Now...
...
...
... test_index]"
Note from the Author or Editor: Thanks for your feedback. The two paragraphs should not be inverted: at the end of page 51, we have just created the income_cat attribute. Then the "Now you are ready..." paragraph creates the training set and test set (strat_train_set and strat_test_set) using stratified sampling.
Finally, we want to check whether or not stratified sampling actually respected the income category proportions of the full set. For this, we start by showing how to measure the proportions on the full set, and we explain that the same can be done to measure the proportions on the test set that we just generated.
However, I understand that it can be confusing to say "let's see if this worked" and not explicitly use what we just generated in the code example, so I will replace "housing" with "strat_test_set" in the second example code on page 52 to make things clearer, and I will replace "in the test set" with "in the full dataset" in the sentence just after the code example, like this:
"""
With similar code you can measure the income category proportions in the full dataset.
"""
Thanks for helping clarify this page!
|
Panos Kourdis |
Oct 15, 2017 |
Nov 03, 2017 |
Printed |
Page 52
1st paragraph |
Dear Aurelien, I will start by thanking you for this amazing book!! I am imbibing it like a magical stream of knowledge. Loving the examples and the writing style. My desire is to understand the hands-on part in its entirety and for that reason when facing challenges I get stuck with my brain unwilling to move past a paragraph, no matter how "insignificant" it might be in the grand scheme of ML.
Chapter 2, Create a Test Set (Ninth Release, 2018-10-12)
The passage that I would love to see clarified is: "The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), and rounding up using ceil..", the hard part for me to comprehend is "divide by 1.5" (most people would either understand why; the other half would probably not even care) why was the number of "1.5" selected for division? Perhaps the text within the parentheses could be expanded to include, " in your future work you will have to pick a number that would be close to the lower cut-off value, with the lower cut-off value in our example being 2", or (if I misunderstood the meaning of 1.5) it would instead include "remember the value of 1.5 because this is the gold-standard number the universe has reserved for this purpose)
My (poor) understanding of this division by 1.5 is based on the context of the median_income that you have outlined "most median income values are clustered around 2 to 5" so you are picking the divisor as a number that is close to 2. Am I right?
I have another idea, that perhaps, it would be easier to "bin" the values per range based on [0.0, 2.0, 3.0, 4.0, 5.0] (while including the out-of-bound values) using pd.cut()?
Note from the Author or Editor: Thanks for your feedback and your kind words, I'm really glad you are enjoying my book!
I was just trying to define some useful strata. If you look at figure 2-8, you see that most incomes are between 1 and 9 (tens of thousands of dollars), with the bulk between 1.5 and 6. It seemed reasonable to define 5 strata, from 0 to 1.5, then 1.5 to 3, then 3 to 4.5, then 4.5 to 6, and finally 6 and above. By dividing the income by 1.5, rounding up and cropping above 5, that's exactly what I get. The following code is equivalent, and would probably have been clearer:
housing["income_cat"] = pd.cut(
housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1, 2, 3, 4, 5])
I'll clarify this paragraph. Thanks again!
|
Tim B. |
Feb 09, 2019 |
May 24, 2019 |
Printed |
Page 57
2nd paragraph |
Hi,
First of all, thanks for this great book! I have been recommending it to all my colleagues who are interested in Machine Learning.
Not sure if the type of error I selected is appropriate, or if this is considered as an error at all; but on page 57 we import scatter_matrix as follows:
from pandas.tools.plotting import scatter_matrix
as of Pandas 0.20, pandas.tools.plotting has been deprecated and pandas.plotting should be used instead.
Note: I'm running Jupyter within the tensorflow/tensorflow:latest-py3 docker container, which comes with the latest versions of the most common data science Python libs already installed.
Reference: https://hub.docker.com/r/tensorflow/tensorflow/
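For reference, the replacement import on Pandas 0.20 and later is simply:
from pandas.plotting import scatter_matrix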
Note from the Author or Editor: Hi Gabriel,
Thanks for your very kind words, I'm glad you are enjoying my book.
Indeed, the scatter_matrix() function was moved in Pandas 0.20. I updated both the Jupyter notebook and the book.
Thanks for your feedback,
Aurélien
|
Gabriel Nieves Ponce |
Nov 11, 2017 |
Jan 19, 2018 |
Printed |
Page 66
Middle in the page |
(1st Edition)
Last sentence of the paragraph below a code block.
"The names can be anything you like."
But actually a step name can't include a double underscore (__). :-)
Note from the Author or Editor: Indeed, the only constraint is that it should not contain double underscores, thanks for pointing it out.
|
Haesun Park |
May 21, 2017 |
Jun 09, 2017 |
Printed |
Page 66
Code sample |
For the custom transformer, the variable for the index of the households feature is named "household_ix". For consistency, I recommend it be named to "households_ix", since the other indices match the pluralization of their respective features (rooms_ix and bedrooms_ix).
Note from the Author or Editor: Good point, thanks. I updated the code to replace household_ix with households_ix.
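For context, a minimal sketch of the renamed index constants (the column positions shown are an assumption based on the chapter's notebook, not taken from this erratum):
# column indices used by the custom transformer
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6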
|
Charley Grossman |
Oct 25, 2018 |
Dec 07, 2018 |
Printed |
Page 67
|
In the fourth paragraph it starts "Now it would be nice if we could feed a Pandas DataFrame directly into our pipeline". This could be confusing because this is actually what we just did a few lines above when we called num_pipeline.fit_transform(housing_num), because housing_num is a Pandas DataFrame. Could be reworded/clarified a bit.
Note from the Author or Editor: Good point! What I meant is that it would be nice to be able to pass a Pandas DataFrame containing non-numerical attributes directly into our pipeline. I'll correct the sentence accordingly, thank you very much for your feedback.
|
Michael Padilla |
Oct 11, 2017 |
Nov 03, 2017 |
PDF |
Page 67
2nd paragraph |
The text reads: "Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the variance so that the resulting distribution has unit variance." I think that it should be "..it divides by the standard deviation..". According to the StandardScaler source code it also divides by the standard deviation and not the variance.
Note from the Author or Editor: Good catch! Of course you are right, first subtract the mean, then divide by the standard deviation, not the variance. Thanks a lot.
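A tiny sketch of the corrected statement (toy data, StandardScaler's default settings):
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_scaled = StandardScaler().fit_transform(X)

# equivalent to subtracting the mean and dividing by the standard deviation (not the variance)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, X_manual))   # True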
|
Anonymous |
Feb 08, 2018 |
Oct 12, 2018 |
Printed |
Page 67
The warning section |
The text reads "Only then you can use them to transform the training set and the test set (and new data)." I think it makes more sense to replace the training set with validation set here?
Note from the Author or Editor: Thanks for your feedback. I was thinking something like this:
scaler = StandardScaler()
scaler.fit(X_train)
scaler.transform(X_train)
scaler.transform(X_validation)
scaler.transform(X_test)
scaler.transform(X_new)
We must only fit the scaler to the training set, but then we can use it to transform all the data (training set, validation set, test set, new data).
However, it's true that very often, we fit the training set and transform it in just one operation:
X_train_scaled = scaler.fit_transform(X_train)
But even then, we are fitting the training set and then using the fitted scaler to transform the training set. It's just that it's happening in one method call instead of two.
I'll file this as "request for clarification", as I don't think it's a mistake, but I'll try to clarify that sentence. Thanks again.
|
Mika Qvist |
May 10, 2018 |
Oct 12, 2018 |
Other Digital Version |
68
3rd paragraph |
Hello Sir,
In the 2nd chapter "End-to-end Machine Learning project" under the section "Get the data" in the subsection "Take a quick look at the data structure" , the lines read as:
"When you looked at the top 5 rows, you noticed that the values in that column were repetitive, which means that it is probably a categorical attribute ".
I believe it should read as:
"When you looked at the top 5 rows, you noticed that the values in the "ocean_proximity" column were repetitive, which means that it is probably a categorical attribute ".
It was a little difficult to spot which column was repetitive from the book. I had to refer to the Jupyter notebook to spot that column.
Note from the Author or Editor: I can see how this can be confusing, thanks for pointing it out. Yes, I replaced "in that column were repetitive" with "in the `ocean_proximity` column were repetitive".
|
Navin Kumar |
May 27, 2017 |
Jun 09, 2017 |
Printed |
Page 68
code snippet under "And you can run the whole pipeline simply:" |
Hello Mr. Geron,
In Chapter 2, in the very last bit of the 'Transformation Pipelines' section, you run the line:
>>> housing_prepared = full_pipeline.fit_transform(housing)
But when I actually attempt to run this code in the notebook, I get the following error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
I can't seem to find a fix to this error, more specifically, I'm not quite sure where it's getting 3 arguments from.
I didn't make any changes anywhere in the notebook code and have just been running through the blocks sequentially, following along with the text. Why would it all of a sudden break at this point?
Any help would be greatly appreciated!
Thank you!!
Note from the Author or Editor: Thanks for your feedback, and my apologies for the late response, I've had a very busy summer.
The LabelEncoder and LabelBinarizer classes were designed for preprocessing labels, not input features, so their fit() and fit_transform() methods only accept one parameter y instead of two parameters X and y. The proper way to convert categorical input features to one-hot vectors should be to use the OneHotEncoder class, but unfortunately it does not work with string categories, only integer categories (people are working on it, see Pull Request 7327: https://github.com/scikit-learn/scikit-learn/pull/7327). In the meantime, one workaround *was* to use the LabelBinarizer class, as shown in the book. Unfortunately, since Scikit-Learn 0.19.0, pipelines now expect each estimator to have a fit() or fit_transform() method with two parameters X and y, so the code shown in the book won't work if you are using Scikit-Learn 0.19.0 (and possibly later as well). Avoiding such breakage is the reason why I specified the library versions to use in the requirements.txt file (including scikit-learn 0.18.1). A temporary workaround (until PR 7327 is finished and you can use a OneHotEncoder) is to create a small wrapper class around the LabelBinarizer class, to fix its fit_transform() method, like this:
class PipelineFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(PipelineFriendlyLabelBinarizer, self).fit_transform(X)
I'm updating the notebook for chapter 2 to make this clear.
Thanks again for your feedback. :)
|
Anonymous |
Aug 18, 2017 |
Nov 03, 2017 |
Printed |
Page 68
2nd paragraph |
Hello,
The text says on page 68, second paragraph: "...(it also has a fit_transform method that we could have used instead of calling fit() and then transform()).
The code for the pipeline that 'it' refers to is given on the previous page as
housing_num_tr = num_pipeline.fit_transform(housing_num)
The first passage quoted above is therefore incorrect as 'fit_transform' was in fact used. It's only mildly confusing when reading :-)
Thanks,
Michael
Note from the Author or Editor: Good catch, it was indeed confusing. I fixed the sentence like this:
The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a `StandardScaler`, which is a transformer, so the pipeline has a `transform()` method that applies all the transforms to the data in sequence (and of course it also has a `fit_transform()` method, which is the one we used).
Thanks!
|
Michael Heitmeier |
Jan 03, 2019 |
Mar 08, 2019 |
Printed |
Page 69
First Code Snippet |
I was working through the End-to-End Machine Learning Project in Chapter 2 and ran into an issue with the CategoricalEncoder. I kept getting an error that I couldn't import it despite having the most recent version of python. A quick internet search revealed that they have considered no longer supporting this functionality, so I couldn't find a place to update my package with this functionality. I was able to get the code working by looking up the previous code involving the LabelBinarizer, and then using the errata on a previous post about this page. Hope you can address this in future editions.
Thanks for a great book.
- Weston Ungemach
Note from the Author or Editor: Thanks for your feedback and your kind words. One-hot encoding is a bit of a mess right now in Scikit-Learn: the LabelBinarizer is really only meant for labels, not for input features, even though it's possible to use it by hacking a bit. The CategoricalEncoder from the upcoming 0.20 version of Scikit-Learn used to work well (I copied it in my notebooks and it was fine), but there's a discussion going on right now about replacing it with another class, which may be named OneHotEncoder (replacing the existing one) or DummyEncoder, or perhaps something else. See the discussion here:
https://github.com/scikit-learn/scikit-learn/issues/10521
In the meantime, you can use the code from the notebook in chapter 2. It works well. If you need to use it in your project, just save it to a file such as categorical_encoder.py and import from that file. Then when the Scikit-Learn team decides what to do in 0.20, you can probably do a simple update of the imports, class name and parameter names, but the functionality should remain the same... I hope!
I will definitely address this in future editions, but it's hard to know in what direction they will go.
Hope this helps,
Aurélien
|
Weston Ungemach |
Mar 11, 2018 |
Oct 12, 2018 |
PDF |
Page 69
Last line of page |
room_ix should be rooms_ix
- bedrooms_per_room = X[:, bedrooms_ix] / X[:, room_ix]
----
+ bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
Note from the Author or Editor: Great catch. Thanks. This error is now fixed.
Best regards,
Aurélien
|
Miles Thibault |
Dec 18, 2016 |
Mar 10, 2017 |
ePub |
Page 72
2nd paragraph (code) |
I believe that the denominator in the equation below is incorrect. Should be dividing by households rather than population.
ERROR -> housing["rooms_per_household"] = housing["total_rooms"]/housing["population"]
CORRECT -> housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
Note from the Author or Editor: Thanks a lot for your feedback. I fixed the error, it will disappear from the electronic versions shortly, and the printed copy will not contain it.
Best regards,
Aurélien
|
Liam Culligan |
Mar 06, 2017 |
Mar 10, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 73
4th paragraph ( last line) |
Hello Sir,
In the 2nd chapter "End-to-end Machine Learning project" under the section
"Get the data" in the subsection "Create a test set" - The lines read as :
"Suppose you chatted with experts who told you that the median income is a very
important attribute to predict median housing prices. You may want to ensure that
the test set is representative of the various categories of incomes in the whole dataset.
Since the median income is a continuous numerical attribute, you first need to create
an income category attribute. Let’s look at the median income histogram more closely
(see Figure 2-9):"
The last line (see Figure 2-9) I believe should be as :
( see Figure 2-8)
Because the subsequent line says :
"Most median income values are clustered around 2-5 (tens of thousands of dollars),
but some median incomes go far beyond 6".
In Figure 2-9, the median income is capped at 5. In Figure 2-8 the median income goes beyond 6.
I am sorry the page numbers do not seem to match. Hence the long message. I am using a pre-draft version from Safari online.
Note from the Author or Editor: Good catch, you are correct, thanks a lot. I fixed this. Indeed, instead of (see Figure 2-9), the text should read (see Figure 2-8).
|
Navin Kumar |
May 27, 2017 |
Jun 09, 2017 |
PDF |
Page 73
15th line |
From Scikit-learn 0.18, train_test_split is included in sklearn.model_selection.
Note from the Author or Editor: Thanks a lot for your feedback. This error is now fixed.
Best regards,
Aurélien
|
Daisuke |
Oct 31, 2016 |
Mar 10, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 83
Paragraph 1 in Implementing cross validation box |
In the print version, the cross_val_score() function is not defined until page 84. If you are working through a notebook as you go, this created a function not defined error.
"same thing as the preceding cross_val_score() code".
"preceding" should be "following"
Note from the Author or Editor: Good catch, thank you. I think the text was initially in the right order, but the "Implementing Cross-Validation" section had to be moved around for pagination reasons. I fixed the text to make things clearer:
"""
Occasionally you will need more control over the cross-validation process than what Scikit-Learn provides off-the-shelf. In these cases, you can implement cross-validation yourself; it is actually fairly straightforward. The following code does roughly the same thing as Scikit-Learn's `cross_val_score()` function, and prints the same result:
"""
|
Stephen Jones |
Apr 28, 2017 |
Jun 09, 2017 |
Printed |
Page 83
|
On page 69, 83 and 124, it is said that cross-validation can be used to validate a model.
But in method cross_validation_score() on page 83, the model itself (sgd_clf) is not evaluated at all. It is cloned to clone_clf and modified (by fit method). So the evaluated model is a new model, not the one passed into cross_validation_score.
To summarize, as per my understanding, cross-validation is used to evaluate learning algorithms and their hyperparameters. To validate a model, we should use the test set.
Thank you.
Note from the Author or Editor: Thanks for your feedback.
Indeed, you are correct, there is some ambiguity when I say "evaluate a model": in some cases I mean "evaluate the choice of model architecture & hyperparameters" and in other cases I mean "evaluate an actual trained model, with its architecture & trained parameters".
The former (cross-validation) is typically done on the training set: in K-fold CV, the training set is split into K pieces, and the same model *architecture* is trained K times (on the training set minus piece #i, for i in [1, K]), and then evaluated on the piece it was not trained on.
The latter (evaluation of a trained model) is typically done on the validation set (when not using cross-validation) for model selection (which should be called model architecture & hyperparameter selection), or on the test set (to evaluate the generalization error).
I'll see what I can do to clarify this, thanks a lot for bringing this problem to my attention.
|
Donald Zhang |
Feb 03, 2019 |
Mar 08, 2019 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 86
Code sample |
>>> precision_score(y_train_5, y_pred)
Should be
>>> precision_score(y_train_5, y_train_pred)
Note from the Author or Editor: Good catch, thanks. I tested every line of code before adding it to the book, but I guess I must have renamed this variable in the notebook at one point, and when I updated the book I missed a couple occurrences. Sorry about that!
Note that there's the same problem a few lines below:
>>> f1_score(y_train_5, y_pred)
should be:
>>> f1_score(y_train_5, y_train_pred)
I fixed these issues, but it may take a while for them to propagate to the digital version.
|
Stephen Jones |
Apr 28, 2017 |
Jun 09, 2017 |
Mobi |
Page 86
1st paragraph (Chapter 2, Frame the Problem, Paragraph 4) |
The published book states "More specifically, this is a multivariate regression problem since the system will use multiple features to make a prediction (it will use the district’s population, the median income, etc.). In the first chapter, you predicted life satisfaction based on just one feature, the GDP per capita, so it was a univariate regression problem.”
Instead of multivariate regression, it should be multiple regression. Multivariate regression is where there is more than 1 dependent variable, while multiple regression refers to more than 1 predictor/independent variable - which is this case.
Note from the Author or Editor: Oops, indeed you are right, I should have said "multiple", not "multivariate", I just fixed this.
Thanks!
|
Sean |
Nov 13, 2018 |
Dec 07, 2018 |
Printed |
Page 89
body of plot_precision_recall_vs_threshold() function |
super minor issue. the body of the plotting function sets the location of the legend to "upper left" while the image shows the legend location at "center left".
for fix, simply change:
plt.legend(loc='upper left')
to:
plt.legend(loc='center left')
PS - great book btw :)
Note from the Author or Editor: Good catch! :) Indeed, for some reason I changed the code from "center left" to "upper left" at one point, and I did not update the figure, not sure why. I'll revert to "center left", thanks for pointing this out.
Cheers,
Aurélien Géron
|
Anonymous |
Jun 17, 2017 |
Aug 18, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 93
code sample |
plt.legend(loc="bottom right")
should be
plt.legend(loc="lower right")
Note from the Author or Editor: Good catch, thanks.
|
Stephen Jones |
Apr 28, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 95
code sample |
>>> sgd_clf.classes[5]
should be
>>> sgd_clf.classes_[5]
Note from the Author or Editor: Good catch, thanks.
|
Stephen Jones |
Apr 28, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 98
code sample |
plot_digits function is not defined in the book, only in the corresponding notebook. This is problematic when working through the book only.
Note from the Author or Editor: The `plot_digits()` function is really uninteresting, it just plots an image using Matplotlib. I preferred to leave it out of the book to avoid drowning the reader in minor details. However, I agree that I should have added a note about it, for clarity. I just added the following note:
"(the `plot_digits()` function just uses Matplotlib's `imshow()` function, see this chapter's Jupyter notebook for details)"
|
Stephen Jones |
Apr 28, 2017 |
Jun 09, 2017 |
Printed |
Page 100
the last paragraph |
The last sample code on page 100,
>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)
>>> f1_score(y_train, y_train_knn_pre, average='marco'),
may be corrected to
>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
>>> f1_score(y_multilabel, y_train_knn_pred, average='marco').
Note from the Author or Editor: Thanks for your feedback, you are absolutely right (with one minor tweak: it's "macro", not "marco"). I updated the book and the jupyter notebook.
Cheers,
Aurélien
|
Anonymous |
Jul 05, 2017 |
Aug 18, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 101
code block in bottom third of page |
In the code example, the random noise generated for the training set is overwritten with noise for the test set before it is applied:
noise = rnd.randint(0, 100, (len(X_train), 784))
noise = rnd.randint(0, 100, (len(X_test), 784))
X_train_mod = X_train + noise
X_test_mod = X_test + noise
Just switch the second and third line to get it right (as it is in the notebook on github):
noise = rnd.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = rnd.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
Note from the Author or Editor: Good catch, thanks! Indeed, it should be written as you indicate, just like in the notebook. I'm not sure how this error happened. I fixed it now (but it may take a while to propagate to the digital versions).
|
Lars Knipping |
May 09, 2017 |
Jun 09, 2017 |
Printed |
Page 103
4th exercise |
1st edition, 1st release
4th exercise in chapter 3,
https://spamassassin.apache.org/publiccorpus/
-->
https://spamassassin.apache.org/old/publiccorpus/
Note from the Author or Editor: Good catch, thanks. Indeed, the old link is now broken, it should be replaced with:
http://spamassassin.apache.org/old/publiccorpus/
I'll update the book.
Cheers,
Aurélien
|
Haesun Park |
Jul 06, 2017 |
Aug 18, 2017 |
Printed |
Page 107
1,3 |
For consistency, the Greek letters θ and Θ, since they represent a vector and a matrix quantity, respectively, should be boldface. Unless there is a literature convention, or one specified in the book, that I am missing.
Note from the Author or Editor: Thanks for your feedback. You are right that these thetas should be in bold font since they represent vectors and matrices. I actually wrote the equations in the book using LatexMath, and I did write \mathbf{\theta} or \mathbf{\Theta} everywhere (except when they represent scalars, such as \theta_0, \theta_1, and so on), but it seems that the bold font did not always show up in the rendering phase, for some reason. Try rendering \theta \mathbf{\theta} \Theta \mathbf{\Theta} using latex2png.com, and you will see that the second theta is not rendered in bold font. I suspect that not all fonts support bold font thetas, and O'Reilly used a rendering tool based on such a font.
This was partly solved by converting equations to MathML, but it's a tedious manual process, and it seems we have missed a few. I will continue to try to fix all missing bold fonts. In the meantime I hope readers will not be too confused, hopefully the text makes it clear that we are talking about vectors and matrices.
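For readers typesetting these equations themselves, one workaround that generally does produce a bold Greek theta is amsmath's \boldsymbol (shown here as a sketch, not the book's actual source):
% with \usepackage{amsmath} loaded, \boldsymbol emboldens Greek letters where \mathbf often does not:
\boldsymbol{\theta} = \begin{pmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{pmatrix}^T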
Thanks again!
|
Panos Kourdis |
Oct 20, 2017 |
Nov 03, 2017 |
Printed |
Page 109
4th paragraph |
array([[4.21509616],[2.77011339]])
should be
array([[3.86501051],[3.13916179]])
Note from the Author or Editor: Thanks for your feedback. Yes, I tried to make the Jupyter Notebooks' output constant across multiple runs, but I forgot a few "np.random.seed(42)" and "random_state=42" and "tf.set_random_seed(42)" here and there, so unfortunately the outputs vary slightly across multiple runs. I'm fixing the notebooks now, so that they will actually be constant, but there's no way to make them output the same thing as the first edition of the book. So I'm fixing the book so that at least the next reprints will be consistent with the (stable) notebooks. Arrrrgh...
That said, the differences are quite small in general, so I believe it should be possible for readers to follow along despite the minor differences.
|
Anonymous |
Jun 05, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, Other Digital Version |
Page 109
Below the first code block |
1st edition, 1st release
y = 4 + 3x_0 + Gaussian noise
Should be
y = 4 + 3x_1 + Gaussian noise
Note from the Author or Editor: Good catch, it should indeed be x_1 instead of x_0. Fixing the book now.
Thanks!
Aurélien
|
Haesun Park |
Jul 05, 2017 |
Aug 18, 2017 |
Printed |
Page 110
The first paragraph in Computational Complexity section |
The inverse of the dot product of X.transpose and X is an n by n matrix. Shouldn't it be (n+1) x (n+1)? Because X is an m x (n+1) matrix.
Note from the Author or Editor: Good catch! Yes, X^T X is an (n+1) x (n+1) matrix, not n x n. Fortunately, it does not change the computational complexity of the normal equation, it's still between O(n^2.4) and O(n^3).
By the way, I rewrote part of this section for the next release because I oversimplified it: in particular, Scikit-Learn's LinearRegression class uses an algorithm based on SVD (matrix decomposition) rather than the Normal Equation: theta = np.linalg.pinv(X).dot(y) (this uses the Moore-Penrose pseudoinverse, which is based on SVD).
SVD has a computational complexity of O(m n^2), so it's significantly better than the Normal Equation (but it does not change the conclusions of this section: this class does not support out-of-core learning, training is linear with regard to the number of instances (m) but quadratic with regard to the number of features (n), so it's slow when there are very many features, e.g., for large images).
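A minimal sketch contrasting the two approaches on toy data (illustrative only, not the book's exact code):
import numpy as np

np.random.seed(42)
m, n = 100, 3
X = np.random.rand(m, n)
X_b = np.c_[np.ones((m, 1)), X]                                   # m x (n+1) design matrix
y = X_b.dot(np.array([[4.0], [3.0], [2.0], [1.0]])) + 0.1 * np.random.randn(m, 1)

theta_normal = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)    # Normal Equation: inverts an (n+1) x (n+1) matrix
theta_svd = np.linalg.pinv(X_b).dot(y)                            # SVD-based pseudoinverse
print(np.allclose(theta_normal, theta_svd))                       # True, up to numerical precision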
Thanks a lot for your feedback!
|
Anonymous |
Dec 21, 2017 |
Oct 12, 2018 |
Printed |
Page 110
Referring to the whole section 'The Normal Equation' |
The normal equation is stated to determine the parameters of the linear regression model:
\theta = (X^T X)^{-1} X^T y
Later, the computational complexity of calculating the inverse is mentioned. But, there is an alternative not mentioned in the book. One can determine \theta as the solution of the linear equation
(X^T X) \theta - X^T y = 0
This should be explained in the book, as well. Later, in the section “Linear regression with TensorFlow” the same mistake is made. In case this is on purpose, because the calculation shows well how to use TensorFlow, you should at least mention it.
Best regards and thank you for the great book,
Niclas
Note from the Author or Editor: Indeed, you are right, the Normal Equation is not the only way to determine the parameters of the Linear Regression model. I updated the book to also mention the alternative you propose, which leads to using the Moore-Penrose pseudoinverse of X. This in turn requires computing the SVD of X (see chapter 8 for the SVD). This solution is both faster to compute, and it supports collinear data (e.g., a dataset where one or more features are linear combinations of other features), which the normal equation does not support. This is what Scikit-Learn actually uses. Thanks for your suggestion!
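For completeness, a sketch of the submitter's suggestion, i.e. solving the normal equations as a linear system rather than inverting X^T X (it assumes a design matrix X_b and targets y as in the chapter):
import numpy as np

# solve (X^T X) theta = X^T y directly, without forming the inverse
theta = np.linalg.solve(X_b.T.dot(X_b), X_b.T.dot(y))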
|
Niclas von Caprivi |
Aug 24, 2018 |
Dec 07, 2018 |
Printed |
Page 118, 119
Last line page 118. Label for figure 4-10 |
The online code suggests that it should be the first 20 steps of SGD, not the first 10 steps.
Note from the Author or Editor: You're right, it's the first 20 steps, not the first 10 steps. Fixed, thanks! :)
|
Calvin Huang |
Jan 05, 2018 |
Oct 12, 2018 |
Printed |
Page 118
First full paragraph |
You refer to the process of gradually reducing the learning rate as simulated annealing.
Other sources use this term to refer to an algorithm that occasionally makes "uphill" moves (with a probability decreasing over time).
I see the analogy, but I think you're using the terminology in a nonstandard way here.
Note from the Author or Editor: Thanks for your feedback. Indeed, it's an analogy, not an identity. I updated the sentence like so:
This process is akin to simulated annealing, an algorithm inspired by the process of annealing in metallurgy, where molten metal is slowly cooled down.
For a more detailed explanation of the link between gradient descent using a learning schedule and simulated annealing, see:
http://leon.bottou.org/publications/pdf/nimes-1991.pdf
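For readers curious what a learning schedule looks like in code, here is a minimal sketch; the schedule form and the hyperparameter values t0 and t1 are illustrative, not the book's exact ones:
def learning_schedule(t, t0=5, t1=50):
    # the learning rate shrinks gradually as the iteration count t grows
    return t0 / (t + t1)
# e.g. inside an SGD loop: eta = learning_schedule(epoch * m + i)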
|
Peter Drake |
Mar 09, 2018 |
Oct 12, 2018 |
Printed |
Page 125
Code |
When defining the polynomial_regression pipeline, there's a typo... it's currently Pipeline(( ...)) while I believe it should be Pipeline([ ...]).
Great book. Thank you!
Michael
Note from the Author or Editor: Thanks Michael, indeed Pipelines take lists of tuples, not tuples of tuples. Previous versions of Scikit-Learn would actually accept both, hence the fact that I did not catch this error earlier, but version 0.19 has become strict.
|
Michael Padilla |
Oct 20, 2017 |
Nov 03, 2017 |
Printed |
Page 134
Code snippet at the top |
An omission... in this code you refer to an ndarray called X_train_poly_scaled that isn't defined anywhere (though it's easy to figure out what it should be). In your notebook it's defined naturally as
poly_scaler = Pipeline([
("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
("std_scaler", StandardScaler()),
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)
but this isn't given in the book although it's referred to. Not a biggie, but thought you should know. Thanks!
Note from the Author or Editor: Thanks for your feedback. Indeed, I often left some code details out of the book in order to keep it short and focused, but perhaps sometimes I went a bit too far. In this particular case, I think you are right that I should at least say that the data is extended with polynomial features and then scaled (or I should add the few lines of code that define X_train_poly_scaled and X_val_poly_scaled). Since the code example is meant to illustrate early stopping, I'd like to keep it focused so I think I'll go for the first option (a quick explanation in the text).
Thanks a lot!
|
Michael Padilla |
Oct 23, 2017 |
Nov 03, 2017 |
Mobi |
Page 137.1
second code block |
- >>> some_data_prepared = preparation_pipeline.transform(some_data)
------
+ >>> some_data_prepared = full_pipeline.transform(some_data)
Note from the Author or Editor: Thanks for your feedback. This error is now fixed.
Best regards,
Aurélien
|
Michael Ansel |
Jan 15, 2017 |
Mar 10, 2017 |
Printed |
Page 144
Fig 4-25 |
The image is missing entirely - I have found the same problem in several places
P 139 fig 4-22
P 149 fig 5-4
P 224 fig 8-12
P. 296 fig 11-5
P 300 fig 11-6
First edition fourth release
Note from the Author or Editor: Thanks for your feedback. Yikes! This is bad, I'm really sorry about this. I have reported this problem to O'Reilly, I will get back to you ASAP.
|
Kelly McDonald |
Dec 25, 2017 |
Jan 19, 2018 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 147
last sentence |
1st edition, 1st release
In a paragraph below Figure 5-4,
"...(using the LinearSVC class with C=0.1 and the hinge loss..."
should be
"...(using the LinearSVC class with C=1 and the hinge loss..."
Note from the Author or Editor: Good catch, it should indeed be C=1, not C=0.1. I fixed the book.
Thanks!
Aurélien
|
Haesun Park |
Jul 07, 2017 |
Aug 18, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 148
last line of first code sample |
svm_clf.fit(X_scaled)
should be
svm_clf.fit(X)
The pipeline is performing the scaling and there is no "X_scaled" variable elsewhere in the sample.
Note from the Author or Editor: Good catch, thanks. Indeed, it should be:
svm_clf.fit(X)
rather than:
svm_clf.fit(X_scaled)
I tested every code example in the book, but it seems that a few times I updated the notebooks and forgot to update the book. I just wrote a script to compare the code in the notebooks with the code examples in the book, and I'm currently going through every chapter to fix the little differences. This is one of them.
|
Adam Chelminski |
May 24, 2017 |
Jun 09, 2017 |
Printed |
Page 148
first set of code |
This code returns an error:
iris = datasets.load_iris()
X=iris["data"][:,(2,3)] #only petal length and width
y=(iris["target"]==2).astype(np.float64) #import only IrisVirginica
svm_clf = Pipeline((
("scaler", StandardScaler()),
("linear_svc", LinearSVC(C=1,loss="hinge")),
))
svm_clf.fit(X,y)
error:
~\Miniconda3\envs\MyEnv\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
224 # transformer. This is necessary when loading the transformer
225 # from the cache.
--> 226 self.steps[step_idx] = (name, fitted_transformer)
227 if self._final_estimator is None:
228 return Xt, {}
TypeError: 'tuple' object does not support item assignment
Note from the Author or Editor: Thanks for your feedback.
The code actually works fine up to Scikit-Learn 0.18, but then in Scikit-Learn 0.19 (which did not exist when I wrote the book), Pipelines must now be created with a list of tuples instead of a tuple of tuples. I updated the Jupyter notebooks to ensure that the code now works with Scikit-Learn 0.19. Basically, use this code instead (note the square brackets):
svm_clf = Pipeline([
("scaler", StandardScaler()),
("linear_svc", LinearSVC(C=1,loss="hinge")),
])
Cheers,
Aurélien
|
Justin |
Sep 21, 2017 |
Nov 03, 2017 |
Printed |
Page 148
Figure 5-1 |
The image for figure 5-1 is missing.
Also figure 14-8 p397 the image is missing.
Can also confirm the missing figure images as reported by Kelly Dec. 25 2017.
First edition fourth release.
Note from the Author or Editor: Hi David,
Thanks a lot for your feedback, I am so sorry about these issues. I wish I had seen your message earlier, but I was on vacation, my apologies for the delay. I have reported the problem to O'Reilly, I will get back to you ASAP. I'm sure they will fix the problem very quickly.
Aurélien
Edit: the problem is now fixed. If you purchased the book via Amazon, O'Reilly told me that you can request a replacement copy, and it will be sent to you free of charge. Again, I'm really sorry about this problem, and I hope you are enjoying the book despite this issue.
|
David Thomas |
Jan 04, 2018 |
Jan 19, 2018 |
Printed |
Page 148
top of page |
Missing figures:
5-1
5-4
Also figures:
4-22
4-25
Note from the Author or Editor: Thanks for your feedback. This problem was due to a printer error in December 2017, and it was corrected in a reprint in January 2018; an O'Reilly representative will contact you for more details about the copy you have.
|
Edison de Queiroz Albuquerque |
Mar 26, 2018 |
Apr 04, 2018 |
Printed |
Page 151
Equation 5-1 |
1st edition, 1st release
In equation 5-1,
I suggest
\phi_{\gamma} (x, l) = ... or \phi (x, l) = ...
is better than
\phi \gamma (x, l) = ...
Note from the Author or Editor: Good catch, thanks. I fixed this a few weeks ago, it should be okay in the next reprints.
|
Haesun Park |
Jul 10, 2017 |
Aug 18, 2017 |
PDF, Mobi |
Page 151
last paragraph |
In chapter 5, the moons toy dataset is not introduced. The first appearance of moons is in the phrase
"Let's test this on the moons dataset", yet there is no clarification before or after about what the make_moons call does.
I didn't check the code examples (maybe there's a docstring there), but if you are not at the computer it is difficult to follow.
As you enjoy the problems and the solutions, just adding something like
"The make_moons function creates a set of data points with the shape of two interleaving circles. Check the sklearn documentation for more information."
could help a lot.
Cheers,
JJ.
Note from the Author or Editor: Thanks for your suggestion. Indeed, I just pointed to the figure 5-6 where the dataset is represented, but this was not enough. I added the following sentence:
The `make_moons()` function creates a toy dataset for binary classification: the data points are shaped as two interleaving half circles as you can see in figure 5-6.
Thanks again!
Aurélien
|
Joaquin Bogado |
Feb 08, 2018 |
Oct 12, 2018 |
Printed |
Page 151
Code at bottom |
Code at the bottom of the page should include the call to make_moons(); otherwise X and y will be fit from the iris dataset when working through the chapter in order.
Note from the Author or Editor: Good catch, thanks!
Indeed, the following line was missing in the code:
X, y = make_moons(n_samples=100, noise=0.15)
I just fixed this. Thanks again.
|
Anonymous |
Jul 18, 2019 |
|
Printed |
Page 160
5th line of that page |
The value of the vector b is supposed to be -1, I think. Since you introduced -1*t into A, if the vector b is made of 1s then the whole formula after substitution will be
t(wx+b) >= -1, according to my calculation.
And I really think some of the dot products and matrix-vector multiplications are mixed up. Is it a convention in machine learning to use the dot product to represent matrix multiplication?
Note from the Author or Editor: Great catch! The vector b should be full of -1 instead of 1.
The constraints are defined as: p^T a^(i) <= b^(i), for i=1, 2, ..., m
If b^(i) = -1, we can rewrite the constraints as: p^T a^(i) <= -1
Since a^(i) = -t^(i) x^(i), the constraints are: -t^(i) p^T x^(i) <= -1
Which we can rewrite to: t^(i) p^T x^(i) >= 1
For positive instances, t^(i) = +1, and for negative instances t^(i) = -1.
So for positive instances: p^T x^(i) >= 1, which is what we want.
For negative instances: -p^T x^(i) >= 1, therefore: p^T x^(i) <= -1, which is also what we want.
Thanks a lot for your feedback, I fixed the error for the next release.
|
Calvin Huang |
Jan 07, 2018 |
Oct 12, 2018 |
Printed |
Page 162
1st paragraph |
(1st edition, 5th release)
"The resulting vector p will contain the bias term b = p_0 and the feature weights w_i = p_i for i = 1, 2, ⋯, m"
But p is an (n+1)-dimensional vector, not (m+1)-dimensional, so it should be "for i = 1, 2, ⋯, n".
Thanks.
Note from the Author or Editor: As always, you are right, Haesun, thanks a lot. Fixed (replaced m with n).
|
Haesun Park |
Jan 30, 2018 |
Oct 12, 2018 |
Printed |
Page 162
euqation 5-9 |
I believe the linear transformation and the dot product are being confused here from a math perspective. It's supposed to be a dot b, not a transpose dot b.
Note from the Author or Editor: Thanks for your feedback. Indeed, it's probably better to replace `a^T b` with `a.b` in this section. In many cases in Machine Learning, it's more convenient to represent vectors as column vectors (i.e., 2D arrays with a single column), so they can be transposed, used like matrices, and so on. Of course if `a` and `b` are column vectors, then `a^T b` is a 2D array containing a single cell whose value is equal to the dot product of the (1D) vectors corresponding to `a` and `b`. In other words, the result is identical, except for the dimensionality: if `a` and `b` are regular vectors, then `a.b` is a scalar, but if `a` and `b` are column vectors, then `a^T b` is a one-cell matrix. For example:
>>> import numpy as np
>>> np.array([2,3]).dot(np.array([5,7])) # a.b
31
>>> np.array([[2],[3]]).T.dot(np.array([[5],[7]])) # a^T b
array([[31]])
I plan to cleanup the whole book regarding this issue, not just chapter 5 (but it may take a bit of time).
Thanks again!
|
Calvin Huang |
Jan 06, 2018 |
Oct 12, 2018 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 164
Hinge Loss |
In the last sentence of Hinge Loss, "using any subderivative at t = 0" should be "using any subderivative at t = 1".
Note from the Author or Editor: Good catch! It should indeed be "any subderivative at t=1" instead of "any subderivative at t=0".
Thanks!
|
Hiroshi Arai |
May 30, 2017 |
Jun 09, 2017 |
Printed, |
Page 166
Equation 5-12 |
1st Edtion 5th Release.
In eq. 5-12, "1-t^{(i)}\hat{w}^T" should be "t^{(i)}-\hat{w}^T" like eq. 5-7.
Thanks.
Note from the Author or Editor: Great catch Haesun, thanks a lot. Indeed, the equation should contain t^{(i)} - \hat{w}^T (three times). The corrected equation, in latexmath format, is:
\hat{b} = \dfrac{1}{n_s} \sum\limits_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \hat{\mathbf{w}}^T \cdot \phi(\mathbf{x}^{(i)}) \right)
= \dfrac{1}{n_s} \sum\limits_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \left( \sum\limits_{j=1}^{m} \hat{\alpha}^{(j)} t^{(j)} \phi(\mathbf{x}^{(j)}) \right)^T \cdot \phi(\mathbf{x}^{(i)}) \right)
= \dfrac{1}{n_s} \sum\limits_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \sum\limits_{\substack{j=1 \\ \hat{\alpha}^{(j)} > 0}}^{m} \hat{\alpha}^{(j)} t^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \right)
|
Haesun Park |
Mar 01, 2018 |
Oct 12, 2018 |
Printed |
Page 172
Paragraph right before sectoin Computational Complexity |
It seems that the following paragraph should be part of the caution (scorpion) section, which discusses greedy algorithms and the reasoning behind them, instead of part of the main text:
"Unfortunately, finding the optimal tree is known to be an NP-Complete problem:2 it requires O(exp(m)) time, making the problem intractable even for fairly small training sets. This is why we must settle for a “reasonably good” solution."
Note from the Author or Editor: That's a good point: I moved this sentence into the caution section.
Thanks for your feedback,
Aurélien
|
Jiaqi Liu |
Apr 15, 2018 |
Oct 12, 2018 |
PDF, |
Page 175
Eq. 6-3 |
In Eq. 6-3, the natural logarithm is used, but entropy for information gain uses the binary logarithm. Scikit-Learn does too (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_utils.pyx#L86).
So, I recommend changing eq. 6-3 to log_2(..) instead of log(..), and fig 6-1's entropy calculation to 0.445 instead of 0.31.
Thanks.
Note from the Author or Editor: Good point, I just fixed this mistake, thanks a lot!
Note that it does not change the resulting tree, since the value of x that maximizes a function f(x) also maximizes f(x)/log(2) (where "log" denotes the natural logarithm).
Entropy originated in thermodynamics, where the natural log is used. It later spread to other domains, including Shannon's information theory, where the binary log is used, and therefore the entropy can be expressed as a number of bits. In TensorFlow, the softmax_cross_entropy_with_logits() function uses the natural log rather than the binary log. Its value is just used for optimization (the optimizer tries to minimize it), so it does not matter whether they use the binary log or the natural log. If you wanted to get a number of bits, you would have to divide the result by log(2).
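To see both values side by side, here is a quick NumPy check; the class counts [49, 5] are assumed for the node in Figure 6-1:
import numpy as np
counts = np.array([49, 5])               # assumed class counts for the node
p = counts / counts.sum()
entropy_nat = -np.sum(p * np.log(p))     # natural log  -> about 0.31
entropy_bits = -np.sum(p * np.log2(p))   # binary log   -> about 0.445
print(entropy_nat, entropy_bits, entropy_nat / np.log(2))  # last value equals entropy_bits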
By the way, if you are interested, I did a video about entropy, cross-entropy and KL-divergence: https://youtu.be/ErfnhcEV1O8
Thanks again,
Aurélien
|
Haesun Park |
Mar 25, 2018 |
Oct 12, 2018 |
Printed |
Page 188
the paragraph before the Random Patches and Random Subspaces |
Original phrase
Page 188: Chapter 7: Ensemble Learning and Random Forests
"has a 60.6% probability of belonging to the positive class (and 39.4% of belonging to the positive class):"
The words "positive class" appear twice. If 39.4% is the probability of being in the positive class, then I think 100% - 39.4%, which is 60.6%, should be the probability of being in the negative class.
Which number is for the negative class and which one is for the positive class, then? Please help, thank you.
Note from the Author or Editor: Good catch, thanks! Indeed, the sentence should be:
"""
For example, the oob evaluation estimates that the first training instance has a 68.25% probability of belonging to the positive class (and 31.75% of belonging to the negative class):
"""
|
Ekarit Panacharoensawad |
Jul 13, 2017 |
Aug 18, 2017 |
Printed |
Page 193
Figure 7-8 |
1st edition,
In figure 7-8, the titles are learning_rate = 0 and learning_rate = -0.5.
I think they should be learning_rate = 1 and learning_rate = 0.5.
Why did you use learning_rate - 1 for the titles?
Thanks
Note from the Author or Editor: Nice catch, that's indeed a mistake. I just fixed it, future reprints and digital editions will be better thanks to you! :)
|
Haesun Park |
Sep 01, 2017 |
Nov 03, 2017 |
Printed, ePub |
Page 193
Equation 7-1 |
In the definition of r_j, the denominator is given as the sum of the weights, but this sum is always 1. The weights are initialized so they sum to one (just before equation 7-1), and then normalized again after any update (just below equation 7-3) so they again sum to one.
Note from the Author or Editor: Thanks for your feedback. Indeed, the denominator is always equal to 1, so I could remove it in Equation 7-1. I remember hesitating to do so, but I chose not to because I wanted to show that r_j represents the weighted error rate, and when people read "rate", I think they expect a numerator and a denominator. However, I think I will add a note saying that the denominator is always equal to 1.
Cheers,
Aurélien
|
Glenn Bruns |
Sep 16, 2017 |
Nov 03, 2017 |
Printed |
Page 213
Equation 8-1 |
(1st Edition)
In Equation 8-1, V^T should be V.
This is often confused, because the svd() function actually returns V^T, not V.
So, I suggest to change code below Eq 8-1
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]
c2 = V.T[:, 1]
-->
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]
c2 = Vt.T[:, 1]
On the next page (p214), in the first sentence,
"the matrix composed of the first d columns of V^T"
should be
"the matrix composed of the first d columns of V"
Thanks
Note from the Author or Editor: Good catch, thanks!
Here is the list of changes I just made, which includes your list plus a couple more fixes:
* Top of page 213, "where V^T contains all the principal components" was changed to "where V contains all the principal components".
* Equation 8-1: replaced V^T with V.
* In all code examples, replace V with Vt. This includes 3 replacements in the code on page 213, and 1 replacement in the first code example on page 214.
* Top of page 214: "the matrix composed of the first d columns of V^T" was changed to "the matrix composed of the first d columns of V".
I also updated the corresponding notebook, and added a comment to explain the issue.
Thanks again! :)
|
Haesun Park |
Sep 14, 2017 |
Nov 03, 2017 |
Printed |
Page 223
Multiple sentences |
(1st edition)
I think w_{i,j} is not a unit vector, so the \hat is not needed.
Also, the LLE equation is presented with a squared l2 norm, but a plain square is more common.
Thanks.
Note from the Author or Editor: Interesting question. The \hat in this context indicates that the weights are the result of a first optimization (that of Equation 8-4). It does not mean that we are talking about a unit vector. So I would rather leave them in place on this page because I think it helps understand which parts of Equation 8-5 are constant (i.e., the weights \hat{w}_{i,j}) and which parts are not (i.e., the positions of the instances in the low-dimensional space, z^(i)).
However, I agree that the l2 norm is unnecessary since we are computing the square, and of course ||v||^2 is the same as v^2. I'll replace the double vertical lines (||) with parentheses.
Thanks!
|
Haesun Park |
Oct 07, 2017 |
Nov 03, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 235
2nd paragraph |
You have "y depends on w, which depends on x".
I believe y depends on x, which depends on w.
Note from the Author or Editor: Good catch, thanks again Peter. :)
Indeed the sentence should read:
TensorFlow automatically detects that y depends on x, which depends on w, so it first evaluates w, then x, then y, and returns the value of y.
|
Peter Drake |
May 25, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 236
2nd paragraph |
In the Normal Equation, the left parenthesis to the left of the theta should be moved to right of the equal sign.
Note from the Author or Editor: Thanks for your feedback!
Indeed, there's a problem with the parentheses in this sentence. However the problem is actually that there is an opening parenthesis missing on the right hand side of the = sign. The text should look like this (except with nice math formatting):
[...] corresponds to the Normal Equation (theta_hat = (XT . X)-1 . XT . y; see Chapter 4).
I fixed the error and pushed it to production (the digital versions should be updated within a couple weeks).
Thanks again!
|
Peter Drake |
May 25, 2017 |
Jun 09, 2017 |
Printed |
Page 237
code block at bottom |
In the code block at the bottom of page 237, you use tf.reduce_mean without having explained what that function does. It was easy enough to look up in the TensorFlow documentation, but it would have been helpful to have the explanation in the text, and I assume you intended to explain reduce_mean in the list of brief explanations of newly-introduced functions (e.g. tf.assign) just above the code block.
Thanks.
Note from the Author or Editor: Thanks for your feedback. I initially thought that this function would be self-explanatory, given its name and the fact that it is used to compute the mean of the squared error, but I agree that the name can actually be confusing: it is somewhat unfortunate that they didn't just name it "mean()" instead of "reduce_mean()", as it's really analogous to NumPy's mean() function. To clarify this, I added the following line:
---
* The `reduce_mean()` function creates a node in the graph that will compute the mean of its input tensor, just like NumPy's `mean()` function.
---
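A tiny sketch of what that added line describes, assuming the TF 1.x graph/session style used in the book:
import numpy as np
import tensorflow as tf
a = tf.constant([[1., 2.], [3., 4.]])
mean_all = tf.reduce_mean(a)             # mean of every element, analogous to np.mean
with tf.Session() as sess:
    print(sess.run(mean_all))            # 2.5
print(np.mean([[1., 2.], [3., 4.]]))     # 2.5 as well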
I hope this helps.
|
Jeff Lerman |
Jan 09, 2018 |
Oct 12, 2018 |
Printed |
Page 245
last sentence |
1st edition.
In last sentence, "inside the loss namespace, ..." should be "inside the loss namescope, ..."
Note from the Author or Editor: Thanks a lot, it's a typo. I fixed it now. :)
|
Haesun Park |
Oct 04, 2017 |
Nov 03, 2017 |
PDF |
Page 246
2nd paragraph and code example |
tf.global_variable_initializers()
should be
tf.global_variables_initializer()
Note from the Author or Editor: Great catch, thanks! This error is now fixed, it was a failed find&replace, when the method `initialize_all_variables()` got renamed to `global_variables_initializer()`.
Best regards,
Aurélien
|
ken bame |
Feb 26, 2017 |
Mar 10, 2017 |
Other Digital Version |
248
4 |
"Zeta is the 8th letter of the Greek alphabet"
It is the 6th letter of the Greek alphabet.
Note from the Author or Editor: Indeed, Zeta is the 6th letter of the Greek alphabet, thanks!
|
Oliver Dozsa |
Oct 26, 2017 |
Nov 03, 2017 |
Printed |
Page 252
Excercise 12. fourth bullet |
(1st edition)
In chapter 9 ex. 12, fourth bullet says "... using nice scopes...".
I think it's a typo for "... using name scopes...".
Note from the Author or Editor: Indeed, this sentence should read "name scopes" instead of "nice scopes". Thanks!
|
Haesun Park |
Oct 07, 2017 |
Nov 03, 2017 |
Printed |
Page 257
The perceptron paragraph |
In the neural network literature, the artificial neuron in the perceptron model is usually called a Threshold Logic Unit (TLU). TLU is more common than LTU.
Note from the Author or Editor: Thanks for your feedback, indeed it seems that TLU is more common than LTU.
I tried to use "googlefight.com" to settle the dispute, but it failed, so I did a manual check:
* Google search for "threshold logic unit": 21,400 results.
* Google search for "linear threshold unit": 7,890 results.
So TLU wins hands down! :)
I also searched on Google's ngram viewer, and a few references to the TLU have been seen in various books, while there was no reference to LTU.
So I updated chapter 10 and the index to use Threshold Logic Unit rather than Linear Threshold Unit.
Thanks again,
Aurélien
|
Anonymous |
Mar 09, 2018 |
Oct 12, 2018 |
Printed |
Page 260
graph |
Are all the weights shown on the graph? I can't understand what you mean by the graph.
Note from the Author or Editor: Thanks for your question. The numbers on Figure 10-6 represent the connection weights. For example, if the network gets (0,0) as input (so x1=0 and x2=0), then neuron in the middle of the hidden layer will compute -1.5*1 + 1*x1 + 1*x2 = -1.5, which is negative so it will output 0. The neuron on the right of the hidden layer will compute -0.5 * 1 + 1 * x1 + 1 * x2 = -0.5, which is negative so it will also output 0. Finally, the output neuron at the top will compute -0.5*1 + -1*0 + 1*0 = -0.5, so the final output of the network will be 0. Indeed, 0 XOR 0 = 0, so far so good.
If we try again with inputs (1, 1), we get the following computations (considering the neurons in the same order):
-1.5*1 + 1*1 + 1*1 = 0.5 => output 1
-0.5*1 + 1*1 + 1*1 = 1.5 => output 1
-0.5*1 - 1*1 + 1*1 = -0.5 => final output 0
Again, this is good because 1 XOR 1 = 0.
If we try again with inputs (0, 1), we get the following computations (again, considering the neurons in the same order):
-1.5*1 + 1*0 + 1*1 = -0.5 => output 0
-0.5*1 + 1*0 + 1*1 = 0.5 => output 1
-0.5*1 - 1*0 + 1*1 = 0.5 => final output 1
Great, that's what we wanted: 0 XOR 1 = 1.
Lastly, we can try again with inputs (1, 0), and we get the following computations:
-1.5*1 + 1*1 + 1*0 = -0.5 => output 0
-0.5*1 + 1*1 + 1*0 = 0.5 => output 1
-0.5*1 - 1*0 + 1*1 = 0.5 => final output 1
Again, that's what we wanted: 1 XOR 0 = 1.
So this network does indeed solve the XOR problem, using the weights indicated on the diagram. I'll add a note to clarify the fact that the numbers on the diagram represent the connection weights.
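The walkthrough above is easy to verify with a few lines of Python; this is only a sketch of the Figure 10-6 network with a step activation, using the weights listed in the explanation:
def step(z):
    return 1 if z >= 0 else 0
def xor_net(x1, x2):
    h1 = step(-1.5 + 1 * x1 + 1 * x2)    # middle hidden neuron: fires only for (1, 1)
    h2 = step(-0.5 + 1 * x1 + 1 * x2)    # right hidden neuron: fires if any input is 1
    return step(-0.5 - 1 * h1 + 1 * h2)  # output neuron
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, "XOR", x2, "=", xor_net(x1, x2))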
I hope this is clearer.
Cheers,
Aurélien
|
calvin huang |
Jan 13, 2018 |
Oct 12, 2018 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 263
line 5 |
The sentence 'The softmax function was introduced in Chapter 3." is incorrect; the softmax function was introduced in Chapter 4 (p. 139 of the print edition).
Note from the Author or Editor: Good catch! Indeed, the softmax function was introduced in chapter 4, not 3 (in my first draft, it was introduced in chapter 3, hence the mistake).
Thanks a lot!
Aurélien
|
Glenn Bruns |
Jul 05, 2017 |
Aug 18, 2017 |
Printed |
Page 264
3rd code paragraph |
Code example:
>>>dnn_clf.evaluate(X_test,y_test)
isn't supported; it
should be
>>>dnn_clf.score(X_test,y_test)
instead.
Note from the Author or Editor: Thanks for your feedback. The code works fine in TensorFlow 1.0, but it breaks in TensorFlow 1.1, because TF.Learn's API was changed significantly. I noticed this a while ago and I updated the book accordingly (I removed the paragraph about evaluation because TF.Learn seems to be a moving target), so this problem only affects people who have the first revision of the book and are using TF 1.1+.
Cheers,
Aurélien
|
Yevgeniy Davletshin |
Jul 05, 2017 |
Aug 18, 2017 |
Printed |
Page 266
middle of the page, and the first line of the code |
In p.266, the std of 2/sqrt(n_input) is used to help the algorithm converge faster.
However, from the explanation in chapter 11 (p.278), it seems like this is only true when n_input and n_output are roughly the same and the activation function is the hyperbolic tangent.
Note from the Author or Editor: Great catch, thanks. I should have written 2/sqrt(n_inputs+n_neurons) or sqrt(2/n_inputs). This is He Initialization, to be used with the ReLU activation function (the latter would be okay when n_inputs is equal or close to n_outputs). In practice, for shallow networks (such as the ones in chapter 10) it's not a big deal if initialization is not perfect. It's much more important for deep nets.
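For illustration, a minimal sketch of He initialization in the TF 1.x style used in chapter 10; the layer sizes are made up:
import numpy as np
import tensorflow as tf
n_inputs, n_neurons = 784, 300                       # hypothetical layer sizes
stddev = np.sqrt(2 / n_inputs)                       # He initialization for ReLU
init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
W = tf.Variable(init, name="weights")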
I'll fix chapter 10, thanks again for your contribution!
|
Joshua Min |
Aug 16, 2017 |
Nov 03, 2017 |
Printed |
Page 268
Note |
(First Edition)
In the Note, "... corner case like logits equal to 0."
I think the corner case is the softmax output being equal to 0, or logits far less than 0.
In the cross entropy p*log(q), as you may know, q is the softmax output.
Note from the Author or Editor: Good catch! I replaced this sentence with this: "[...] and it properly takes care of corner cases: when logits are large, floating point rounding errors may cause the softmax output to be exactly equal to 0 or 1, and in this case the cross entropy equation would contain a log(0) term, equal to negative infinity. The `sparse_softmax_cross_entropy_with_logits()` function solves this problem by adding a tiny epsilon value to the softmax output.".
Thanks Haesun!
|
Haesun Park |
Oct 21, 2017 |
Nov 03, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 269
2nd paragraph |
"one mini-batches" should be "one mini-batch"
Note from the Author or Editor: Good catch, thanks. I fixed the error.
|
Peter Drake |
May 25, 2017 |
Jun 09, 2017 |
Printed |
Page 269
Last paragraph |
"...the code evaluates the model on the last mini-batch and on the full training set, and..."
should read
"...the code evaluates the model on the last mini-batch and on the full test set, and..."
|
Adam Chelminski |
May 31, 2017 |
Jun 09, 2017 |
Printed |
Page 269
last code block |
(First Edition)
In the execution phase, the training loop uses the mnist.test data.
As you may know, it's not good practice.
I suggest changing it to mnist.validation for most readers and evaluating the test set after the for-epoch loop.
Best,
Haesun. :)
Note from the Author or Editor: Thanks for your feedback. For a second I thought you were saying that I trained the model on the test set! :)
The training loop uses mnist.train for training, and shows the progress by evaluating the model on the test set. I agree with you that it would be better to use the validation set for this purpose. I'm updating the notebook and the book.
|
Haesun Park |
Oct 26, 2017 |
Nov 03, 2017 |
Printed, PDF, ePub |
Page 278
Table 11-1 |
You introduce the initialization schemes, Xavier's and He's, for both Uniform[-r, r] and Normal(0, sigma^2).
1. I think the order of listing is reversed between logistic and tanh.
2. This is a minor typesetting issue, but it keeps confusing me:
the number '4' in front of the initialization factors for 'Hyperbolic Tangent' (current unfixed version) looks like a fourth root. Could you increase its size a little bit in the next revision?
3. This is a question as a new beginner in this field.
He, et al. commented in their paper [arXiv:1502.01852] that
"We note that it is sufficient to use either Eqn.(14) or Eqn.(10) alone. For example, if we use Eqn.(14), then the product in Eqn.(13), the product (...)=1, and in Eqn.(9) the product (...) =c2/dL , which is not a diminishing number in common network designs. This means that if the initialization properly scales the backward signal, then this is also the case for the forward signal; and vice versa."
My question here is about the compromise version in your Table 11-1 for ReLU. Instead of using just either n_in or n_out, what benefit does (n_in+n_out)/2 give me? The numbers in the original version cancel exactly along the whole tower of layers, giving an exactly stable (fixed) variance of the gradients (or inputs, depending on the choice between n_in and n_out).
I think this makes a big difference when the sizes of the layers change a lot. I am just a beginner, so I have no idea how often I will encounter such cases in real problems. How about the geometric mean as an alternative?
Cheers,
Note from the Author or Editor: Thanks for your feedback!
1. You are right, I inverted the equations for Logistic and Hyperbolic Tangent, I just fixed this. Great catch!
2. I'm not sure how to increase the size of the font of the number 4, but I added a small space between it and the square root, hopefully it will avoid confusion.
3. That's a good question, I'm not sure whether using (n_in+n_out)/2 or just n_in or n_out is preferable. My intuition is that the former is better, but I don't have data to back that up, it would be interesting to run some experiments. I might try that when I get the chance.
|
Doyoun Kim |
Nov 12, 2018 |
Dec 07, 2018 |
PDF |
Page 279
2nd Paragraph |
Minor grammar issue that you might want to fix in the 2nd paragraph ('Nonsaturating Activation Functions'.)
.., it will start outputting 0. When this happen, the neuron ...
should be
.., it will start outputting 0. When this happens, the neuron ...
Thanks for a thoroughly enjoyable and informative book!
Note from the Author or Editor: Nice catch, thanks! I just fixed this, future reprints and digital editions should be fine.
|
Vineet Bansal |
Aug 21, 2017 |
Nov 03, 2017 |
Printed |
Page 281
Book 2nd release, 3rd list bullet |
The assertion
"the function is smooth everywhere, including around z = 0"
is only true if alpha = 1.
Note from the Author or Editor: Good point, you are absolutely right. I corrected this sentence like this:
Third, if alpha is equal to 1 then the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.
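A quick way to see why alpha = 1 matters: the slope of ELU just left of z = 0 is alpha, while just right of 0 it is 1 (minimal NumPy sketch):
import numpy as np
def elu(z, alpha=1.0):
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)
# left-hand derivative at 0 is alpha * exp(0) = alpha, right-hand derivative is 1,
# so the function is smooth at z = 0 only when alpha = 1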
Thanks for your feedback!
|
Paolo Baronti |
Jan 10, 2018 |
Oct 12, 2018 |
Printed |
Page 288
the code for reusing variables |
In "reuse_vars_dict", var.name was repeated twice instead of (var.op.name, var) as was shown in the jupyter notebook, but more importantly I think this line is redundant since feeding the saver with "reuse_vars" will lead to the same result: the new model will use of the variables in the hidden layers 1-3 under their old names.
Note from the Author or Editor: Thanks for your feedback! I actually fixed this error a few months ago, so the latest releases contain (var.op.name, var) instead of (var.name, var.name). However, I did not realize that I could just get rid of this line, that's nice! I just did, both in the book and in the Jupyter notebook.
Thanks again!
Aurélien
|
Anonymous |
Mar 21, 2018 |
Oct 12, 2018 |
Printed |
Page 295
Equation 11-5 |
I think the description of the Nesterov accelerated gradient equations is not correct.
In short, the sign of \theta + \beta m inside the gradient in eq. 1 is wrong.
Long version:
Under a strong convexity assumption, Nesterov acceleration can be viewed as the incremental version of momentum acceleration. If you put equations 1 and 2 from the book together, you get:
\theta = \theta - \beta m - \eta \nabla J (\theta + \beta m)
Notice the mismatch between (\theta - \beta m) and (\theta + \beta m) in the gradient.
Because, according to the author's notation, m is the accumulated estimate of the gradient (NOT the negative gradient), the gradient should be estimated at \theta - \beta m. Thus, in my opinion, the correct equations should be:
1. m <- \beta m + \eta \nabla J(\theta - \beta m)
2. \theta <- \theta - m
Hope this is helpful.
Note from the Author or Editor: Good catch, thanks. Indeed, I flipped the signs, so the steps should be:
1. m := beta * m - eta * gradient_at(theta + beta * m)
2. theta := theta + m
Latexmath:
\begin{split}
1. \quad & \mathbf{m} \gets \beta \mathbf{m} - \eta \nabla_\mathbf{\theta}J(\mathbf{\theta} + \beta \mathbf{m}) \\
2. \quad & \mathbf{\theta} \gets \mathbf{\theta} + \mathbf{m}
\end{split}
I often see m interpreted as the negative gradient, in which case the equations would be the following (that's what I was aiming for):
1. m <- beta * m + eta * gradient_at(theta - beta * m)
2. theta <- theta - m
However, I double checked: the figures and the text do not assume that m is the negative momentum, so I fixed the book as you suggested (and I also flipped the signs in the momentum optimization equations for consistency).
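For readers who prefer code to equations, a minimal sketch of one Nesterov update step following the corrected equations above; grad_fn is a hypothetical function returning the gradient at a given theta:
def nesterov_step(theta, m, grad_fn, eta=0.01, beta=0.9):
    g = grad_fn(theta + beta * m)   # measure the gradient slightly ahead, at theta + beta * m
    m = beta * m - eta * g          # step 1
    theta = theta + m               # step 2
    return theta, m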
Thanks again, I very much appreciate your help,
Aurélien
|
Bicheng Ying |
Jun 12, 2017 |
Aug 18, 2017 |
Printed |
Page 298
RMSProp section |
The RMSProp optimizer has a momentum=0.9 argument; however, a momentum term is not included in Equation 11-7.
Note from the Author or Editor: Thanks for your feedback. Indeed, the "raw" RMSProp algorithm, as presented on slide 29 of Geoffrey Hinton's 6th Coursera lecture (https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) does not use momentum, so that's what I presented, but indeed TensorFlow's implementation does add the option to combine it with momentum optimization (regular, not Nesterov). This was suggested by Hinton on slide 30 ("Further developments of rmsprop").
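Roughly, the combination looks like the following sketch; this is a simplification of an RMSProp-with-momentum update, not TensorFlow's exact kernel:
import numpy as np
def rmsprop_momentum_step(theta, s, m, g, eta=0.001, decay=0.9, momentum=0.9, eps=1e-10):
    s = decay * s + (1 - decay) * g * g              # running average of squared gradients
    m = momentum * m + eta * g / np.sqrt(s + eps)    # momentum applied to the scaled gradient
    theta = theta - m
    return theta, s, m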
I will clarify this for the next releases, thanks again for your feedback.
Cheers,
Aurélien
|
Anonymous |
Mar 27, 2018 |
Oct 12, 2018 |
Printed |
Page 299
Equation 11-8 Adam algorithm |
In steps 3 & 4 of the Adam algorithm in the book, the terms 'm' and 's' are overwritten.
According to the original Adam paper (https://arxiv.org/abs/1412.6980), they should not be overwritten across iterations; the bias-corrected versions of 'm' and 's' should only be used to calculate theta in the next step.
Note from the Author or Editor: Great catch Zhao! Indeed, I forgot the hats in steps 3, 4 and 5:
3. \hat{m} <- m / (1 - {\beta_1} ^ t)
4. \hat{s} <- s / (1 - {\beta_2} ^ t)
5. \theta <- \theta + \eta \hat{m} \oslash \sqrt{\hat{s} + \epsilon}
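A minimal NumPy sketch of the corrected steps, following the sign convention above (m accumulates the negative gradient, so step 5 adds eta * m_hat):
import numpy as np
def adam_step(theta, m, s, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # g: gradient at theta; t: iteration number, starting at 1
    m = beta1 * m - (1 - beta1) * g                     # step 1
    s = beta2 * s + (1 - beta2) * g * g                 # step 2
    m_hat = m / (1 - beta1 ** t)                        # step 3: bias-corrected copy, m itself is kept
    s_hat = s / (1 - beta2 ** t)                        # step 4
    theta = theta + eta * m_hat / np.sqrt(s_hat + eps)  # step 5
    return theta, m, s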
Thanks again,
Aurélien
|
Zhao yuhang |
Feb 25, 2018 |
Oct 12, 2018 |
Printed |
Page 300
Figure 11-6 |
Your example of how the Nesterov update converges faster is wrong in my opinion. The example is a function of one variable (theta). In consequence, the gradient (and the momentum, too) is one-dimensional at each point. But you draw them as 2-dimensional tangential vectors, which leads to your wrong assumption that the Nesterov update gets closer to the optimum in this example.
Why is your assumption wrong:
You can clearly see in the graph that
- eta * gradient1 > - eta * gradient2 > 0
and
beta * m > 0. (looking at the x-component)
This leads to
beta * m - eta gradient1 > beta * m - eta gradient2 > 0
and
|beta * m - eta gradient1| > |beta * m - eta gradient2| > 0
which is a clear contradiction to your drawing.
There are real examples where the Nesterov update is better than the regular momentum update:
- It crosses a local minimum / stationary point faster
- If the regular momentum update goes farther than the optimum, the Nesterov update does not go as far away from the optimum (in some situations).
#stilllovingyourbook
Note from the Author or Editor: Excellent catch, thanks! I tried to fix the figure while keeping the cost function 1D, but it looked bad, and it didn't make Nesterov Accelerated Gradient seem very useful at all, so I ended up changing the figure altogether to make the cost function 2D. Hopefully it should be in the tenth release of the 1st edition (which should come out very shortly, in December 2018).
Thanks again!
|
Niclas von Caprivi |
Sep 03, 2018 |
Dec 07, 2018 |
PDF |
Page 305
First line of last paragraph |
The beginning of the second sentence in the last paragraph says: Suppose p = 50, ....
Since p is a probability with a value from 0 to 1, it would be nice to explicitly state it as p = 50% or 0.5 so as to avoid ambiguity.
Note from the Author or Editor: You are right, there's a % sign missing, it should read "suppose p = 50%".
Thanks!
|
Denis Oyaro |
May 27, 2017 |
Jun 09, 2017 |
PDF |
Page 313
1st paragraph |
Among the further callback method names, it looks like "on_epoch_begin()" is listed twice, but there is no "..._end". Same for "on_batch_end()", where there's no "_begin". A copy/paste mixup?
Note from the Author or Editor: Great catch, thanks!
The sentence should be:
As you might expect, you can implement `on_train_begin()`, `on_train_end()`, `on_epoch_begin()`, `on_epoch_end()`, `on_batch_begin()` and `on_batch_end()`.
|
Gregory Deal |
Jun 12, 2019 |
|
Printed |
Page 322
Figure 12-5 |
(1st Edition)
In Fig 12-5, both the CPU and the GPU have inter-op and intra-op thread pools.
But AFAIK, the inter-op and intra-op pools are for the CPU.
Refer to https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu and https://stackoverflow.com/questions/41233635/tensorflow-inter-and-intra-op-parallelism-configuration
Please check this again.
Thank you.
Note from the Author or Editor: Thanks for your feedback. This is a great question! I wasn't quite sure about this when I was writing this chapter, so I asked the TensorFlow team, and here is the answer I got from one the team leads:
"""
[...]in my experience parallelism isn't very significant to GPU ops, since most of the acceleration is achieved under the hood with libraries like cudnn that do intra-op parallelism automatically, and [...] tend to take over the whole GPU.
As far as your diagram goes, I believe that we might support running multiple GPU threads through separate executor streams via StreamExecutor, but it's generally not a good idea from a performance point of view.
"""
So, my understanding was that, on the GPU, the intra-op thread pool exists, although it is managed by libraries such as cuDNN rather than by TensorFlow itself: I decided that this was an implementation detail (after all, TensorFlow is based on cuDNN), so I included the intra-op thread pool on the diagram, but it's true that it is not a configurable thread pool, contrary to the CPU inter-op thread pool.
Since TensorFlow must run operations in the proper order when there are dependencies, and since it manages execution using an inter-op thread pool for the CPU, I assumed that it must be the case as well for GPUs.
However... reading your question, it got me thinking about this some more, and I realized I could actually simply run a test. The conclusion is that I was wrong: there does NOT seem to be an inter-op thread pool for GPUs: TensorFlow just decides on a particular order of execution (deterministically, based on the dependency graph), then it runs the operations sequentially (however each operation may have a multi-threaded implementation).
So I will update this diagram and the corresponding paragraph.
I don't think it's a severe error, because it won't change much for users in terms of code, but it's a very useful clarification to avoid confusion.
I published the code of my experiment in this gist, in case you are interested:
https://gist.github.com/ageron/b378479efdf7e501bd270d032000fcc1
Thanks a lot!
Cheers,
Aurélien
|
Haesun Park |
Dec 04, 2017 |
Jan 19, 2018 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 328
Code section under Pinning Operations Across Tasks |
Missing colon in the with statement below
with tf.device("/job:ps/task:0/cpu:0")
a = tf.constant(1.0)
with tf.device("/job:worker/task:0/gpu:1")
b = a + 2
Note from the Author or Editor: Good catch! Thanks. Indeed, the code sample should look like this:
with tf.device("/job:ps/task:0/cpu:0"):
a = tf.constant(1.0)
with tf.device("/job:worker/task:0/gpu:1"):
b = a + 2
c = a + b
Thanks a lot,
Aurélien
|
Hei |
Jun 15, 2017 |
Aug 18, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 333
Equation 10-2. Perceptron learning rule |
Page number of error is not exact since I have the kindle (azw) version.
The error is at Chapter 10. Perceptron learning rule of Equation 10-2.
W(next step) = W + eta(y_hat - y)x # (estimation - true_label)
should be
W(next step) = W + eta(y - y_hat)x # (true_label - estimation)
Note from the Author or Editor: Good catch, indeed this is a mistake. Equation 10-2 should have target - estimation rather than estimation - target. In latex math, the equation should be:
{w_{i,j}}^{(\text{next step})} = w_{i,j} + \eta (y_j - \hat{y}_j) x_i
rather than:
{w_{i,j}}^{(\text{next step})} = w_{i,j} + \eta (\hat{y}_j - y_j) x_i
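As a quick sketch, the corrected rule in NumPy for a single training instance; the shapes and names are illustrative:
import numpy as np
def perceptron_update(W, x, y, y_hat, eta=0.1):
    # W: (n_inputs, n_outputs), x: (n_inputs,), y and y_hat: (n_outputs,)
    return W + eta * np.outer(x, y - y_hat)   # target minus prediction, as corrected above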
Thank you!
|
Lee, Hyun Bong |
Apr 30, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 333
Code section at the top |
In the second line of the code, it should call q.enqueue_many() instead of q.enqueue()
So, that line should be:
enqueue_many = q.enqueue_many([training_instances])
Note from the Author or Editor: Good catch! The text says you can use "enqueue_many" and then I use "enqueue" in the code example, I was probably out of coffee. ;-) That line of code should be:
enqueue_many = q.enqueue_many([training_instances])
Thanks a lot,
Aurélien
|
Hei |
Jun 19, 2017 |
Aug 18, 2017 |
PDF |
Page 337
Above 'Closing a queue' |
1st edition, 5th release.
In code block above 'Closing a queue',
dequeue_a, dequeue_b should be dequeue_as, dequeue_bs.
Thanks
Note from the Author or Editor: Good catch! That's a typical copy/paste error, sorry about that. Indeed, it should be dequeue_as and dequeue_bs, instead of dequeue_a and dequeue_b. Thanks a lot.
|
Haesun Park |
Feb 08, 2018 |
Oct 12, 2018 |
PDF |
Page 342
2nd paragraph from bottom |
In the 2nd edition, p. 342, you mention "shear luck". I think this should be "sheer luck", unless sheep have some effect I never heard of!
Note from the Author or Editor: Haha, good catch! :)
It should indeed be "sheer luck".
Cheers!
|
Gregory Deal |
Jul 04, 2019 |
|
|
360
Exercises 8-2 |
Thanks for this excellent book.
I am interested in particular in distributing TensorFlow. Unfortunately, there is no solution online for exercises 8-10 of chapter 12.
Do you plan to complete the corresponding notebook?
Thanks,
Giovanni
Note from the Author or Editor: Thanks for your feedback. Yes, sorry about that, exercise solutions took me way more time than I initially planned, and this chapter was a bit tricky because it required getting the user to set up various infrastructures (TF Serving, GCP, TF cluster, and so on). I chose to focus on the other chapters first, and never reached this one.
However, I recently answered a question about this topic on github:
Please take a look at my TF2 course notebooks at https://github.com/ageron/tf2_course
In particular 03_loading_and_preprocessing_data.ipynb and 04_deploy_and_distribute_tf2.ipynb.
There are two main scenarios when you go to the cloud:
* Running: you have already trained the model locally and you just want to run a web service that executes it.
* Training: you want to train your model at a large scale on the Cloud.
Running a trained model on GCP is not too hard. First, learn to deploy on TF Serving (as shown in the notebook), then basically you can use GCP as a hosted TF Serving.
For training (e.g., on TPU), check out this Colab notebook:
https://colab.research.google.com/github/GoogleCloudPlatform/training-data-analyst/blob/master/courses/fast-and-lean-data-science/01_MNIST_TPU_Keras.ipynb
Hope this helps,
Aurélien
|
Giovanni |
Feb 02, 2019 |
Mar 08, 2019 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 361
Undder # Create 2 filters comment |
He means to define the line filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32), but he calls the variable filters_test, like the two lines below it. The Jupyter notebook doesn't make that mistake, though.
Note from the Author or Editor: Good catch, thanks! I probably renamed the variable at one point and missed a few occurrences, sorry about that.
This is now fixed, but it may take some time to propagate to production.
|
Joseph Vero |
Apr 30, 2017 |
Jun 09, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 361
Ch 13, Paragraph after Fig 13-6 |
Text says
------------
Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sw to i × sw + fw – 1 and columns j × sh to j × sh + fh – 1, across all feature maps (in layer l – 1).
Concern
-----------
The book defines sw as horizontal stride, and sh as vertical stride. Cool.
My intuition is that the horizontal stride changes the feature map's number of columns, and the vertical stride changes the feature map's number of rows.
Should it be:
a) the horizontal stride sw (not sh) should affect the column ranges?
b) the vertical stride sh (not sw) should affect the row ranges?
Correction
--------------
Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sh to i × sh + fw – 1 and columns j × sw to j × sw + fh – 1, across all feature maps (in layer l – 1).
Please forgive me if I'm wrong. Just plodding through the book and doing 'back of the envelope' calculations/exercises as I go.
Regards,
dre
Note from the Author or Editor: Good catch, this is indeed an error, my apologies. Moreover, it helped me find an error in Equation 13-1. I double-checked the rest of pages 357-361 and they seem fine to me.
The sentence at the bottom of page 361 should be:
Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l - 1, located in rows i x sh to i x sh + fh - 1 and columns j x sw to j x sw + fw - 1, across all feature maps (in layer l - 1).
The Equation 13-1 should be (using latexmath):
z_{i,j,k} = b_k + \sum\limits_{u = 0}^{f_h - 1} \, \, \sum\limits_{v = 0}^{f_w - 1} \, \, \sum\limits_{k' = 0}^{f_{n'} - 1} \, \, x_{i', j', k'} . w_{u, v, k', k}
\quad \text{with }
\begin{cases}
i' = i \times s_h + u \\
j' = j \times s_w + v
\end{cases}
The difference is that u, v and k' must be zero-indexed, and i'=i x sh + u instead of i'=u x sh + fh - 1, and similarly j' = j x sw + v instead of j' = v x sw + fw - 1.
You can view the updated equation (and all equations in the book) at:
http://nbviewer.jupyter.org/github/ageron/handson-ml/blob/master/book_equations.ipynb
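For readers who like to check equations numerically, here is a naive (unvectorized, no padding) sketch of Equation 13-1; the function and variable names are illustrative:
import numpy as np
def conv_forward(X, W, b, sh=1, sw=1):
    # X: (height, width, fn_prev) input maps; W: (fh, fw, fn_prev, fn) kernels; b: (fn,) biases
    fh, fw, fn_prev, fn = W.shape
    out_h = (X.shape[0] - fh) // sh + 1
    out_w = (X.shape[1] - fw) // sw + 1
    Z = np.zeros((out_h, out_w, fn))
    for i in range(out_h):
        for j in range(out_w):
            # rows i*sh .. i*sh + fh - 1, columns j*sw .. j*sw + fw - 1, across all input maps
            patch = X[i * sh: i * sh + fh, j * sw: j * sw + fw, :]
            for k in range(fn):
                Z[i, j, k] = b[k] + np.sum(patch * W[:, :, :, k])
    return Z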
Thank you very much for your help,
Aurélien Géron
|
andre trosky |
Jun 22, 2017 |
Aug 18, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 362
1st para after Tensorflow Implementation title |
Text
-----
The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn, fn′].
Concern
-----------
On the same page above (p362), fn' is defined as the number of feature maps in the previous (l-1) convolutional layer. Let's assume then that fn is the number of feature maps in the l convolutional layer.
The Tensorflow API for tf.nn.conv2d has the 'filter' parameter defined as
[filter_height, filter_width, in_channels, out_channels].
Which using your current nomenclature means the text should read:
Correction
-------------
The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn', fn].
Additional
-------------
The TF implementation code on p363 defines the variable named 'filters' as:
[...]
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
[...]
Meaning it does adhere to the TF API for tf.nn.conv2d.
Note from the Author or Editor: Good catch! Yes indeed, it should read:
The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn', fn].
Thank you!
Aurélien
|
andre trosky |
Jun 23, 2017 |
Aug 18, 2017 |
Printed, |
Page 365
TIP below Figure 13-9 |
In TIP box below Fig 13-9,
I think that stacking two 3 x 3 kernels has the same effect as a 5 x 5 kernel, not a 9 x 9 kernel.
Two 3 x 3 conv layers have a 5 x 5 effective receptive field.
Thanks.
Note from the Author or Editor: Great catch, thanks! I fixed the tip like so:
A common mistake is to use convolution kernels that are too large. For example, instead of using a convolutional layer with a 5 × 5 kernel, it is generally preferable to stack two layers with 3 × 3 kernels: it will use fewer parameters and less compute, and usually perform better.
|
Haesun Park |
Aug 30, 2018 |
Dec 07, 2018 |
Printed |
Page 366
1st paragraph |
On the fourth line, the sentence says "it also create the bias variable (named bias) and initializes it with zeros". I believe the word "create" should be changed to "creates" adding an "s".
Note from the Author or Editor: Nice catch, thanks! It should indeed say "It also creates the bias variable" rather than "It also create the bias variable".
I just fixed the error.
Cheers,
Aurélien
|
Zoe Wexler |
Apr 25, 2018 |
Oct 12, 2018 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 383
Ch 14 equation 14-1 Output of a single recurrent neuron for a single instance |
Eq 14-1 implies:
---------------------
The value y(t) is a vector quantity.
Concern
------------
The way I understand Figure 14-2 is that each neuron in the recurrent layer outputs a single scalar value per timestep, and these scalars make up the vector quantity y.
Specifically, each element of the vector y comes from only one neuron's output in the recurrent layer.
But the single-neuron equation 14-1 implies that y(t) is a vector quantity.
Dimensional analysis of Eq 14-1 requires the value of y(t) to be a scalar if:
1. bias b is a scalar and
2. x(t) and w_x and y_t-1 and w_y are vectors
Eq 14-1 Correction
-------------------------
y(t) should not be bold face, implying that it's a scalar quantity specific to one neuron.
Note from the Author or Editor: Once again, good catch! My intention was actually to show the equation for a whole recurrent layer on a single instance (i.e., on one input sequence), not for a single neuron. So the equation is correct but the title is wrong. It should have been:
Equation 14-1. Output of a recurrent layer for a single instance
I will also fix the sentence introducing this equation, replacing "single recurrent neuron" with "recurrent layer":
The output of a recurrent layer can be computed pretty much as you might expect, as shown in Equation 14-1 [...]
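A short NumPy sketch of the corrected Equation 14-1 (one instance, a whole recurrent layer); the sizes and names are illustrative:
import numpy as np
n_inputs, n_neurons = 3, 5
x_t = np.random.randn(n_inputs)            # the input vector at time step t
y_prev = np.zeros(n_neurons)               # the layer's output at time step t-1
Wx = np.random.randn(n_inputs, n_neurons)
Wy = np.random.randn(n_neurons, n_neurons)
b = np.zeros(n_neurons)
y_t = np.tanh(x_t.dot(Wx) + y_prev.dot(Wy) + b)   # a vector of n_neurons values, one per neuron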
Thanks for your very helpful feedback,
Aurélien Géron
|
andre trosky |
Jun 25, 2017 |
Aug 18, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 383
Ch 14, explanation of terms in Eq 14-2 |
Text says
-------------
b is a vector of size n_neurons containing each neuron's bias term.
Concern
------------
Using this definition of b, the only way to properly add all the terms inside Eq 14-2 is to broadcast the bias term. Otherwise we're adding terms of different shapes.
The text does not mention this explicitly, and it can be confusing if you don't know what's 'going on under the hood', i.e. the broadcasting of b.
Let's assume the shape of the bias term is (1, n_neurons), therefore having size of n_neurons. In Eq14-2 (the first line), the other two product terms inside the activation function result in a shape of:
1. Shape of X_(t) . W_x is = (m, n_neurons)
2. Shape of Y_(t-1) . W_y is = (m, n_neurons)
This requires the bias term to also be of shape (m, n_neurons), so b is broadcast m times along its first dimension.
(This broadcasted shape of b also works in the second line of Eq 14-2.)
Correction
--------------
Maybe mention that the bias is being broadcasted (for those of us who are unfamiliar with it), or otherwise change the definition of its shape to be (m, n_neurons)?
Note from the Author or Editor: That's a great point. In fact, I should have mentioned this earlier, in chapter 10, the first time we use broadcasting when adding a bias vector. I just added the following sentence at the end of point 5 at the bottom of page 266:
Note that adding a 1D array (*b*) to a 2D matrix with the same number of columns (*X* . *W*) results in adding the 1D array to every row in the matrix: this is called _broadcasting_.
Thanks a lot,
Aurélien Géron
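A tiny numpy sketch of the broadcasting being described (toy shapes, not from the book):
import numpy as np

m, n_neurons = 4, 3
XW = np.ones((m, n_neurons))      # stands in for X(t)·W_x + Y(t-1)·W_y
b = np.array([0.1, 0.2, 0.3])     # shape (n_neurons,)
print((XW + b).shape)             # (4, 3): b is added to every row of the matrix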
|
andre trosky |
Jun 25, 2017 |
Aug 18, 2017 |
Other Digital Version |
386
Jupyter notebook |
The problem I observed was actually with the Jupyter notebook "14_recurrent_neural_networks.ipynb" currently (2017 July 13) on GitHub -- but the particular code with the problem is associated approximately with the text on page 386 of the printed book (illustrating the "static_rnn()" function).
Specifically, the output of "In [14]:" (show_graph(tf.get_default_graph())), which is supposed to be a graph of some kind, is instead a big empty space (1200 px X 620 px).
Similarly, the output of "In [26]:", in code demonstrating the result of "dynamic_rnn()", is also a big blank space.
Looking at the Firefox web-developer "Console" window, I see two JavaScript logging items which seem to say that "HTML Sanitizer" has changed the "iframe.srcdoc" value from what appears to be meaningful data to "null". Specifically, code in "/notebook/js/main.min.js" seems to be the place doing the sanitizing.
Configuration: Windows 7 64-bit, Firefox 48.0, Anaconda3 version 4.4.0 (2017-05-11), Python 3.6.1, Jupyter 5.0.0, TensorFlow 1.2.1. So, some of the package versions are later than the book, but I think the issue here is worth investigating.
Aside: This particular notebook ("14_recurrent_neural_networks.ipynb") currently contains a few more minor problems: "In [69]:", "In [77]:", and "In [103]:" all call functions which begin with "rnd". However, while it seems previous versions of the notebook included a statement "import numpy.random as rnd", the code has evidently been changed so that "rnd" is no longer defined. Changing the three instances of "rnd" to "numpy.random" fixes all three problems, enabling the entire notebook to be executed in Jupyter (the problem mentioned at the top of this note, namely the blank graph areas, remains, but it does not cause the notebook to stall execution midway, perhaps because the operation succeeded but was "sanitized" away).
Note from the Author or Editor: Thanks for your feedback. I just fixed the `rnd` issue in the Jupyter notebook, and I pushed the updated notebook to github (FYI, I use these imports so often that I added them to my python startup script, which is why I was not getting any error).
Regarding the `show_graph()` function, it does not seem to work across all browsers, unfortunately. I use Chrome, and the graph is displayed just fine, but some people have reported that it fails on Firefox, indeed. I'll try to find a way to make it work in Firefox, but in the meantime, the official way to visualize a TensorFlow graph is to use TensorBoard (see chapter 9).
|
Colin Fahey |
Jul 13, 2017 |
Aug 18, 2017 |
Printed |
Page 386
Last paragraph |
"... each with an input sequence composed of exactly two inputs..."
Shouldn't it be three inputs? The mini-batches are 4 by 3. If it is 2 and I got it wrong, then perhaps it should be clarified.
Note from the Author or Editor: Thanks for your question. The text is correct, it is exactly two inputs, but I changed the wording to clarify:
BEFORE:
This mini-batch contains four instances, each with an input sequence composed of exactly two inputs.
AFTER:
This mini-batch contains four instances, where each instance is a sequence composed of exactly two 3D inputs. For example, the first instance is the sequence [0, 1, 2], [9, 8, 7].
I hope this is clearer.
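For illustration, such a mini-batch could look like the following (the first instance matches the sequence quoted above; the other values are made up):
import numpy as np

X_batch = np.array([
    # t = 0      t = 1
    [[0, 1, 2], [9, 8, 7]],   # instance 1
    [[3, 4, 5], [0, 0, 0]],   # instance 2
    [[6, 7, 8], [6, 5, 4]],   # instance 3
    [[9, 0, 1], [3, 2, 1]],   # instance 4
])
print(X_batch.shape)  # (4, 2, 3): 4 instances, 2 time steps, 3 inputs per time step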
|
Juan Manuel Parrilla Gutierrez |
Nov 05, 2018 |
Dec 07, 2018 |
Printed |
Page 395
Figure 14-8 |
OutputConnectionWrapper should be OutputProjectionWrapper.
Note from the Author or Editor: Good catch, indeed this was a typo: it's not OutputConnectionWrapper but OutputProjectionWrapper. The notebook was okay though. I fixed the book.
Thanks!
|
Anonymous |
Sep 29, 2017 |
Nov 03, 2017 |
PDF |
Page 405
6th line of the code |
reuse_vars_dict = dict([(var.name, var.name) for var in reuse_vars])
should be:
reuse_vars_dict = dict([(var.name, var) for var in reuse_vars])
Note from the Author or Editor: Good catch, thanks! Indeed, it should read:
reuse_vars_dict = dict([(var.name, var) for var in reuse_vars])
I've updated the book, it should be live within a few weeks for the digital versions.
|
James Wong |
May 17, 2017 |
Jun 09, 2017 |
Printed |
Page 407
Equation 14-4 |
The equation for h_t appears to be incorrect. Instead of
h_t = (1 - z_t) * h_(t - 1) + z_t * g_t
the Cho et al. (2014) paper has in equation 7
h_t = z_t * h_(t - 1) + (1 - z_t) * g_t
Accordingly, the “1-” unit in figure 14-14 on p. 406 should be moved right, to the path leading from z_t to the multiplication with the output of g_t. (And the label for the g_t is missing.)
Your visualizations of the RNN cells are a great help, and are much appreciated!
Note from the Author or Editor: Thanks for your feedback. You are right that my graph & equations inverted z_t and 1 - z_t. Fortunately, the GRU cell works fine either way. Indeed, the z gate is trying to learn the right balance between forgetting old memories (let's call this f) and storing new ones (let's call this i). In a GRU cell, f = 1 - i. In the paper, the z gate outputs f, while in my book, it outputs i. Either way, the right balance will be found just as well.
If you want an analogy, it's as if you were learning how empty a glass should be, while I was learning how full it should be. The net result is the same, but somehow I find the latter a bit more natural. ;) That said, even though "my" equations will work fine, I will fix them so that people don't get confused when they see other implementations or read the paper.
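For reference, the paper's version of the state update, written with the book's bold-vector notation, is:
\mathbf{h}_{(t)} = \mathbf{z}_{(t)} \otimes \mathbf{h}_{(t-1)} + (1 - \mathbf{z}_{(t)}) \otimes \mathbf{g}_{(t)}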
|
Nick Pogrebnyakov |
Oct 10, 2017 |
Nov 03, 2017 |
Printed |
Page 420
1st sentence |
(1st edition)
In the first bullet above 'Training One Autoencoder at a Time',
"First, weight3 and ..."
should be
"First, weights3 and ..."
Thanks.
Note from the Author or Editor: Yet another good catch, thanks Haesun. Fixed to weights3.
|
Haesun Park |
Dec 26, 2017 |
Oct 12, 2018 |
Other Digital Version |
422
last paragraph |
Jupyter Notebook: 15_autoencoders
Cell: Unsupervised pretraining, In [30]
weights3_init = initializer([n_hidden2, n_hidden3])
should be
weights3_init = initializer([n_hidden2, n_outputs])
and
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
should be
biases3 = tf.Variable(tf.zeros(n_outputs), name="biases3")
I'm wondering why
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
is not causing an error.
Note from the Author or Editor: Nice catch! Indeed, this is a typo. It does not explode because n_hidden3 is defined in [23], and it is equal to 300. So the network has 300 outputs instead of 10. The function sparse_softmax_cross_entropy_with_logits() does not explode because it expects the target labels to be between 0 and 299, which is the case (since the labels are between 0 and 9). So the network simply learns to ignore classes 10 to 299.
I'll fix this today, thanks a lot for your feedback, this is very helpful. :)
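A quick standalone check of this behavior (my own toy example, assuming TensorFlow 1.x as in the book):
import numpy as np
import tensorflow as tf

logits = tf.constant(np.random.randn(5, 300).astype(np.float32))  # 300 "classes"
labels = tf.constant([0, 3, 9, 2, 7])                             # labels stay in 0..9
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
with tf.Session() as sess:
    print(sess.run(xentropy).shape)  # (5,): no error, classes 10..299 are simply never targets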
|
Anonymous |
Sep 08, 2017 |
Nov 03, 2017 |
Printed |
Page 425
Under a note |
(1st edition)
A paper link (https://goo.gl/R5L7HJ) is broken.
Please refer to this one instead: http://www.iro.umontreal.ca/~lisa/pointeurs/BengioNips2006All.pdf
Thanks.
Note from the Author or Editor: Thanks Haesun. I updated the short link to: https://goo.gl/smywDc
It points to: https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf
which seems more likely to be stable, given that it's hosted by nips.cc instead of a user folder.
Thanks again!
|
Haesun Park |
Dec 27, 2017 |
Oct 12, 2018 |
Printed |
Page 436
Exercises 8 |
In the first bullet of Ex 8,
the short URL for "download_and_convert_data.py" is broken.
It should be linked to "https://github.com/tensorflow/models/blob/master/research/slim/download_and_convert_data.py"
Thanks.
Note from the Author or Editor: Thanks Haesun. Yikes, it's the second time this link breaks, they keep moving folders around. Perhaps I should point to a search query instead. ;)
For now, I've updated the link to this short link: https://goo.gl/fmbnyg
|
Haesun Park |
Dec 29, 2017 |
Oct 12, 2018 |
Printed |
Page 437
Exercises 9 |
(1st edition) In the last bullet of Ex 9,
"Jinma Gua" should be "Jinma Guo"
Thanks :)
Note from the Author or Editor: Good catch, thanks Haesun. Fixed to Guo.
|
Haesun Park |
Dec 29, 2017 |
Oct 12, 2018 |
Printed |
Page 439
footnote 1. |
(1st edition)
In footnote 1, the RL book link (https://goo.gl/7utZaz) is broken.
Please refer to this one instead: http://www.incompleteideas.net/book/the-book-2nd.html
Thanks.
Note from the Author or Editor: Thanks Haesun. I actually fixed this link already in the latest release:
https://goo.gl/K1Gibs
But your link is better, as it points to the latest edition, so I'm updating it to:
https://goo.gl/AZzunZ
Thanks again!
|
Haesun Park |
Jan 02, 2018 |
Oct 12, 2018 |
Printed |
Page 441
bottom |
More information in RL: See footnote 1:
https://goo.gl/7utZaz
This link doesn't work and returns a "404 Not Found".
Note from the Author or Editor: Thanks for your feedback. Indeed, this URL was broken, I fixed it in the latest release. The new URL for this book is: https://goo.gl/AZzunZ
Thanks again,
Aurélien
|
Anonymous |
Apr 01, 2018 |
Oct 12, 2018 |
Printed |
Page 446
2nd code block |
The code shows the creation of "your first environment" and reads:
>>> import gym
>>> env = gym.make("CartPole-v0")
[2016-10-14 16:03:23,199] Making new env: Ms Pacman-v0
[...]
The output shown was probably copied from further on in the chapter, since it should be (and I quote my own output)
[2017-09-13 10:48:27,402] Making new env: CartPole-v0
Note from the Author or Editor: Nice catch, thanks! I fixed this, the next digital and paper editions should be good.
|
Francesco Siani |
Sep 13, 2017 |
Nov 03, 2017 |
Printed, |
Page 449
Last paragraph |
In Chapter 16,
the discount rate (r) is different from the discount factor (\gamma):
discount factor \gamma = 1/(1+r).
So I recommend:
'discount rate' in text should be 'discount factor'.
'discount_rate' in code should be 'discount_factor'.
Thanks.
Note from the Author or Editor: Thanks Haesun, that's a good point. I used "discount rate" to mean "discount factor", and I have seen several people do the same, but you are right that it's clearer to replace "discount rate" with "discount factor" everywhere in chapter 16. I just did this. In the code examples, I have a constraint of using 80 characters max per line, so I cannot easily replace discount_rate with discount_factor; instead, I replaced discount_rate with gamma, with a comment in the code every time I define gamma, for example:
gamma = 0.95 # the discount factor
Also, the first time that the discount factor is introduced (just before figure 16-6), instead of naming it "r", I named it gamma. This avoids possible confusion with rewards (which are named "r") later in the chapter, and it also makes the chapter more consistent.
Thanks for your suggestion!
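As a small illustration of how a discount factor weights future rewards when computing a return (my own toy numbers, not from the book):
gamma = 0.95                 # the discount factor
rewards = [10, 0, -50]       # rewards collected at steps 0, 1 and 2
ret = sum(gamma**t * r for t, r in enumerate(rewards))
print(ret)                   # 10 + 0.95*0 + 0.95**2*(-50) = -35.125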
|
Haesun Park |
Jan 11, 2018 |
Oct 12, 2018 |
Printed |
Page 456
2nd paragraph |
Labels (a1, a2, s2, s3) in the text for Figure 16-8 are incorrectly printed.
Note from the Author or Editor: Thanks for your feedback. I'm not sure exactly what you mean by "printed incorrectly". Are you referring to the text font (I am not seeing a problem)? Or to the fact that the text contained a couple errors (e.g., inverted a1 and a2, and s2 and s3)? I assume it's the latter. I fixed these errors:
BEFORE: In state _s_~1~ it has only two possible actions: _a_~0~ or _a_~1~. It can choose to stay put by repeatedly choosing action _a_~1~, or it can choose to move on to state _s_~2~ and get a negative reward of -50 (ouch). In state _s_~3~ it has no other choice [...] and in state _s_~3~ the agent has no choice but to take action [...].
AFTER: In state _s_~1~ it has only two possible actions: _a_~0~ or _a_~2~. It can choose to stay put by repeatedly choosing action _a_~0~, or it can choose to move on to state _s_~2~ and get a negative reward of -50 (ouch). In state _s_~2~ it has no other choice [...] and in state _s_~2~ the agent has no choice but to take action [...]
Thanks again!
Aurélien
|
Yevgeniy Davletshin |
Jun 15, 2017 |
Aug 18, 2017 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 459
Code section under "Now let’s run the Q-Value Iteration algorithm" |
The learning_rate is defined as 0.01 but it is never used. It is not a real problem as the algorithm doesn't take a learning rate. But it is confusing when you read the code.
Note from the Author or Editor: Indeed, the learning_rate is unused in this code, I just removed it.
Thanks!
Aurélien Géron
|
Hei |
Jun 23, 2017 |
Aug 18, 2017 |
Printed |
Page 459
Code at top of page |
Comparing the contents of the array R in the Python code to Figure 16-8, I believe the second occurrence of the value 10.0 in the definition of R should be 0.0, in this code:
R = np.array([ # shape=[s, a, s']
:
[[10., 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.]],
:
])
There is no transition from state S1 through action a0 back to state S0, so a reward of +10 does not do anything here.
It also does no harm, but it may reduce some confusion :)
Note from the Author or Editor: Ha! Good catch! :) Indeed, that line should read:
[[0.0, 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.]],
As you point out, this is a reward for a transition that has 0% probability, so it doesn't change the result, but I agree that it's potentially confusing. I've fixed it in the book (the notebook was already okay, somehow, I must have noticed the issue in the notebook at some point, but forgot to fix it in the book).
Thanks for your help! :)
|
Wouter Hobers |
Sep 28, 2017 |
Nov 03, 2017 |
PDF |
Page 460
Equation 16-6 |
\max_{\alpha'} should say \max_{a'}
Note from the Author or Editor: Great catch, thanks! Alpha looks so much like "a", especially in small font like this, I would never have noticed. The error is fixed now.
|
joseluisfb |
Apr 18, 2018 |
Oct 12, 2018 |
PDF |
Page 461
Q-Learning code example |
I've downloaded the latest version of the PDF and ePub from my OReilly account.
The PDF version is 2017-06-09.
The ePub version is 2017-06-09.
Concern
------------
The code for Q-Learning seems not to match Equation 16-5, in particular in how the learning rate (aka alpha) is used.
Currently the code reads:
[...]
Q[s, a] = learning_rate * Q[s, a] + (1 - learning_rate) * (
reward + discount_rate * np.max(Q[sp])
)
[...]
To agree with Equation 16-5 it should be:
[...]
Q[s, a] = (1 - learning_rate) * Q[s, a] + learning_rate * (
reward + discount_rate * np.max(Q[sp])
)
[...]
I've checked through the Jupyter notebook for the Reinforcement chapter and it looks to agree with Equation 16-5, albeit the code is set up a little differently.
Aside
--------
It looks like this latest version of the PDF and ePub doesn't have some of the corrections previously fixed.
e.g. The description for the states in Figure 16-7 p456 PDF still refers to the nonexistent state s3. I haven't checked any other fixed errata, but could it be that O'Reilly has not correctly set up/linked to the newest up-to-date version? Just adding another 'data point' to hopefully help if it's confusing others.
Or I'm doing something wrong. I don't know.
Almost at the end :) Thanks again for the book, stuff is finally 'clicking'.
Note from the Author or Editor: Good catch, thanks! Indeed, the code was wrong, it should have been as you said, reversing (1 - learning_rate) and learning_rate, just like in Equation 16-5. I just pushed the fix to O'Reilly's git repo, so both the digital editions and new printed books should be fixed within the next couple of weeks.
Regarding the description of Figure 16-7, it is normal that it mentions state s3 since that state exists on the figure. However, if you see state s3 still mentioned in the description of Figure 16-8, then there's a problem. I'll contact O'Reilly to make 100% sure that all the digital editions are up to date.
Note: I have sync'ed the code examples from all chapters with the code in the Jupyter notebooks, except for chapters 15 and 16, which are not 100% synchronized yet.
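A minimal self-contained sketch of the corrected update on a toy Q-table (my own setup, not the book's MDP):
import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
learning_rate, discount_rate = 0.1, 0.95

s, a, reward, sp = 0, 1, 5.0, 2     # one observed transition (s, a, r, s')
Q[s, a] = ((1 - learning_rate) * Q[s, a] +
           learning_rate * (reward + discount_rate * np.max(Q[sp])))
print(Q[s, a])  # 0.5 on the first update, since Q started out at zero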
|
andre trosky |
Jul 20, 2017 |
Aug 18, 2017 |
Printed, |
Page 461
code block |
(In revised Printed Version and Safari Online)
Above 'Exploration Policies', the Q[s, a] assignment needs a closing parenthesis.
Q[s, a] = ((1 - learning_rate) * Q[s, a] +
learning_rate * (reward + discount_rate * np.max(Q[sp]))
should be
Q[s, a] = ((1 - learning_rate) * Q[s, a] +
learning_rate * (reward + discount_rate * np.max(Q[sp])))
Also, in the small code block for the X_action placeholder and q_value on page 468,
the loss should be calculated from online_q_values, not target_q_values.
q_value = tf.reduce_sum(target_q_values * tf.one_hot(X_action, n_outputs),
axis=1, keep_dims=True)
should be
q_value = tf.reduce_sum(online_q_values * tf.one_hot(X_action, n_outputs),
axis=1, keep_dims=True)
Thanks.
Note from the Author or Editor: Thanks Haesun, indeed there was a missing closing parenthesis. I just fixed this.
|
Haesun Park |
Jan 10, 2018 |
Oct 12, 2018 |
Printed |
Page 469
the main loop |
Dear Mr. Géron,
First thank you very much for the wonderful book!
I am a bit confused when comparing the book with the nature paper "Human-level control through deep reinforcement learning". Please see Algorithm 1 in Methods.
Is there an exact correspondence between actor/critic in your book, and theta/theta^- in the paper? In the paper theta plays AND learns, however in the book actor plays and critic learns.
Thank you again for the book and for your precious time!
All the best,
Yehua
Note from the Author or Editor: Thanks a lot for your question, you helped me find the worst errors so far in the book. I fixed the Jupyter notebook for chapter 16 and I added a message at the beginning of the "Learning to play MsPacman with the DQN algorithm" section with the details of the errors:
1. The actor DQN and critic DQN should have been named "online DQN" and "target DQN" respectively. Actor-critic algorithms are a distinct class of algorithms.
2. The online DQN is the one that learns and is copied to the target DQN at regular intervals. The target DQN's only role is to estimate the next state's Q-Values for each possible action. This is needed to compute the target Q-Values for training the online DQN, as shown in this equation:
y(s, a) = r + g * max_a' Q_target(s′, a′)
* y(s,a) is the target Q-Value to train the online DQN for the state-action pair (s,a).
* r is the reward actually collected after playing action a in state s.
* g is the discount rate.
* s′ is the state actually reached after playing action a in state s.
* a′ is one of the possible actions in state s′.
* max_a' means "max over all possible actions a' "
* Q_target(s′,a′) is the target DQN's estimate of the Q-Value of playing action a′ while in state s′.
In regular approximate Q-Learning, there would be a single model Q(s,a), which would be used both for predicting Q(s,a) and for computing the target using the equation above (which involves Q(s', a')). That's a bit like a dog chasing its tail: the model builds its own target, so there can be feedback loops, which can result in instabilities (oscillations, divergence, freeze, and so on). By having a separate model for building the targets, and by updating it not too often, feedback loops are much less likely to affect training.
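As a rough numpy sketch of how those targets could be computed for a batch of sampled transitions (made-up names and values, not the book's actual DQN code):
import numpy as np

gamma = 0.99                                  # the discount factor g in the equation
rewards = np.array([1.0, 0.0])                # r for two sampled transitions
continues = np.array([1.0, 0.0])              # 0.0 when the episode ended at s'
next_q_values = np.array([[0.5, 2.0, 1.0],    # target DQN's estimates Q_target(s', a')
                          [0.1, 0.2, 0.3]])
max_next_q = np.max(next_q_values, axis=1)    # max over all possible actions a'
y = rewards + continues * gamma * max_next_q  # targets for training the online DQN
print(y)  # 2.98 for the first transition (r + gamma * max_a' Q), 0.0 for the second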
Apart from that I tweaked a few hyperparameters and I updated the cost function, but those are minor details in comparison.
I hope these errors did not affect you too much, and if they did, I sincerely apologize.
Post-mortem, lessons I learned:
1. Spend more time reading the original papers and less time (mis)interpreting people's various implementations.
2. Use proper metrics to observe progress (e.g., track the max Q-Value or the total rewards per game), instead of falling into the confirmation bias trap of thinking that the agent is making progress when it is not. Testing on a simpler problem first would also have been a good idea.
3. Be extra careful when you reach the final section of the final chapter: that's when you're most tempted to rush and make mistakes.
Again, I would like to thank you for bringing this issue to my attention, it's great to get such constructive feedback.
Cheers,
Aurélien Géron
|
Yehua Liu |
Aug 10, 2017 |
Nov 03, 2017 |
Printed |
Page 474
2nd line in 1st code segment |
I think q_value should be calculated using online_q_values instead of target_q_values.
Great book, super useful and clear, thanks!
Note from the Author or Editor: Great catch! Indeed, it should be online_q_values instead of target_q_values, thanks a lot!
(I just checked, the Jupyter notebook was okay, so I guess I fixed the notebook some time ago, and I forgot to fix the text, sorry about that).
|
Sebastian Lehner |
Dec 02, 2018 |
Mar 08, 2019 |
Printed, |
Page 479
Chapter 6's ex. 2 |
Chapter 6's ex. 2
Gini impurity calculation looks like 1-1^2/5-4^2/5=0.32 and 1-1^2/2-1^2/2=0.5
Adding parentheses is better. e.g. 1-(1/5)^2-(4/5)^2=0.32
Thanks.
Note from the Author or Editor: Thanks Haesun, indeed this notation can be confusing. I added parentheses.
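A quick numeric check of the two impurities quoted above:
left_node = 1 - (1/5)**2 - (4/5)**2    # 0.32 (up to floating-point rounding)
right_node = 1 - (1/2)**2 - (1/2)**2   # 0.5
print(left_node, right_node)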
|
Haesun Park |
Jan 20, 2018 |
Oct 12, 2018 |
Printed, PDF, ePub, Mobi, , Other Digital Version |
Page 486
Last bullet point |
In the last sub-question of the chapter 10 exercises you ask us to write the equation that computes the network output matrix Y as a function of X, W_h, b_h, W_o, and b_o.
You give the solution as follows.
Y = (X \cdot W_h + b_h) \cdot W_o + b_o
I understand why this equation for Y could be correct, but only if we ignore the ReLU activation functions for all of the artificial neurons.
It seems the solution would change when considering the activation functions of the 50 artificial neurons in the hidden layer and the 3 artificial neurons in the output layer, which all have ReLU activation.
When considering the ReLU activation of the 53 total artificial neurons, would this be a correct equation?
Y = max(max(X \cdot W_h + b_h, 0) \cdot W_o + b_o, 0)
Regardless of whether my equation is correct, I think this would be a more complete and informative exercise if you showed how the equation given as the solution in the appendix would change (or not) when we consider the ReLU activation functions that you posed in the original question.
Otherwise, this is a very good and helpful exercise!
Note from the Author or Editor: Good catch, you are right, I forgot the ReLU activations! :( The answer should indeed be:
Y = max(max(X \cdot W_h + b_h, 0) \cdot W_o + b_o, 0)
It's also fine to write ReLU(z) instead of max(z, 0):
Y = ReLU(ReLU(X . W_h + b_h) . W_o + b_o)
I just updated the book, the digital versions will be updated within a couple of weeks.
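A small numpy sketch of the corrected forward pass (the shapes follow the exercise's 50 hidden and 3 output neurons; the input size and data are made up):
import numpy as np

m, n_inputs, n_hidden, n_outputs = 2, 10, 50, 3
X = np.random.rand(m, n_inputs)
W_h, b_h = np.random.randn(n_inputs, n_hidden), np.zeros(n_hidden)
W_o, b_o = np.random.randn(n_hidden, n_outputs), np.zeros(n_outputs)

relu = lambda z: np.maximum(z, 0)
Y = relu(relu(X.dot(W_h) + b_h).dot(W_o) + b_o)
print(Y.shape)  # (2, 3): one row per instance, one column per output neuron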
|
Shane |
May 27, 2017 |
Jun 09, 2017 |
Printed |
Page 491
3rd Paragraph |
The answer to the second part of question 2 in Chapter 13: Convolutional Neural Networks reads:
"...this first layer takes up 4 x 100 x 150 x 100 = 6 million bytes (about 5.7 MB)...The second layer takes up 4 x 50 x 75 x 200 = 3 million bytes (about 2.9 MB). Finally, the third layer takes up 4 x 25 x 38 x 400 - 1,520,000 bytes (about 1.4 MB). However, once a layer has been computed, the memory occupied by the previous layer can be released, so if everything is well optimized, only 6 + 9 = 15 billion bytes (about 14.3 MB) of RAM will be required (when the second layer has just been computed, but the memory occupied by the first layer is not released yet)."
For the situation described, if both the first and second layers are in memory, would that not be 3 + 6 = 9 million bytes (8.58 MB) of RAM required? When you add the amount occupied by the CNN's parameters (3,613,600 bytes) that would be a total of about 12 MB for predicting a single instance.
I could also be missing something really obvious so sorry if that is the case. Either way, thanks for the great, enjoyable book!
Note from the Author or Editor: You are correct, I have no idea why I wrote 6+9 instead of 6+3. Thanks a lot!
I just fixed the paragraph like this:
"""
However, once a layer has been computed, the memory occupied by the previous layer can be released, so if everything is well optimized, only 6 + 3 = 9 million bytes (about 8.6 MB) of RAM will be required (when the second layer has just been computed, but the memory occupied by the first layer is not released yet). But wait, you also need to add the memory occupied by the CNN's parameters. We computed earlier that it has 903,400 parameters, each using up 4 bytes, so this adds 3,613,600 bytes (about 3.4 MB). The total RAM required is (at least) 12,613,600 bytes (about 12.0 MB).
"""
|
Will Koehrsen |
Jul 21, 2017 |
Aug 18, 2017 |
PDF, |
Page 523
Last paragraph |
1st edition 5th release.
If a Hopfield net contains 36 neurons, the total number of connections is 630 (= 36*35/2), not 648. :)
Thanks.
Note from the Author or Editor: Good catch! Of course if there are n neurons, then there are 1+2+3+...+(n-1) = (n - 1) * n / 2 connections. It seems that I computed 36*36/2 instead of 35*36/2, probably a typo on my calculator. :/
Fixed, thanks once more!
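A one-line check of the count:
n = 36
print(n * (n - 1) // 2)  # 630 connections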
|
Haesun Park |
Mar 08, 2018 |
Oct 12, 2018 |
Printed |
Page 550
Colophon |
My friend showed me his at-the-time favourite book and I wondered about the salamander mascot on the cover. I read the explanation on the last page. Sorry, but the salamander shown is definitely not Salamandra infraimmaculata, but our native species Salamandra salamandra (I am a German biologist). S. infraimmaculata would have a more rounded head, slightly different markings, and -very important- NO black pigmented ends of the parotideal excretory ducts. Amongst other things.
So I checked the original source "The Illustrated Natural History". There, the amphibian in your picture is identified as S. maculata. This epithet was used until 1955 for S. salamandra and is now synonymous with it.
Moreover, you described the Near Eastern fire salamander (Salamandra infraimmaculata) as a "Far Eastern fire salamander found in the Middle East". Very confusing and incorrect. Furthermore, no species of fire salamander "lays their eggs in the water". In contrast to common frogs, fire salamanders are ovoviviparous: they deposit living tadpoles into the water.
Note from the Author or Editor: Thanks a lot for your very interesting feedback. I will forward your message to O'Reilly: they are the ones who select the animals on the book covers, and who write the corresponding text. Hopefully, they will fix this by the next release of the book.
|
Dr. Verena Wilhelmi |
Oct 23, 2018 |
Dec 07, 2018 |