Errata

Errata for Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, Second Edition


The errata list contains errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.


Version	Location	Description	Submitted By	Date submitted	Date corrected
	? Section: Computing Gradients Using Autodiff	Super minor typo: just replace you must call the tape’s jabobian() method with you must call the tape’s jacobian() method	Thierry Herrmann	Sep 30, 2019	Oct 11, 2019
ePub	Page ch. 14 TensorFlow Implementation	The code that calls `load_sample_image()` requires Pillow to be installed (python3 -m pip install Pillow); otherwise we get an error. This could be added as a footnote, or in the "Create the Workspace" section in chapter 2. Note from the Author or Editor: Indeed, the Pillow package is required by the `load_sample_image()` function. I added a note. Thanks!	Mohammed El-Beltagy	Oct 27, 2019	Nov 22, 2019
ePub	Page ch. 14 TensorFlow Implementation, third bullet point	"stride length of 2" should be replaced by "stride length of 1" to be consistent with above code. Note from the Author or Editor: Great catch, it should indeed be "stride length of 1". Thanks!	Mohammed El-Beltagy	Oct 27, 2019	Nov 22, 2019
Mobi	Page ch. 14 TensorFlow Implementation (code)	`outputs = tf.nn.conv2d(images, filters, strides=1, padding="same")` should be changed to `outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")`. Note from the Author or Editor: Good catch! Indeed, the `tf.nn.conv2d()` function accepts only uppercase `padding` values. `keras.layers.Conv2D` supports both uppercase and lowercase arguments, and Francois Chollet told me that the lowercase values are preferred, so I updated the whole book. I didn't realize that `tf.nn.conv2d()` was different. Thanks!	Mohammed El-Beltagy	Oct 27, 2019	Nov 22, 2019
Other Digital Version	Chapter 14 Table 14.2	Missing max pooling layer between C7 and F8. 13 × 13 × 256 ≠ 4096. See the table at https://engmrk.com/alexnet-implementation-using-keras/ Note from the Author or Editor: Great catch, thanks! Indeed, a Max Pooling layer was missing just after the last convolutional layer. The new table looks like this (in AsciiDoc format):
|=======
| Layer | Type | Maps | Size | Kernel size | Stride | Padding | Activation
| Out | Fully connected | – | 1,000 | – | – | – | Softmax
| F10 | Fully connected | – | 4,096 | – | – | – | ReLU
| F9 | Fully connected | – | 4,096 | – | – | – | ReLU
| S8 | Max pooling | 256 | 6 × 6 | 3 × 3 | 2 | `valid` | –
| C7 | Convolution | 256 | 13 × 13 | 3 × 3 | 1 | `same` | ReLU
| C6 | Convolution | 384 | 13 × 13 | 3 × 3 | 1 | `same` | ReLU
| C5 | Convolution | 384 | 13 × 13 | 3 × 3 | 1 | `same` | ReLU
| S4 | Max pooling | 256 | 13 × 13 | 3 × 3 | 2 | `valid` | –
| C3 | Convolution | 256 | 27 × 27 | 5 × 5 | 1 | `same` | ReLU
| S2 | Max pooling | 96 | 27 × 27 | 3 × 3 | 2 | `valid` | –
| C1 | Convolution | 96 | 55 × 55 | 11 × 11 | 4 | `valid` | ReLU
| In | Input | 3 (RGB) | 227 × 227 | – | – | – | –
|=======
As you can see, I added the missing max pooling layer S8. Note that I had to rename layer F8 to F9, and layer F9 to F10, including in the sentence right after the table. Side note: if you want to use the Keras implementation at https://engmrk.com/alexnet-implementation-using-keras/, you should fix a few errors first:
* Kernel size of 2nd conv layer is 5x5, not 11x11
* Pool size is 3x3 in all max pool layers, not 2x2
* All conv layers should use SAME padding.
* AlexNet has 3 dense layers (including the output layer), not 4.
Also, I recommend using tf.keras when TF is the desired backend, instead of multi-backend Keras (i.e., you should use "from tensorflow import keras" instead of "import keras").
I wrote this corrected version: https://gist.github.com/ageron/a38c67add35ba8dfcf19bc0fa12e47f0 If you want the exact same model as the original one, you will need to add the Local Response Normalization layers, and also split the model in two as explained in the paper (to run each part on a different GPU). But of course more recent models perform better, so this is purely academic! :) One last thing: you mention that 13 × 13 × 256 ≠ 4096. With the additional max pooling layer, we now have 6 × 6 × 256 inputs going into the first fully connected layer. You might notice that 6 × 6 × 256 = 9216, not 4096. That's okay: 4096 is the number of units in the layer, not the number of inputs. Thanks again for your help!	Mohammed El-Beltagy	Nov 05, 2019	Nov 22, 2019
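The layer sizes in the corrected AlexNet table can be checked with a little arithmetic. The sketch below is not from the book: it assumes the usual output-size rules (`valid`: floor((n - k) / s) + 1, `same`: ceil(n / s)), and `out_size()` is a hypothetical helper written for this illustration.

```python
import math

def out_size(n, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer (hypothetical helper)."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1  # "valid": no zero padding

# (name, kernel size, stride, padding, feature maps), copied from the table
layers = [
    ("C1", 11, 4, "valid", 96),
    ("S2", 3, 2, "valid", 96),
    ("C3", 5, 1, "same", 256),
    ("S4", 3, 2, "valid", 256),
    ("C5", 3, 1, "same", 384),
    ("C6", 3, 1, "same", 384),
    ("C7", 3, 1, "same", 256),
    ("S8", 3, 2, "valid", 256),
]

size = 227  # input images are 227 x 227
sizes = {}
for name, kernel, stride, padding, maps in layers:
    size = out_size(size, kernel, stride, padding)
    sizes[name] = size
    print(f"{name}: {maps} maps of {size} x {size}")

# Inputs to the first fully connected layer: last size squared times last map count
n_inputs_f9 = size * size * maps
print(n_inputs_f9)  # 9216
```

The 4,096 in the table is the number of units in F9, while 9,216 is the number of inputs each of those units receives.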
	"Changes in the Second Edition," Numbered List Point 1	'covolutional' should be 'convolutional' (missing an 'n'). (I couldn't find page numbers in the Safari Books Online iPad app.) Note from the Author or Editor: Good catch, thanks. Fixed.	Leif Eric Fredheim	Jan 07, 2020	Mar 13, 2020
Other Digital Version	ch. 7 Code snippet before Extra-Trees section	"The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier: bag_clf = BaggingClassifier( DecisionTreeClassifier(splitter="random", max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)" splitter="random" makes this BaggingClassifier not equivalent to RandomForestClassifier since splits in RandomForestClassifier are not random, but best splits made on random subsets of features. The following snippet fixes the issue: bag_clf = BaggingClassifier( DecisionTreeClassifier(splitter="best", max_features="auto", max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1) With these parameters (and set random state) the predictions made by BaggingClassifier in 07_ensemble_learning_and_random_forests.ipynb will be identical to the predictions of RandomForestClassifier: >>> np.sum(y_pred == y_pred_rf) / len(y_pred) 1.0 Note from the Author or Editor: Thanks for your feedback, great analysis. I updated the code example to be: bag_clf = BaggingClassifier( DecisionTreeClassifier(max_features="auto", max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1) I left out splitter="best" since it is the default value (and the line overflow would require changing the page layout, which I try to avoid when possible).	Slava Ilin	Jan 12, 2020	Mar 13, 2020
	ch. 3 Tip under Figure 3-6	The tip ends by noting that the PR curve "could be closer to the top-left corner". Assuming you're referring to Figure 3-5, does this mean the top-right corner? That curve, of course, hits the top-left corner. In either case, it's still not entirely clear to me why the ROC is more affected by skewed data. Perhaps this tip could be expanded.	Peter Drake	Feb 26, 2020	Mar 13, 2020
Other Digital Version	Ch 3 (code) Cell 24	When I run all of the cells up through from sklearn.metrics import precision_score, recall_score precision_score(y_train_5, y_train_pred) in the Jupyter notebook from GitHub (running on Colab), I get 0.837..., not the 0.729... shown in the (Safari) book. I believe the problem occurs at least as early as cell 22 (the confusion matrix two cells earlier), which gives: array([[53892, 687], [ 1891, 3530]]) rather than the 53057, 1522, 1325, 4096 shown in the book. This makes cell 25, 4096 / (4096 + 1522) rather mysterious, as the numbers 4096 and 1522 now seem to come out of nowhere. Note from the Author or Editor: Thanks for your feedback. Indeed, making the code perfectly reproducible for several years turns out to be quite a challenge! Every time a new version of Scikit-Learn (or NumPy, Keras, TensorFlow, Matplotlib, Pandas) is released, I have to check all the notebooks to ensure they still produce the same output. The most common source of changes is when the default value of some hyperparameter is modified. For example, if the default number of iterations changes, then all the results change. I managed to keep up with this up to now by explicitly setting some of the hyperparameter values to their old default value (or in some cases, to their new default value, when they were announced in advance). You'll see some comments about this in the notebooks. Unfortunately, sometimes the algorithms themselves get tweaked slightly, and there's really nothing I can do about that. I was fortunate enough to be mostly spared by this problem for the 1st edition, but my luck ran out: * Scikit-Learn 0.21 fixed some bug in SGDClassifier (and many other models), so models now produce slightly different results (see https://scikit-learn.org/0.21/whats_new.html#id6). This happened a couple months after I had finished writing the book, and it was off to press. 
* As if this wasn't enough, TensorFlow 2.1 completely changed the way it generates random numbers, compared to TensorFlow 2.0. So pretty much all TensorFlow models give slightly different results now, and there's no going back. The only way to reproduce the exact results from the book is to revert to previous versions of Scikit-Learn and TensorFlow. However, I don't recommend this solution. It's preferable to just accept the fact that there will be (hopefully small) differences between the text and the results you get. In the short term, I'll add warnings to the Jupyter notebooks to explain that the results might differ slightly from the book (and explain why). Then when I have time, I'll run all the notebooks using the latest version of all libraries, and I'll update all the code examples in the book that need to be changed. Oh wow... This book is so much work... sigh... ;-) Thanks again for your help.	Peter Drake	Feb 26, 2020	Mar 13, 2020
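For readers comparing their own output with the book, the two confusion matrices quoted in this entry can be turned into precision and recall by hand. This is a minimal sketch using only the numbers above; `precision_recall()` is a hypothetical helper, and it assumes Scikit-Learn's binary layout `[[TN, FP], [FN, TP]]`.

```python
def precision_recall(confusion):
    """Return (precision, recall) for a binary matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    return tp / (tp + fp), tp / (tp + fn)

book_matrix = [[53057, 1522], [1325, 4096]]   # as printed in the book
newer_matrix = [[53892, 687], [1891, 3530]]   # as reported by the reader

p_book, r_book = precision_recall(book_matrix)
p_new, r_new = precision_recall(newer_matrix)

print(f"book:  precision={p_book:.3f} recall={r_book:.3f}")
print(f"newer: precision={p_new:.3f} recall={r_new:.3f}")
```

The first precision is the 4096 / (4096 + 1522) shown in the book (about 0.729); the second is the 0.837 the reader observed with a newer library version.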
	ch 10 Second bullet under "Creating the model using the sequential API"	You say that if Flatten "receives input data X, it computes X.reshape(-1, 1)". If applied to an individual data point (e.g., a Fashion MNIST image), wouldn't this turn the image into a column vector? Don't we want (1, -1) or, better yet, (-1,), to turn it into a row? This situation gets even more complicated if X is an entire input set, which is of shape (60000, 28, 28) in the Fashion MNIST example. We'd like it to end up (60000, 784), right? I can't see how (-1, 1) would do that. Note from the Author or Editor: Thanks a lot for your feedback. Indeed, this is an error. I should have written: "receives input data X, it computes X.reshape(-1, 28*28)". Fixed, thanks again!	Peter Drake	Mar 17, 2020	Aug 14, 2020
	Ch12 Below the walkthrough of custom loop	Below the walkthrough of the custom loop, it says "If you set the optimizer’s clipnorm or clipvalue hyperparameter, it will take care of this for you." I'm not sure whether "this" refers to the custom loop or to the clipping. A little more explanation here might help. Note from the Author or Editor: Thanks for your feedback. Indeed, this sentence was not very clear. I replaced it with this sentence: """ If you want to apply Gradient Clipping (see Chapter 11), just set the optimizer's `clipnorm` or `clipvalue` hyperparameter. """ This works both when using model.fit() and when writing a custom loop. If you need any other transformation of the gradients when writing a custom loop, just modify the gradients before calling apply_gradients(). Thanks again.	Chih	Apr 10, 2020	Aug 14, 2020
	Ch13 Putting Everything Together	Figure 13-2 says that repeat() is called right after list_files(). But in the code block, repeat() is called after shuffle(). I know the effect of calling repeat() on a shuffled dataset is mentioned earlier in that chapter. Does the difference between the figure and the code matter? Note from the Author or Editor: Thanks for your feedback. Indeed, there's a mismatch between figure 13-2 and the code. It's a bit more common to place the repeat() step after the shuffle() step (as in the code). I'm not sure why I placed it in the wrong position (note that shuffle() and map() are also reversed, ooooh dear). I'll fix the figure to match the code. Note that there is a small difference between repeat().shuffle(...) and shuffle(...).repeat(). This is best explained with an example: >>> import tensorflow as tf >>> [i.numpy() for i in tf.data.Dataset.range(4).repeat(2).shuffle(4)] [2, 0, 3, 2, 3, 1, 1, 0] >>> [i.numpy() for i in tf.data.Dataset.range(4).shuffle(4).repeat(2)] [0, 2, 3, 1, 0, 1, 3, 2] Notice that in the first case, the number 2 appears twice before the number 1 appears at all. In the second case, the first 4 elements will always include 0, 1, 2, 3. Thanks again!	Anonymous	Apr 14, 2020	Aug 14, 2020
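The difference between the two orderings can also be reproduced without TensorFlow. The sketch below is a simplified pure-Python stand-in for a buffer-based shuffle, not the tf.data implementation; `shuffle()`, `repeat()`, and the two wrapper functions are hypothetical names, and the model assumes the buffer is refilled and reshuffled on every repetition.

```python
import random

def shuffle(items, buffer_size, rng):
    """Simplified buffer shuffle: fill the buffer, then emit random elements from it."""
    buffer = []
    for item in items:
        if len(buffer) == buffer_size:
            yield buffer.pop(rng.randrange(buffer_size))
        buffer.append(item)
    while buffer:  # source exhausted: drain what is left in random order
        yield buffer.pop(rng.randrange(len(buffer)))

def repeat(make_items, count):
    """Iterate over a freshly created sequence `count` times."""
    for _ in range(count):
        yield from make_items()

def repeat_then_shuffle(seed):
    rng = random.Random(seed)
    return list(shuffle(repeat(lambda: range(4), 2), 4, rng))

def shuffle_then_repeat(seed):
    rng = random.Random(seed)
    return list(repeat(lambda: shuffle(range(4), 4, rng), 2))

print(repeat_then_shuffle(seed=0))   # elements from both passes can mix
print(shuffle_then_repeat(seed=0))   # each half is a permutation of 0..3
```

In this model, shuffling before repeating keeps epoch boundaries intact, whereas repeating first lets elements of the second pass enter the buffer before the first pass has been fully emitted.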
	Ch16 last line of the paragraph below figure16.9	The line says "the model would not be able to distinguish positions p = 25 and p = 35 (marked by a cross)." I think it should be p = 22 and p = 35 Note from the Author or Editor: Good catch, thanks.	Anonymous	May 04, 2020	Aug 14, 2020
Printed	Page xvii 3rd paragraph	The 3rd paragraph currently ends with the following: --in particular (hNumPy, pandas, and Matplotlib. There are two characters that are out of place "(h". It should be rewritten in one of the following two ways: (in particular, NumPy, Pandas, and Matplotlib). or --in particular, NumPy, pandas, and Matplotlib. Note from the Author or Editor: Thanks for your feedback. That's strange, I don't see this issue in my copy of the book (1st release of the 2nd edition). The source code (in AsciiDoc) for this paragraph is: """ This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries—in particular, http://numpy.org/[NumPy], http://pandas.pydata.org/[pandas], and http://matplotlib.org/[Matplotlib]. """ In printed copies, this should render as: """ This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries—in particular, NumPy (http://numpy.org/), pandas (http://pandas.pydata.org/), and Matplotlib (http://matplotlib.org/). """ This is exactly what I'm seeing in my printed copy. In electronic versions, you should see this: """ This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries—in particular, NumPy, pandas, and Matplotlib. """ In the "Version of product where error was found", you selected "Printed", but the text you are seeing looks like it's from the electronic version. Could you please confirm the version of the product (printed, ePub, etc.), and also specify which release you have? The release number can be found on the page immediately before the table of contents. Thank you. EDIT Apparently this typo was introduced during the production phase of one of the earlier releases, but it was quickly fixed. Sorry for the inconvenience.	Steve Anderson	Sep 17, 2020
Printed	Page xix 8th line	using covolutional neural networks... Note from the Author or Editor: Good catch, thanks! Fixed.	Laurent MAUUARY	Jan 08, 2021
	?? Right under "Training and Evaluating the Model"	When I fit the model (including on Google Colab), it shows progress out of 1719 rather than out of 55000 (as shown in the book), even though X_train has 55000 rows. What's going on? Note from the Author or Editor: Thanks for your question! Keras changed the way it displays progress during training since I wrote the book (after a bit of investigation, it looks like it happened in TensorFlow 2.2). Keras used to display the number of samples processed so far during the epoch (something like 38816/55000), but it now shows the number of batches processed so far. So if the batch size is 32 (which is the default) then there are math.ceil(55000/32)=1719 batches per epoch, so you would see 1213/1719 (instead of 38816/55000). I'll update the book to show the new format. Thanks a lot! Cheers, Aurelien	Peter Drake	Jan 20, 2021
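The numbers in this reply follow from simple arithmetic, sketched below with only the values quoted above (55,000 training instances and a batch size of 32).

```python
import math

n_samples = 55000
batch_size = 32  # the default batch size mentioned in the reply

# Number of batches per epoch, which is what the progress bar counts
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)  # 1719

# The last batch is smaller than the others
last_batch_size = n_samples - (steps_per_epoch - 1) * batch_size
print(last_batch_size)  # 24

# 38816 samples in the old display correspond to this many batches
batches_done = 38816 // batch_size
print(batches_done)  # 1213
```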
	Page ? Equations 2.1 and 2.2	Found these errors in the version available on learning.oreilly.com: 1. Equation 2.1 for RMSE is missing parentheses around h(x(i))-y(i). Under the square root, instead of h(x(i))-y(i)^2, it should be (h(x(i))-y(i))^2 https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch02.html#rmse_equation 2. Equation 2.2 for MAE is missing '|' around h(x(i))-y(i). It should be |h(x(i))-y(i)| inside the summation. https://learning.oreilly.com/library/view/Hands-On+Machine+Learning+with+Scikit-Learn,+Keras,+and+TensorFlow,+2nd+Edition/9781492032632/ch02.html#mae_equation Note from the Author or Editor: Thanks for your feedback. I fixed this error not long after you posted this erratum.	Kartikeya Jain	Jun 09, 2021
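For reference, the two corrected equations can be written out as follows. This is the standard form of RMSE and MAE rather than a copy of the book's typesetting; the symbol m (number of instances) and the summation bounds are assumptions, since the erratum only quotes the terms inside the sum.

```latex
\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^{2}}
\qquad
\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|
```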
	Page xxiii 3rd paragraph,last 2 characters	I bought the printed Chinese edition of this book and found a wrong translation in the preface. In Chinese, "square" should be translated as “平方” instead of “立方”. “平方” means a*a, whereas “立方” means a*a*a (cube), so the translation is wrong here. Note from the Author or Editor: Thanks for the catch, I notified the translator, they will fix this issue.	LIU Jinzhang	Aug 08, 2021
	Page n/a text	Current Copy ...So you need to monitor your model’s live performance. But how do you that? Well, it depends. In some cases, the model’s performance can... Suggested "live performance. But how do you that? Well, it depends." should be "live performance. But how do you do that? Well, it depends. Note from the Author or Editor: Good catch, thanks! I just fixed this.	Anonymous	Mar 25, 2022
	Page Chapter 3, Page 86 Second and Third paragraph	When using the Scikit-Learn helper function fetch_openml() to download the MNIST dataset, the second paragraph on page 86 says: Let's look at these arrays: X, y = mnist["data"], mnist["target"] In fact, when I run this code in my Jupyter notebook, X and y are not arrays: X is a pandas DataFrame and y is a pandas Series. So the code in the third paragraph, such as some_digit = X[0], raises a KeyError. Note from the Author or Editor: Thanks for your feedback. This issue comes from the fact that fetch_openml() used to return NumPy arrays, but at one point after the book was published, it started returning Pandas DataFrames instead. So it's not an error, but rather an outdated piece of code. Luckily, there's a very simple fix: when calling this function, you can specify the argument as_frame=False, and this will make the function return NumPy arrays, just like it used to, and all the code will work fine. I updated the code in the book and in the notebooks a couple years ago to add as_frame=False, so I'm guessing you have an older release of the book. It's pretty inevitable that some code will break over time, since libraries evolve. That's why I try to keep the notebooks as up to date as I can. If you run into broken code, please check the notebooks, as they'll usually contain the up-to-date code as well as a comment explaining why the code is different from the book. Hope this helps!	Alex Young	Aug 24, 2022
	ch 10 In the paragraph just before Figure 10-9.	"so" seems a typo in the first sentence : If each instance can belong only so a single class, out of 3 or more possible classes... Note from the Author or Editor: Nice catch, I just fixed this typo, thanks a lot.	Ami Ka	Apr 10, 2019	Sep 05, 2019
	ch 10 under "COMPILING THE MODEL"	It seems "sigmoid_crossentropy" is mistakenly used instead of "binary_crossentropy" in this sentence: If we were doing binary classification (with one or more binary labels), then we would use the "sigmoid" (i.e., logistic) activation function in the output layer instead of the "softmax" activation function, and we would use the "sigmoid_crossentropy" loss. Note from the Author or Editor: Good catch, thanks a lot, I just fixed this.	Ami Ka	Apr 11, 2019	Sep 05, 2019
	ch 11 before Unsupervised Pretraining	In a parenthesis: (which may be due to shear luck) shear luck to sheer luck Note from the Author or Editor: Indeed, it should be sheer instead of shear, thanks!	Ami Ka	Apr 28, 2019	Sep 05, 2019
	ch 11 Avoiding Overfitting Through Regularization>Learning Rate Scheduling>Power scheduling	Probably "k" in the formula should be replaced by "s". Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/k)^c. The initial learning rate η0, the power c (typically set to 1) and the steps s are hyperparameters. The learning rate drops at each step, and after s steps it is down to η0 / 2. After s more steps, it is down to η0 / 3. Then down to η0 / 4, then η0 / 5, and so on. As you can see, this schedule first drops quickly, then more and more slowly. Of course, this requires tuning η0, s (and possibly c). Note from the Author or Editor: Great catch thanks! Indeed, it should be η(t) = η0 / (1 + t/s)^c	Ami Ka	May 01, 2019	Sep 05, 2019
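The corrected power-scheduling formula is easy to check numerically. In the sketch below, `power_schedule()` is a hypothetical helper, and the values of η0 (0.01) and s (20) are arbitrary choices made for illustration, not values from the book.

```python
def power_schedule(t, eta0=0.01, s=20, c=1):
    """Learning rate at iteration t: eta0 / (1 + t / s) ** c."""
    return eta0 / (1 + t / s) ** c

# With c = 1, the rate is eta0/2 after s steps, eta0/3 after 2s steps, and so on
for t in (0, 20, 40, 60, 80):
    print(t, power_schedule(t))
```

The gaps between successive values shrink (η0/2, η0/3, η0/4, ...), which is the "drops quickly, then more and more slowly" behavior described in the quoted text.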
	ch 11 Dropout>Note	However, it you double it, inference time will also be doubled. to However, if you double it, inference time will also be doubled. Note from the Author or Editor: Thanks a lot, indeed, it's a typo. I just fixed it: should be "if you double" rather than "it you double".	Ami Ka	May 07, 2019	Sep 05, 2019
	ch 14 Convolutional Layer>TensorFlow Implementation>padding	Where "same" padding is explained( in parenthesis): ...In this case, the number of output neurons is equal to the number of input neurons divided by the stride, rounded up (in this example, 13 / 5 = 2.6, rounded up to 3). The mentioned example in the parenthesis doesn't have any number for the input size and also the stride is 1. Probably you meant the next example in Figure 14-7. Note from the Author or Editor: Great catch, thanks! I changed the bullet point like this: If set to "same", the convolutional layer uses zero padding if necessary. The output size is set to the number of input neurons divided by the stride, rounded up. For example, if the input size is 13 and the stride is 5 (see Figure 14-7), then the output size is 3 (i.e., 13 / 5 = 2.6, rounded up to 3). Then zeros are added as evenly as possible around the inputs, as needed. When `strides=1`, the layer's outputs will have the same spatial dimensions (width and height) as its inputs, hence the name _same_. Cheers, Aurélien	Ami Ka	Jun 28, 2019	Sep 05, 2019
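The "same" padding rule in the revised bullet point can be sketched in a few lines. This is an illustration under assumptions: `same_padding()` is a hypothetical helper, the kernel size of 6 is an assumed value (the erratum only gives the input size 13 and the stride 5), and the way the zeros are split follows the convention commonly described for TensorFlow, with any extra zero placed at the end.

```python
import math

def same_padding(input_size, kernel_size, stride):
    """Return (output_size, zeros_before, zeros_after) for "same" padding."""
    output_size = math.ceil(input_size / stride)
    # Total number of zeros needed so that the last window fits
    total = max((output_size - 1) * stride + kernel_size - input_size, 0)
    before = total // 2
    after = total - before  # the extra zero, if any, goes at the end
    return output_size, before, after

print(same_padding(13, kernel_size=6, stride=5))  # (3, 1, 2)
print(same_padding(13, kernel_size=3, stride=1))  # (13, 1, 1)
```

With a stride of 1, the output size equals the input size, which is where the name "same" comes from.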
	Ch10 Above Figure 10-15	Inputs A and B, shape attributes are wrong (should be 6, 5 not 5, 6) Note from the Author or Editor: Great catch, thanks! The problem was in the previous sentence, it was: "For example, suppose we want to send five features through the deep path (features 0 to 4), and six features through the wide path (features 2 to 7):" but the words "deep" and "wide" should have been reversed: "For example, suppose we want to send five features through the wide path (features 0 to 4), and six features through the deep path (features 2 to 7):" Thanks again, Aurélien	MNK	Jul 03, 2019	Sep 05, 2019
	Ch 16 Figure 16-9.	Sine/cosine positional embedding matrix (transposed, bottom) and a focus on two values of i (top) I think "bottom" and "top" are switched Note from the Author or Editor: Great catch, indeed they were. I just fixed this, thanks a lot!	Christopher Akiki	Jul 04, 2019	Sep 05, 2019
ePub	Page Ch14 CNN to tackle Fashion MNIST	padding='same'), instead of: padding='same',), Note from the Author or Editor: Good catch, not sure why there were extra commas there, I'm guessing I changed the order of the arguments. Side-note: as you may know, having a comma before the closing parenthesis is actually valid Python code (ugly code, but valid). It's even required for tuples with a single element, such as (42,). I also use this in lists, tuples or argument lists spanning multiple lines, such as this: a = ( "apples", "cherries", "bananas", ) This makes it easier to move lines around without getting syntax errors. But ...padding='same',) really does not make much sense. Cheers, Aurélien	MNK	Jul 05, 2019	Sep 05, 2019
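The trailing-comma cases mentioned in this reply can be tried directly. The sketch below uses plain Python; `conv_args()` is a hypothetical stand-in for a layer constructor, used only to show that a trailing comma in a call is accepted.

```python
single = (42,)      # one-element tuple: the comma is required
not_a_tuple = (42)  # just the integer 42 in parentheses

fruits = (
    "apples",
    "cherries",
    "bananas",  # trailing comma: lines can be reordered or appended safely
)

def conv_args(filters, kernel_size, padding="valid"):
    """Hypothetical stand-in for a layer constructor; returns its arguments."""
    return filters, kernel_size, padding

# Valid but odd on a single line, which is what the erratum removed
odd = conv_args(64, 3, padding="same",)
clean = conv_args(64, 3, padding="same")
print(odd == clean)  # True
```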
	Chapter 9 Paragraph before Figure 9.1	" This is where clustering algorithms step in: many of them can easily detect the top-left cluster. It is also quite easy to see with our own eyes, but it is not so obvious that the lower-right cluster is composed of two distinct sub-clusters." This description DOES NOT MATCH THE FIGURE. The TOP-RIGHT CLUSTER has two distinct sub-groups and the LOWER-LEFT CLUSTER easily stands out by itself. So as written, the text has a VERY confusing lack of correspondence with the figure. Note from the Author or Editor: Great catch, thanks a lot! Indeed, it should say "lower-left cluster" and "upper-right cluster", respectively. Here's the full correct sentence: This is where clustering algorithms step in: many of them can easily detect the lower-left cluster. It is also quite easy to see with our own eyes, but it is not so obvious that the upper-right cluster is composed of two distinct sub-clusters. Thanks again! Aurélien	Jim Lewis	Aug 11, 2019	Sep 05, 2019
	1 First line.	First sentence reads... "When most people hear 'Machine Learning,' they picture a robot: a dependable butler or a deadly Terminator, depending on who you ask." It's not "...who you ask," it's "... whom you ask." Should use proper English, at least in the very first sentence of the book. You would not say "You ask he," you'd say "You ask him." Note from the Author or Editor: Thanks for your feedback. As you might know, I am French, so please forgive my English mistakes. The he/him rule is very helpful. It's interesting that no one pointed out this error to me before, even though it's in the very first sentence! :) I think it goes to show that people are getting used to this mistake, to the point that many people on the Web seem to argue that "whom" now sounds too formal. Perhaps in a few decades it will no longer be considered a mistake. That said, of course, I've fixed the book now, thanks again!	Anonymous	Mar 21, 2020	Aug 14, 2020
	1 Chapter 3 - Threshold test	The following code is used to describe the effect of threshold adjustments on the recall. >>> threshold = 8000 >>> y_some_digit_pred = (y_scores > threshold) >>> y_some_digit_pred array([ True]) The result should be array([False]), as indicated on the GitHub project: https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb An output of 'array([ True])' would indicate that adjusting the threshold had no impact on the recall. Note from the Author or Editor: Great catch! Indeed, this was a copy/paste error, thanks for spotting it, I just fixed the book, the fix will be in the next release. I wrote a script that verifies that all the code examples in the book are present in the notebook, but right now it does not look at the outputs, I'll fix that. Thanks again! Aurélien	Hussein Khalil	Mar 25, 2019	Sep 05, 2019
	3 Chapter 3. Classification / Confusion Matrix / Equation 3-1. Precision	Sorry about my language. In Chapter 3. Classification / Confusion Matrix / Equation 3-1. Precision and Equation 3-2. Recall and Equation 3-3. F1 I do not see the division sign. Can you check all equations? Note from the Author or Editor: Thanks for your feedback. I'm guessing you are reading the book on the Safari Platform using the Chrome browser. Unfortunately, Chrome stopped supporting MathML, so the equations don't display properly. O'Reilly is working on fixing this, and I asked them to add a message to warn users. In the meantime you can work around this issue by using another browser: Firefox or Safari. Thanks for your understanding. 10/18/2019: the issue is now fixed in Chrome.	Alexander Morozov	Oct 16, 2019	Oct 18, 2019
PDF	Page 14 First paragraph - First line	an additional "ag" next to "is" : "Reinforcement Learning isag a very" -> "Reinforcement Learning is a very" Note from the Author or Editor: Good catch, thanks. I fixed this typo, it should be fine now in the electronic versions, and it will be correct in the 2nd release of the book (printed in October).	Safouane Chergui	Oct 07, 2019	Oct 18, 2019
Printed	Page 14 2nd line	Reinforcement Learning isag a very different beast. Note from the Author or Editor: Good catch, thanks!	Laurent Mauuary	Jan 11, 2021
	Page 22 Below pic 1-19	The decimal separator in the German version is wrong. In German, „,“ is the decimal separator but in 22,587 the „.“ must be used. (As it is not intended to be a decimal separator) Also it is used inconsistently in that spot. Note from the Author or Editor: Thanks for your feedback. I reported this error to the editor of the German translation, I expect they will fix it in the next reprint.	Alexander Trümper	Oct 10, 2021
PDF	Page 30 Bullet pt listing in "Underfitting the Training Data" section	The list of methods to counter underfitting is in plain text, while the analogous list with regards to overfitting in the previous section was highlighted in a warning/caution frame; might want to adjust. Note from the Author or Editor: Thanks, good point. I'll change the underfitting section to use a warning frame.	Hieronim Kubica	May 31, 2019	Sep 05, 2019
	Page 41 Penultimate paragraph	If the vector contains n elements then, v_{0} should be replaced by v_{1} and keep the v_{n} term, or if the v_{0} notation is preferred then v_{n} should be replaced by v_{n-1}. Note from the Author or Editor: Good catch, thanks a lot! I replaced v_{0}, v_{1}, ..., v_{n} with v_{1}, v_{2}, ..., v_{n}	Daniel Lopez Aguayo	Jun 06, 2021
Printed	Page 46 Section "Download the Data"	The function "fetch_housing_data()" needs the following import statement: import urllib.request rather than simply the 'import urllib' statement. As written, the function will generate the error: AttributeError: module 'urllib' has no attribute 'request'. I'm guessing this all works OK in Jupyter Notebook (haven't tried it, but I'm guessing that's why no one has heretofore commented), but the code as written won't work as a "normal" python script or import at the REPL (the import will work, but calling the function will throw the error listed above). Thanks. Note from the Author or Editor: Thanks for your feedback. Before Python 2 was deprecated, I made sure that all the notebooks supported both Python 2 and Python 3. For this, I had a few special imports, including: from six.moves import urllib This automatically imports both urllib and urllib.request. When Python 2 was terminated, I dropped Python 2 support and I removed the special compatibility code. In particular, I replaced the import above with: import urllib Unfortunately, this does not import urllib.request. Sorry about that! I've updated the notebooks and the book a few months ago. Thanks again! Aurelien	Andrew Boudreau	May 29, 2021
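A minimal sketch of the corrected import in use is shown below. It is not the book's `fetch_housing_data()` function: `download()` is a hypothetical helper, and a local `file://` URL stands in for the real download so that the example runs without network access.

```python
import tempfile
import urllib.request  # import the submodule explicitly; `import urllib` alone is not enough
from pathlib import Path

def download(url, target):
    """Hypothetical helper: save `url` to `target` and return the target path."""
    path, _headers = urllib.request.urlretrieve(url, target)
    return Path(path)

with tempfile.TemporaryDirectory() as tmp:
    # A local file plays the role of the remote dataset in this sketch
    source = Path(tmp) / "source.txt"
    source.write_text("hello")
    copy = download(source.as_uri(), Path(tmp) / "copy.txt")
    copied_text = copy.read_text()

print(copied_text)  # hello
```

In a plain script, the `import urllib.request` line is what makes `urllib.request.urlretrieve` available; inside Jupyter the submodule is often already loaded by other packages, which would explain why the problem went unnoticed there.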
	Page 46 last paragraph	The last paragraph on the page says "When you call fetch_housing_data(), it creates"... I read it as a prediction that I would call it in a later cell, and worked on until I got to figure 5, ran cell 5 and got an error. Oops! Back to cell 2 to append a call to fetch_housing_data() at the top level . Then it worked. As this is the first function in the whole exercise, you might want to show the call, and tell the reader "Now run the cell, and when fetch_housing_data() runs, it will create"... Note from the Author or Editor: Thanks for your feedback. Indeed, you're not the only one who didn't call fetch_housing_data(), I wasn't clear enough. I fixed this in the 3rd edition by having a single function (to download and load the data) and by explicitly showing that you need to call it.	Dave Collier-Brown	Oct 16, 2021
PDF	Page 47 End of virtualenv box	This is an error of omission. If we are going to be using Jupyter in a virtual environment, then we must also set up Jupyter to use the libraries associated with that environment. This requires the following two steps:
$ python3 -m pip install -U ipykernel
$ python3 -m ipykernel install --user --name=my_env
After that, when starting Jupyter you can select "my_env" and start working in that environment. Note from the Author or Editor: Thanks Mohammed, great catch! Since the ipykernel package is installed automatically along with jupyter, the first command is not required, but the second is important (at least if you plan to have more than one virtualenv, which is the whole point). I updated the book like this:
--------------------------------------------
$ python3 -m pip install -U jupyter matplotlib numpy pandas scipy scikit-learn
Collecting jupyter
  Downloading https://[...]/jupyter-1.0.0-py2.py3-none-any.whl
Collecting matplotlib
  [...]
If you created a virtualenv, you need to register it to Jupyter and give it a name:
$ python3 -m ipykernel install --user --name=python3
Now you can fire up Jupyter by typing the following command:
$ jupyter notebook
[...] Serving notebooks from local directory: [...]/ml
[...] The Jupyter Notebook is running at:
[...] http://localhost:8888/?token=60995e108e44ac8d8865a[...]
[...]  or http://127.0.0.1:8889/?token=60995e108e44ac8d8865a[...]
[...] Use Control-C to stop this server and shut down all kernels [...]
--------------------------------------------
Notice that I removed this section:
--------------------------------------------
To check your installation, try to import every module like this:
$ python3 -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"
There should be no output and no error.
-------------------------------------------- This is because I didn't want the layout of the book to be affected too much, and this paragraph is not necessary since users will notice if there are errors in the previous steps. Again, thanks a lot for your great feedback!	Mohammed El Beltagy	Oct 15, 2019	Nov 22, 2019
	Page 55 1st code sample	The split method returns a generator that produces index values, not labels. Therefore, both instances of "housing.loc" should be "housing.iloc". Note from the Author or Editor: Great catch, thanks. I didn't notice this issue because in this case, it works fine, but I agree that in general readers should use housing.iloc[index] instead of housing.loc[index]. I updated the book and the notebook.	Peter Salveson	Aug 16, 2022
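The difference between label-based and position-based indexing reported above can be illustrated with a small sketch. The DataFrame contents and the `positions` array below are made up for illustration (they stand in for the integer positions that a splitter's `split()` method yields); only `DataFrame.loc` and `DataFrame.iloc` are the real pandas accessors under discussion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data whose index labels are NOT 0..n-1
housing = pd.DataFrame(
    {"median_income": [1.5, 3.2, 4.8, 2.1]},
    index=[10, 20, 30, 40],
)

# Integer positions, like those produced by a splitter's split() method
positions = np.array([0, 2])

# iloc selects by position, so this picks the 1st and 3rd rows
by_position = housing.iloc[positions]

# loc selects by label; labels 0 and 2 do not exist in this index
try:
    housing.loc[positions]
    loc_failed = False
except KeyError:
    loc_failed = True

print(by_position)
print("loc raised KeyError:", loc_failed)
```

With a default `RangeIndex` the labels and positions coincide, which is presumably why the original code happened to work.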
Printed	Page 67 Second paragraph	"After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for one 1 per row." The resulting matrix has thousands of ROWS, but only 5 columns. The code output directly after this text gives an example. Note from the Author or Editor: Thanks for your feedback. I see how this paragraph can be confusing. Please let me clarify. The paragraph starts with: """ Notice that the output is a SciPy _sparse matrix_, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding, we get a matrix with thousands of columns, and the matrix is full of 0s except for a single 1 per row. [...] """ My goal here was to explain that one-hot encoding categorical attributes with thousands of categories will result in a matrix with thousands of columns, in which case it's useful to have a sparse matrix, and that's the reason why the `OneHotEncoder` produces a sparse matrix. The sentence "After one-hot encoding, we get..." is in the context of the previous sentence "This is very useful when you have categorical attributes with thousands of categories." But I see how it's possible to interpret the sentence "This is very useful..." as a side comment, independent from the following sentence. In this case, "After one-hot encoding..." would seem to refer to the actual output of the previous code example. I've rephrased the paragraph to make it clearer: """ Notice that the output is a SciPy _sparse matrix_, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories, since in this case one-hot encoding will produce a matrix with thousands of columns, and this matrix would be full of 0s, except for a single 1 per row. Using tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements. 
You can use it mostly like a normal 2D array, but if you really want to convert it to a (dense) NumPy array, just call the `toarray()` method: """ Thanks again for your feedback! Cheers, Aurelien	Rory Gamble	Mar 09, 2021
Printed	Page 86 Last line	Just a tiny detail here. There is an "import" command missing before the last instruction of the page. NumPy was not loaded yet. Note from the Author or Editor: Good catch, thanks. In later chapters I did not repeat all the imports, because I though it was redundant (after a while, I assume the reader understands what np stands for and how to import it), but in the earlier chapters, it's useful to spell everything out. Fixed. :)	Bruno Machado	Apr 02, 2020	Aug 14, 2020
ePub	Page 86 some_digit = X[0]	some_digit = X[0] ...causes the exception at the bottom of this text. The code instead should be the following: some_digit = X.values[0] ----------------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/ml/env/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2897 try: -> 2898 return self._engine.get_loc(casted_key) 2899 except KeyError as err: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 0 The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-18-b7a6042a4eea> in <module> 1 import matplotlib as mpl 2 import matplotlib.pyplot as plt ----> 3 some_digit = X[0] 4 some_digit_image = some_digit.reshape ( 28 , 28 ) 5 plt.imshow ( some_digit_image , cmap = mpl.cm.binary , interpolation = "nearest" ) ~/ml/env/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key) 2904 if self.columns.nlevels > 1: 2905 return self._getitem_multilevel(key) -> 2906 indexer = self.columns.get_loc(key) 2907 if is_integer(indexer): 2908 indexer = [indexer] ~/ml/env/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2898 return self._engine.get_loc(casted_key) 2899 except KeyError as err: -> 2900 raise KeyError(key) from err 2901 2902 if tolerance is not None: KeyError: 0 Note from the Author or Editor: Thanks for your feedback. Since Scikit-Learn 0.24, `fetch_openml()` returns a Pandas `DataFrame` by default, instead of a NumPy array. 
To avoid this and keep the same code as in the book, just specify `as_frame=False` when calling `fetch_openml()`. Unfortunately, that's not something that I could have foreseen when writing the book, as version 0.24 was released afterwards. Other little things like that may break over time, so when you run into an issue, please check the notebooks in the GitHub project (https://github.com/ageron/handson-ml2): I try to keep them as up to date as I can. For example, there's a warning about the `fetch_openml()` function in the notebooks, and they use `as_frame=False`. That said, I also updated the book so that future releases will use `as_frame=False` as well. Hope this helps, Aurelien	Francis Lui	Dec 29, 2020
	Page 91 3rd paragraph	Error in the Spanish version of the book. Instead of ...Si eliges la version 3 deberias calcular el valor medio....It should be...Si eliges la version 3 deberias calcular la mediana... The code uses the median, but the text talks about the mean in the Spanish version (the English version is OK). Further on it also talks about saving the average value, when it clearly means the median value. Note from the Author or Editor: Thanks for your feedback. I sent your feedback to the editor of the Spanish version, so I expect they will fix the error quickly (unless they've already done so).	Jose Miguel Gimenez	May 14, 2021
	Page 98 end of tip/suggestion element	With improvement, the PR curve would be closer to the top RIGHT corner, not top LEFT. The top left corner is optimal for an ROC curve. Note from the Author or Editor: Good catch, thanks. I fixed the book.	Omer Lang	Jun 17, 2021
Printed	Page 138 2nd paragraph	It says : "... the dashed line in the righthand plot in Figure 4-18 (with alpha = 10^-7) looks quadratic, almost linear." Actually, it does not look quadratic (maybe cubic?). Also, it is quite disputable that it looks "almost linear". Note from the Author or Editor: Indeed, good catch! You just made me realize that this figure changed slightly between the first edition and the second edition of the book, probably because of slight tweaks in Scikit-Learn's algorithms. Here is what the figure looks like in the first edition: https://snipboard.io/fBgiRw.jpg I've fixed the sentence to say "looks roughly cubic". Thanks again!	Ian Beauregard	Aug 13, 2020	Aug 14, 2020
Printed	Page 142 last line of code snippet before "Logistic Regression" heading	In the example implementation of early stopping, when a model with less error is encountered, the best_model variable is set to a clone of the current model. As I understand it, a clone is a duplicate of the model without data. Should best_model instead be set to a deepcopy of the current model, which includes the data (in particular the trained coefficients)? Note from the Author or Editor: Thanks for your feedback. Indeed, this was an error, I'm sorry about that. I fixed this error last year. So instead of: from sklearn.base import clone ... best_model = clone(sgd_reg) The code is now: from copy import deepcopy ... best_model = deepcopy(sgd_reg) If you run into an error in the code, please check the notebooks in my github repository at https://github.com/ageron/handson-ml2, as I try to keep them up to date, fixing bugs and updating to the latest libraries. Thanks again! Aurelien	Ben	Jan 02, 2021
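To make the `clone()` versus `deepcopy()` distinction above concrete, here is a minimal sketch. The synthetic data, the random seed, and the hyperparameter values are assumptions chosen for illustration; they are not taken from the book's early stopping example.

```python
from copy import deepcopy

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDRegressor

# Hypothetical synthetic regression data (not from the book)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y)

# clone() builds a new, unfitted estimator with the same hyperparameters
cloned = clone(sgd_reg)
# deepcopy() duplicates the whole object, including the learned coefficients
copied = deepcopy(sgd_reg)

clone_is_fitted = hasattr(cloned, "coef_")
copy_is_fitted = hasattr(copied, "coef_")
print("clone has coef_:", clone_is_fitted)
print("deepcopy has coef_:", copy_is_fitted)
```

Because the clone carries no trained weights, keeping it as `best_model` would lose the state that early stopping is meant to preserve.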
Printed	Page 143 Eq 4-13	(3rd release) In Eq 4-13, the bottom line of p143, and Eq 4-19, x^T \theta^{(k)} is used. To match the order of theta and x in other places, I suggest (\theta^{(k)})^T x or \theta^T x. Thanks Note from the Author or Editor: Thanks for your suggestion, I fixed the 3 instances you pointed out. FYI, I hesitated between "x^T theta" and "theta^T x" because the first linear equation in chapter 1 is written y = theta0 x0 + theta1 x1 + ..., which naturally translates to y = theta^T x. It would be weird to write y = x0 theta0 + x1 theta1 + ... However, when dealing with matrices, one typically writes y = X W: here, X has to appear first (and there's no transpose), because each row of X already corresponds to a transposed feature vector. I remember being confused the first time I saw this, so I wanted to quickly transition from theta-first to X-first. However, I was not careful enough, so I ended up having a confusing mixture of both! Oops... I think you're right that consistently using theta-first before we really tackle matrices is probably better.	Haesun Park	Mar 03, 2020	Aug 14, 2020
Printed	Page 154 Right graph of Figure 5-2.	The x-label "x0" must be replaced by "x'0" as both variables x0 and x1 are scaled in this graph. Same in the corresponding notebook 05_support_vector_machines.ipynb. Note from the Author or Editor: Good catch, thanks! I just fixed the book and the notebook.	Anonymous	Nov 26, 2020
Printed	Page 158 Last sentence	The book says : "The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials." I think it should say high-degree and low-degree TERMS instead of polynomials. Note from the Author or Editor: Good catch, thanks. I changed that sentence to: """ The hyperparameter `coef0` controls how much the model is influenced by high-degree terms versus low-degree terms. """	Ian Beauregard	Aug 15, 2020	Sep 18, 2020
Printed	Page 161 1st paragraph, above the figure	In chapter 5, pages 160 and 161, it says: So γ acts like a regularization hyperparameter: if your model is overfitting, you should reduce it, and if it is underfitting, you should increase it (similar to the C hyperparameter). As far as I know, to avoid overfitting, we must apply limitations to the method (increasing regularization) and vice-versa. It is also stated in the solution of exercise 9 in chapter 4. Note from the Author or Editor: Thanks for your feedback. By "regularization hyperparameter", I just meant that it is a hyperparameter that lets you control regularization. Perhaps for more clarity I should have said that it is a "reverse regularization hyperparameter", since reducing it increases regularization. I'll update the book.	Sajjad	Jan 22, 2020	Mar 13, 2020
PDF	Page 165 Under Equation 5-2	The following sentence: Figure 5-12 shows the decision function that corresponds to the model in the LEFT in Figure 5-4 Should be: Figure 5-12 shows the decision function that corresponds to the model in the RIGHT in Figure 5-4 This can be confirmed in the corresponding Jupyter notebook (https://github.com/ageron/handson-ml/blob/master/05_support_vector_machines.ipynb Input #10 and #31), both of which use the same variable name "svm_clf2". Note from the Author or Editor: Good catch, thanks. Indeed, it should be "right" instead of "left".	Nathan Young	Jun 15, 2020	Aug 14, 2020
Printed,	Page 173 First sentence at the top of the page, right underneath Equation 5-13.	After presenting Equation 5-13 (labelled: "Linear SVM classifier cost function"), the paragraph reads as follows: "The first sum in the cost function will push the model to have a small weight vector w, leading to a larger margin. The second sum computes the total of all margin violations." In the equation, there is only one summation. I believe what is meant to be said is that the first "term" of the cost function is responsible for the margin, and the second "term" of the cost function (which is the summation) is responsible for minimizing margin violations. When you refer to them as "first sum" and "second sum" it makes one think there should be two summations in the equation. Thank you! Note from the Author or Editor: Thanks for your feedback. I think I wrote "first sum" and "second sum" because in my mind the first term (1/2 w^T w) is actually a summation, since it is equal to 1/2 * (w_1^2 + w_2^2 + w_3^2 + ... + w_n^2). It's half of the sum of squares of the elements of w. But I agree that it's really not clear right now, so I'll write "first term" and "second term" instead, thanks again!	AJ	Nov 09, 2020
Printed	Page 184 Equation 6-4. CART cost function for regression	I think you need to divide the MSE equation by m. The current equation represents RSS. Some machine learning books use RSS in this case. However, scikit-learn uses MSE. https://scikit-learn.org/stable/modules/tree.html Note from the Author or Editor: Good catch, thanks. The definition of MSE_node should be divided by m_node. Fixed! You can see the new equation at https://github.com/ageron/handson-ml2/blob/master/book_equations.pdf	Rafael	Feb 04, 2021
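For reference, the corrected cost function described above can be written out as follows. This is a reconstruction from the note (MSE_node divided by m_node), assuming the usual CART notation with feature k, threshold t_k, and left/right subsets; the authoritative version is the one in the linked book_equations.pdf.

```latex
J(k, t_k) = \frac{m_{\text{left}}}{m}\,\text{MSE}_{\text{left}}
          + \frac{m_{\text{right}}}{m}\,\text{MSE}_{\text{right}}
\quad\text{where}\quad
\begin{cases}
\text{MSE}_{\text{node}} = \dfrac{1}{m_{\text{node}}} \displaystyle\sum_{i \in \text{node}} \left(\hat{y}_{\text{node}} - y^{(i)}\right)^2 \\[2ex]
\hat{y}_{\text{node}} = \dfrac{1}{m_{\text{node}}} \displaystyle\sum_{i \in \text{node}} y^{(i)}
\end{cases}
```

Without the 1/m_node factor, the inner sum is the residual sum of squares (RSS) that the submitter mentions.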
Printed	Page 197 1st paragraph in "Random Forests". 2nd sentence	The sentence reads "Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the RandomForest classifier class, [..]" The word instead is used twice in the same sentence. It should probably read "Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForest classifier class, [..]" Note from the Author or Editor: Good catch, thanks. Instead of two insteads, I prefer a single one instead. ;-)	Ricardo Blasco	Nov 19, 2020
Printed	Page 203 4th Paragraph	In this paragraph, it says: "Let's go through a simple regression example, using ..... (of course, Gradient Boosting also works great with regression tasks)." Instead of "regression tasks" (in the parentheses), it should probably say "classification tasks". Thanks! Note from the Author or Editor: Good catch, thanks, that's what I meant. Fixed. :)	AJ	Nov 19, 2020
PDF	Page 211 First paragraph	brew is deprecated and its github repo recommends DESlib as an alternative (https://github.com/scikit-learn-contrib/DESlib) Note from the Author or Editor: Thanks for your feedback, indeed brew is deprecated and DESlib looks like a great replacement. I updated the book, hopefully the change will make it to the 2nd release (printed this week), or else it will be the 3rd release.	Safouane Chergui	Oct 13, 2019	Oct 11, 2019
Printed	Page 233 Figure 8-13 Description	First, thank you for this amazing piece of work! I found a typo on page 233 in the Print format in the figure's description. It reads "Using various techniques to reduce the Swill roll to 2D", but it should be "Swiss roll" of course. Note from the Author or Editor: Good catch, thanks! :)	Alan Joonatan Rebane	Nov 26, 2020
	Page 243 2nd paragraph	When introducing inertia, you say "That metric is called the model’s inertia , which is the mean squared distance between each instance and its closest centroid." However, as I understand it, inertia is not a MEAN, but just the sum of the squared distances. (The same error occurs in the German translation.) Note from the Author or Editor: Good catch, thanks! The notebook was already correct, but the book had the mistake, so I just fixed it. Note that minimizing the mean of the squared errors leads to the same solution as minimizing the sum of the squared errors, so we would get the same result if the inertia was defined as the mean rather than the sum. Thanks again!	Angelo Profeta	Nov 25, 2021
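The sum-versus-mean point above can be checked numerically with a small sketch. The points and centroids are made-up values for illustration, and the computation is done directly with NumPy rather than through any clustering class.

```python
import numpy as np

# Hypothetical toy instances and centroids
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centroids = np.array([[0.0, 0.5], [5.5, 5.0]])

# Squared distance from every instance to every centroid: shape (m, k)
sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Squared distance from each instance to its closest centroid
closest = sq_dists.min(axis=1)

inertia = closest.sum()         # sum of squared distances
mean_sq_dist = closest.mean()   # what the book's text originally described

print(inertia, mean_sq_dist)
```

The two quantities differ only by the constant factor m (the number of instances), which is why minimizing either one leads to the same centroids.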
PDF	Page 245 second black dot	There should be a division sign "/" between D(x(i))2 and sum_{j=1}^{m} D(x(j))2 in the K-Means++ initialization algorithm. Note from the Author or Editor: Great catch, thanks. This was a latexmath rendering issue, I just fixed it.	Hao	May 20, 2019	Sep 05, 2019
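The role of the missing division sign is easy to see in code: the squared distances have to be normalized by their sum to form a probability distribution. The sketch below uses made-up one-dimensional points and a single already-chosen centroid as assumptions; it only computes the selection probabilities and does not implement the full initialization.

```python
import numpy as np

# Hypothetical toy instances and one centroid chosen so far
X = np.array([[0.0], [1.0], [4.0]])
chosen_centroids = np.array([[0.0]])

# D(x_i)^2: squared distance from each instance to its closest chosen centroid
d2 = ((X[:, None, :] - chosen_centroids[None, :, :]) ** 2).sum(axis=2).min(axis=1)

# Probability of picking x_i as the next centroid: D(x_i)^2 / sum_j D(x_j)^2
probs = d2 / d2.sum()

print(probs)
```

Instances far from the existing centroids get a higher probability, and an instance that coincides with a chosen centroid gets probability zero.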
Other Digital Version	251-252 first paragraph	Chapter 9 "Using Clustering for Preprocessing" talks about clustering as an efficient approach to dimensionality reduction. With the example chosen, without performing a preclustering on the training data, each data has 64 features. If we perform a preclustering (via a pipeline) with 50 clusters, this is effectively a dimensionality reduction as 50<64. But at the end of the section, if we eventually keep k = 99, can we still speak of a dimensionality reduction? However, I recognize that the accuracy gets better. Note from the Author or Editor: Thanks for your feedback. Indeed, you're absolutely right: it's not dimensionality reduction anymore if we keep k=99 while the original dimensionality was 64. :/ This section definitely deserved a bit of clarification, so I changed the introduction from: """ Clustering can be an efficient approach to dimensionality reduction, in particular as a preprocessing step before a supervised learning algorithm. """ to: """ Clustering can be an efficient preprocessing step before a supervised learning algorithm. """ Then later in the section, right after the sentence "How about that? We reduced the error rate by almost 30% (from about 3.1% to about 2.2%)!", I added the following sentence: """ The clustering step reduces the dataset's dimensionality (from 64 to 50 dimensions), but the performance boost comes mostly from the fact that the transformed dataset is closer to being linearly separable than the original dataset, and therefore it is much easier to tackle with Logistic Regression. """ And I removed "But" in "But we chose the number of clusters k arbitrarily". Hopefully this will be much clearer. Thanks again for your helpful feedback.	Olivier Lourme	Jun 25, 2020	Aug 14, 2020
Printed	Page 285 Ch 10, page 285, last phrase	In the book it is said that a Perceptron with two inputs and three outputs with a step function is a multioutput classifier. I think this Perceptron is a multilabel classifier, since each output is binary rather than a number. Note from the Author or Editor: Good catch! Indeed, the sentence should be: """ This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multilabel classifier. """ Thanks a lot!	Chakib BELAFDIL	Sep 04, 2020	Sep 18, 2020
Printed	Page 286 Equation 10-2	In this equation, the argument to phi is written as XW + b. Here the product will have m rows (one for each instance) and n_out columns (one for each AN in the output layer). It seems to me that the addition in this expression can only be correctly understood in the context of (something like) Numpy broadcasting rules which will operate on b so that it's the same shape as the result of the XW product. Since this isn't written as part of a code snippet, I suggest adding something to the third bullet of the explanation of the equation to make it clear what's going on. Somewhat similarly, the application of phi to a matrix with shape (m, n_out) to get another (m, n_out) matrix is pretty clear in the context of Numpy code, but less clear here. Maybe something like "Here \phi is being applied to each element separately." could be a good addition to the 4th bullet? Thanks for a terrific book! Note from the Author or Editor: Thanks for your feedback. I thought I explained broadcasting earlier in the book, but I couldn't find where. The only mentions I found are in Appendix A (in the solution to exercise 10.6) and in chapter 16 (when discussing Positional Encoding). So I added a footnote when introducing the bias vector b. I wrote this: "In mathematics, the sum of a matrix (XW) and a vector (b) is undefined. However, in Data Science, we allow "broadcasting": we add the vector to every row in the matrix."	Ken Basye	Mar 12, 2020	Aug 14, 2020
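The broadcasting behavior described in the footnote can be sketched in NumPy as follows. The matrix values are arbitrary, and ReLU is used as a stand-in for the activation function phi purely as an assumption; any elementwise function would illustrate the same point.

```python
import numpy as np

# Hypothetical values: m = 3 instances, n = 2 inputs, n_out = 3 neurons
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # shape (3, 2)
W = np.array([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]])     # shape (2, 3)
b = np.array([0.5, -0.5, 0.0])                        # shape (3,)

# XW has shape (m, n_out); b is broadcast, i.e. added to every row
Z = X @ W + b

# The same result with the broadcasting written out explicitly
Z_explicit = X @ W + np.tile(b, (X.shape[0], 1))

# phi is applied to each element separately (ReLU chosen for illustration)
H = np.maximum(Z, 0.0)

print(Z)
print(H.shape)
```

Both the sum and the activation keep the (m, n_out) shape, one row per instance and one column per output neuron.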
Printed	Page 289 2nd paragraph	The sentence starting with "The field of Deep.." is incomplete or has confusing syntax. Note from the Author or Editor: Thanks for your feedback. The sentence you are referring to is: "The field of Deep Learning studies DNNs, and more generally models containing deep stacks of computations." At first I really couldn't see what was confusing about this sentence, but then I noticed that the word "models" could be interpreted as a verb. If you read it as a verb, then indeed the sentence makes no sense at all. It sounds like "Deep Learning studies DNNs, and more generally Deep Learning models X", but X is nonsensical ("containing deep stacks of computations"!?). But in fact the word "models" should be read as a noun: so the sentence says that "Deep Learning studies DNNs, and more generally it studies models that contain deep stacks of computations". I've added a few words to make the sentence clearer: "The field of Deep Learning studies DNNs, and more generally it is interested in models containing deep stacks of computations". Hope this is clearer!	Kayla Pennerman	Mar 26, 2021
Printed,	Page 302 Last paragraph on page	Instead of "If we were doing binary classification (with one or more binary labels)" This should be "If we were doing multi-label classification (with one or more binary labels)" Note from the Author or Editor: Thanks for your feedback. Indeed, this could have been clearer. I changed the sentence to: "If we were doing binary classification or multilabel binary classification"	Hamel Husain	Apr 03, 2020	Aug 14, 2020
Printed	Page 304 3rd paragraph	(2nd release) "... set the `sample_weight` arguement (it supersedes `class_weight`)." Actually tf.keras uses `sample_weight` x `class_weight`. Please check https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training_utils.py#L1035 Thanks. Note from the Author or Editor: Great catch! I replaced "(it supersedes `class_weight`)" with "(if both `class_weight` and `sample_weight` are provided, Keras multiplies them)". Thanks for your help!	Haesun Park	Nov 15, 2019	Nov 22, 2019
Printed	Page 306 Using the model to make predictions.	We scaled the training/validation set features by dividing by 255.0. To obtain accurate performance metrics on the test set, we should also apply the same pre-processing step. current code: model.evaluate(X_test, y_test) ... X_new = X_test[:3] y_proba = model.predict(X_new) Note from the Author or Editor: Good catch, thanks! In the Jupyter notebook, the test set is properly scaled, but for some reason I did not include that line in the book. On page 298, just after scaling the training set and the validation set, I just added the following line in the book: X_test = X_test / 255.0	Francisco Javier Perez Leon	Jan 14, 2020	Mar 13, 2020
Printed	Page 306 1st paragraph	The book says "you should be able to reach close to 89% validation accuracy" if you continue training. However, on page 304, before the tip, the book says that the validation accuracy already reached 89.26% after 30 epochs. Training for 30 more epochs, I got 89.42% accuracy. Note from the Author or Editor: Good point, thanks. I replaced 89% with 89.4%.	Ian Beauregard	Sep 14, 2020	Sep 18, 2020
Printed	Page 314 Next to the last paragraph	In the code example for saving a model to HDF5 file, the first line should contain `keras.models.Sequential` instead of `keras.layers.Sequential`. Note from the Author or Editor: Great catch! Indeed, I meant to write `keras.models.Sequential` instead of `keras.layers.Sequential`. Thanks!	Dmitry Kabanov	Nov 14, 2019	Nov 22, 2019
Printed	Page 325 Question 10	I suggest replacing "98% precision" with "98% accuracy". Note from the Author or Editor: Good catch, thanks. I meant accuracy, not precision.	Ian Beauregard	Sep 17, 2020
Printed	Page 328 Exercise 2	A closing parenthesis is missing before the OR operator on the last line. Note from the Author or Editor: Good catch, thanks. This should indeed have been: A xor B = (A and not B) or (not A and B) Replacing "xor", "and" and "not" with the appropriate symbols. Fixed! :)	Ian Beauregard	Sep 11, 2020	Sep 18, 2020
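The corrected identity can be verified exhaustively over the four input combinations. The function name `xor_from_and_or_not` is a hypothetical helper introduced here for illustration.

```python
from itertools import product


def xor_from_and_or_not(a: bool, b: bool) -> bool:
    """Compute A xor B as (A and not B) or (not A and B)."""
    return (a and not b) or (not a and b)


# Build the full truth table for the identity
truth_table = {
    (a, b): xor_from_and_or_not(a, b)
    for a, b in product([False, True], repeat=2)
}

for (a, b), result in truth_table.items():
    print(a, b, result)
```

For Booleans, `a != b` is exclusive or, so comparing the table against it confirms the parenthesization.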
Printed	Page 329-330 Last sentence	(2nd release) "... plotting the error, and finding the point where the error shoots up)." I think that it's better 'loss' instead of 'error', because Learning Rate section use 'loss' to explain how to find learning rate. Thanks. Note from the Author or Editor: Good point, I replaced "error" with "loss" in this sentence.	Haesun Park	Nov 15, 2019	Nov 22, 2019
Printed	Page 329 and 731 Exercise 6 and solution to Exercise 6	"Weight vector" should be replaced by "weight matrix" on both pages 329 and 731. On page 731, the first sentence following the colon should probably get its own item in the list (letter 'a') and on the last item in the list, Y should be boldfaced (now printed as Y). Note from the Author or Editor: Great catches! Yes, I should have written "weight matrix" instead of "weight vector" on pages 329 and 731. I fixed the first formatting issue in February, it should be fine in the latest releases of the book. I just fixed the second issue (the Y in the last bullet point should be a boldface Y, not Y). Thanks a lot.	Ian Beauregard	Sep 16, 2020
	Page 337 1st bullet point	It should be "The hyperparameter α defines the opposite of the value" instead of "The hyperparameter α defines the value". Note from the Author or Editor: Good catch, thanks! Indeed, it defines the opposite of the value.	Anonymous	Aug 27, 2021
Printed	Page 338 1st line under 1st code block	(3rd release) "LeakyRelu(alpha=0.2)" should be "LeakyReLU(alpha=0.2)". Thanks. Note from the Author or Editor: Good catch, thanks. It should indeed read LeakyReLU(alpha=0.2).	Haesun Park	Apr 03, 2020	Aug 14, 2020
Printed	Page 341 2nd line over the note.	(3rd release) For "TFLite's optimizer does this automatically", I suggest changing 'optimizer' to 'converter', because we often say TFLite converter (as in Ch. 19) and it's better to avoid confusion with Keras optimizers. Thanks. Note from the Author or Editor: Good point, it's clearer with "TFLite's converter". Thanks!	Haesun Park	Apr 03, 2020	Aug 14, 2020
Printed	Page 344 2nd to last paragraph	"... but the `fit()` method sets to it to 1" should be "... but the `fit()` method sets it to 1." Note from the Author or Editor: Good catch, thanks. Indeed, it should have been "...but the `fit()` method sets it to 1".	Ian Beauregard	Sep 23, 2020
PDF	Page 347 last paragraph	"you clone model A’s architecture with clone.model()" => clone_model() instead of clone.model() Note from the Author or Editor: Good catch, thanks! I fixed this typo, it should be good in the next reprint.	Safouane Chergui	Oct 19, 2019	Nov 22, 2019
Printed	Page 347 last paragraph	The following is the original last line of the last paragraph: To do this, you clone model A’s architecture with clone.model(), then copy its weights (since clone_model() does not clone the weights): But there is no such function clone.model() it should be clone_model(). Note from the Author or Editor: Great catch, thanks. Indeed, it should be clone_model(), not clone.model().	Dhruba Ray	Jul 12, 2020	Aug 14, 2020
Printed	Page 354 Figure 11-6	I suggest adding a negative sign before η∇_1 and η∇_2. Note from the Author or Editor: Oh yikes, you're absolutely right! Thanks, I'm updating the figure now.	Ian Beauregard	Sep 24, 2020
Printed	Page 356 Eq. 11-8	(2nd release) T should be t in 3rd and 4th eq., because next sentence is ".. t represents the iteration number (starting at 1).". Thanks. Note from the Author or Editor: Great catch! Indeed it should be a lowercase italic _t_. Thanks!	Haesun Park	Nov 15, 2019	Nov 22, 2019
Printed	Page 357 Below AdaMax	(2nd release) ".. the gradients in s (with a greater weight for more recent weights)." I'm not sure, but did it mean 'recent gradients'? Thanks Note from the Author or Editor: Great catch, I meant to write "recent gradients", not "recent weights". Thanks!	Haesun Park	Nov 15, 2019	Nov 22, 2019
Printed	Page 368 code example	In the code example on page 367 you create a sequential keras model called "model". On page 368 you call this model directly on the test set as follows: model(X_test_scaled, training=True) Perhaps I missed something, but I don't remember any explanation about what happens when you call a sequential model directly on a test set. (I'm assuming the model has been compiled and fit in the meanwhile, but that this code was omitted for brevity.) I would expect to see a method call, such as: model.evaluate(X_test_scaled, training=True) I expect this is just a typo (omitting the method)? If this is indeed the intended code, could you clarify what it means to call such a model directly? Thanks for the great book! Note from the Author or Editor: Thanks for your feedback. That's a great question. A Keras model can be used like a regular Keras layer (in Chapter 12, we see how this makes it possible to easily compose models containing other models). Just like any layer, you can thus pass any NumPy array or TF tensor to a model directly, using the model like a function (you can do this with any layer, as we saw in the Functional API): X = tf.constant([...]) # or np.array([...]) model(X) # returns a TensorFlow tensor model.predict(X) # returns a NumPy array model(X) is similar to model.predict(X) except it returns a TF tensor rather than a NumPy array. Another difference is that model(X) can be used in the Functional API, while model.predict(X) cannot. For example: input_A = keras.layers.Input(...) output_A = model(input_A) enclosing_model = keras.Model(inputs=[input_A, ...], outputs=[output_A, ...]) Lastly, the `training` argument is only available when using model(X), such as in model(X, training=True). This argument is not available when calling model.predict(X). 
To clarify this, I replaced the following sentences: """ We just make 100 predictions over the test set, setting `training=True` to ensure that the `Dropout` layer is active, and stack the predictions. Since dropout is active, all the predictions will be different. Recall that `predict()` returns a matrix with one row per instance and one column per class. """ with these: """ Note that `model(X)` is similar to `model.predict(X)` except it returns a tensor rather than a NumPy array, and it supports the `training` argument. In this code example, setting `training=True` ensures that the `Dropout` layer remains active, so all predictions will be a bit different. We just make 100 predictions over the test set, and we stack them. Each call to the model returns a matrix with one row per instance and one column per class. """ Thanks again!	Willem	Apr 09, 2020	Aug 14, 2020
Printed,	Page 373 None	I have the printed book, and at the end of chapter 11, there are no questions 9 and 10. But on GitHub and in Appendix A, there are references to questions 9 and 10. There are also no questions 9 and 10 on the website. Note from the Author or Editor: Thanks for your feedback. Indeed, I fixed the appendix A to say that the solution to question 8 is available on github (there are no questions 9 and 10). I also pushed the solution to this exercise on github.	Yagel	Dec 18, 2019	Mar 13, 2020
Printed	Page 379 Under 'Using TensorFlow like NumPy'	(2nd release) "A tensor is usually a multidimensional array (exactly like a NumPy ndarray), but it can also hold a scalar(a simple value, such as 42)". This seems to imply that NumPy can't hold a scalar, but as you may know, there is an array scalar in NumPy. ```python s = np.array(3) print(s, type(s)) ``` 3 <class 'numpy.ndarray'> Thanks. Note from the Author or Editor: Good point, people could indeed interpret this as meaning that NumPy does not support scalars. I rewrote the sentence like this: A tensor is very similar to a NumPy `ndarray`: it is usually a multidimensional array, but it can also hold a scalar (a simple value, such as `42`).	Haesun Park	Nov 15, 2019	Nov 22, 2019
Printed	Page 381 Last sentence in box	"Here is as simple example" should be "Here is a simple example." Note from the Author or Editor: Good catch, thanks.	Ian Beauregard	Sep 28, 2020
Printed	Page 383 RaggedTensor block	(3rd release) For "Represent static list of lists of tensors", what does "static list" mean? As you know, a RaggedTensor is a tensor that behaves like a nested variable-length list. Also, the text says "every tensor has the same shape and data type", but the lists in a RaggedTensor can have different shapes. Thanks. Note from the Author or Editor: Thanks for your question. By "static" I meant "immutable". I updated the book to remove the word "static", as most data structures are immutable anyway (except for Queues and TensorArrays).	Haesun Park	Apr 03, 2020	Aug 14, 2020
Printed	Page 386 1st bullet	(2nd release) In last sentence, other possible values are "sum" and "none" instead of None. Please check https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/losses/loss_reduction.py#L57 Thanks. Note from the Author or Editor: Good catch, indeed it should be: Other possible values are `"sum"` and `"none"`. Instead of: Other possible values are are `"sum"` and `None`. Thanks!	Haesun Park	Nov 15, 2019	Nov 22, 2019
Printed	Page 387 "# return value is just tf.nn.softplus(z)"	``` # The softplus function as defined is technically not equivalent to tf.nn.softplus. # The former is not numerically stable whereas the latter is. # Please refer to https://stackoverflow.com/questions/44230635/avoid-overflow-with-softplus-function-in-python # for details on numerically stable softplus, as well as the code below. # (Note: the code is mine; the stackoverflow answer is not.) import numpy as np import tensorflow as tf softplus_numpy = lambda a: np.log(np.exp(a)+1.0) softplus_tensorflow = lambda a: tf.math.log(tf.exp(a)+1.0) softplus_numpy_numerically_stable = lambda a: np.maximum(a,0)+softplus_numpy(-np.abs(a)) softplus_tensorflow_numerically_stable = lambda a: tf.maximum(a,0)+softplus_tensorflow(-tf.abs(a)) a = 10.0**(np.arange(9)-4) a = np.array([a, -a]) print(a,'\n') print(softplus_numpy(a),'\n') print(softplus_numpy_numerically_stable(a),'\n') print(softplus_tensorflow(a),'\n') print(softplus_tensorflow_numerically_stable(a),'\n') print(tf.nn.softplus(a)-softplus_numpy_numerically_stable(a),'\n') print(tf.nn.softplus(a)-softplus_tensorflow_numerically_stable(a),'\n') ``` output: ``` [[ 1.e-04 1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03 1.e+04] [-1.e-04 -1.e-03 -1.e-02 -1.e-01 -1.e+00 -1.e+01 -1.e+02 -1.e+03 -1.e+04]] [[6.93197182e-01 6.93647306e-01 6.98159681e-01 7.44396660e-01 1.31326169e+00 1.00000454e+01 1.00000000e+02 inf inf] [6.93097182e-01 6.92647306e-01 6.88159681e-01 6.44396660e-01 3.13261688e-01 4.53988992e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00]] [[6.93197182e-01 6.93647306e-01 6.98159681e-01 7.44396660e-01 1.31326169e+00 1.00000454e+01 1.00000000e+02 1.00000000e+03 1.00000000e+04] [6.93097182e-01 6.92647306e-01 6.88159681e-01 6.44396660e-01 3.13261688e-01 4.53988992e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00]] tf.Tensor( [[6.93197182e-01 6.93647306e-01 6.98159681e-01 7.44396660e-01 1.31326169e+00 1.00000454e+01 1.00000000e+02 inf inf] [6.93097182e-01 
6.92647306e-01 6.88159681e-01 6.44396660e-01 3.13261688e-01 4.53988992e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00]], shape=(2, 9), dtype=float64) tf.Tensor( [[6.93197182e-01 6.93647306e-01 6.98159681e-01 7.44396660e-01 1.31326169e+00 1.00000454e+01 1.00000000e+02 1.00000000e+03 1.00000000e+04] [6.93097182e-01 6.92647306e-01 6.88159681e-01 6.44396660e-01 3.13261688e-01 4.53988992e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00]], shape=(2, 9), dtype=float64) tf.Tensor( [[ 1.11022302e-16 0.00000000e+00 -1.11022302e-16 1.11022302e-16 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00] [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 -5.88857305e-18 3.72007598e-44 0.00000000e+00 0.00000000e+00]], shape=(2, 9), dtype=float64) tf.Tensor( [[ 1.11022302e-16 0.00000000e+00 -1.11022302e-16 1.11022302e-16 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00] [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 -5.88857305e-18 3.72007598e-44 0.00000000e+00 0.00000000e+00]], shape=(2, 9), dtype=float64) /Users/mlhull5148/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: overflow encountered in exp if __name__ == '__main__': ``` Note from the Author or Editor: Thanks for your great feedback and working code. Indeed, this numerical instability is definitely worth noting in the book. I replaced: def my_softplus(z): # return value is just tf.nn.softplus(z) return tf.math.log(tf.exp(z) + 1.0) With: def my_softplus(z): # note: tf.nn.softplus(z) better handles large inputs return tf.math.log(tf.exp(z) + 1.0) Thanks again! Aurelien	Chris Coffee	Aug 18, 2020	Sep 18, 2020
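The numerical-stability point in the entry above can be checked without TensorFlow. The following is a minimal pure-Python sketch, not the book's code: `naive_softplus` and `stable_softplus` are hypothetical names, and the stable variant uses the same identity as the submitter's code, softplus(z) = max(z, 0) + log(1 + exp(-|z|)).

```python
import math

def naive_softplus(z):
    # Direct formula log(exp(z) + 1); math.exp overflows for large z.
    return math.log(math.exp(z) + 1.0)

def stable_softplus(z):
    # Rewritten so that exp() only ever receives a non-positive argument.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

for z in (-1000.0, 0.0, 10.0, 1000.0):
    try:
        naive = naive_softplus(z)
    except OverflowError:
        # Plain Python raises here, whereas NumPy returns inf with a warning.
        naive = float("inf")
    print(z, naive, stable_softplus(z))
```

For moderate inputs the two functions agree; only the naive version breaks down for large positive inputs.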
Printed	Page 392 Footnote 8	(2nd release) keras.activations.get() is available in multi-backend keras. Please check https://github.com/keras-team/keras/blob/master/keras/activations.py#L211 "You could use use `keras.activations.Activation` instead" should be "You could use use `keras.layers.Activation` instead". Thanks Note from the Author or Editor: Great catch, indeed I meant to write `keras.layers.Activation` instead of `keras.activations.Activation`, thanks!	Haesun Park	Nov 15, 2019	Nov 22, 2019
Printed	Page 421 code at top of the page	In the code for def csv_reader_dataset(...): dataset.shuffle(..) and .repeat(...) should be before .map(...) and .interleave(...), respectively, to agree with Figure 13.2 and avoid having to preprocess the whole shuffle buffer before being able to produce a batch. Note from the Author or Editor: Nice catch! I'll make the code consistent with the figure. :) Note however that it makes little difference in terms of performance: the map() function does not actually transform the whole dataset before passing the data on to the next step. Instead, it transforms just what is needed for the next steps, on the fly. You can think of each step as a queue which waits until a consumer (i.e., the next step) tries to pull elements out of it before it pulls elements from the previous queue. So it's easier to understand when you start from the end of the pipeline: looking at the code, the prefetch() method pulls from the batch() method, which pulls from the shuffle() method, which pulls from the map() method, and so on. The batch() method only pulls as many elements as are required to fill the batch, so whether the map() method or the shuffle() method is first makes little difference: in both cases, if the batch size is 32, then only 32 items will be shuffled by the shuffle() method and preprocessed by the map() method (except when pulling the first batch, which requires first filling up the shuffle buffer, in both cases).	Wolfram Helwig	Jan 21, 2020	Mar 13, 2020
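The pull-based behavior described in the note above can be imitated with plain Python generators. This is only an analogy under stated assumptions, not the tf.data implementation: `mapped`, `shuffled`, `batched`, and `items_mapped_per_batch` are hypothetical helpers, and the shuffle buffer is a simplified model of a fixed-size buffer.

```python
import random

def mapped(source, fn, counter):
    # Lazily apply fn, counting how many items are actually transformed.
    for item in source:
        counter[0] += 1
        yield fn(item)

def shuffled(source, buffer_size, rng):
    # Keep buffer_size items in reserve; emit a random one each time a new item arrives.
    buffer = []
    for item in source:
        buffer.append(item)
        if len(buffer) > buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)
    yield from buffer

def batched(source, batch_size):
    # Pull only as many items as needed to fill one batch.
    chunk = []
    for item in source:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def items_mapped_per_batch(map_first, n_batches=3, buffer_size=10, batch_size=4):
    # Count how many items the map step processes for each batch pulled.
    counter = [0]
    rng = random.Random(42)
    source = iter(range(1000))
    if map_first:
        stream = shuffled(mapped(source, lambda x: 2 * x, counter), buffer_size, rng)
    else:
        stream = mapped(shuffled(source, buffer_size, rng), lambda x: 2 * x, counter)
    pipeline = batched(stream, batch_size)
    counts = []
    for _ in range(n_batches):
        before = counter[0]
        next(pipeline)
        counts.append(counter[0] - before)
    return counts

print(items_mapped_per_batch(map_first=True))
print(items_mapped_per_batch(map_first=False))
```

In this sketch, every batch after the first maps exactly `batch_size` items in either order. The two orderings differ only on the first batch, where mapping before shuffling also has to map the items that fill the shuffle buffer.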
Printed	Page 439 1st paragraph	(3rd Release) "the final vector will be [1/log(200), 0/log(10), 2/log(100)]" is not correct. The TextVectorization class uses `log(1 + total_num_of_docs / (1 + num_of_docs_which_contain_word))` to compute IDF. Please check https://github.com/tensorflow/tensorflow/blob/da5765ebad2e1d3c25d11ee45aceef0b60da499f/tensorflow/python/keras/layers/preprocessing/text_vectorization.py#L770 Thanks. Note from the Author or Editor: Thanks a lot for your feedback. There are so many variants of TF-IDF. The TextVectorization class uses f * log(1 + N/(1+n)), where: * f is the number of occurrences of the term in the document * N is the total number of documents * n is the number of documents where the term occurs. This variant of TF-IDF is not listed in the TF-IDF Wikipedia page: https://en.wikipedia.org/wiki/Tf%E2%80%93idf The Term-Frequency part (f) is standard (it's called the "raw count" in the Wikipedia page), however the Inverse-Document-Frequency part (log(1+N/(1+n))) is not. It is close to log(N/n), which is the "default" IDF, but it uses 1+n instead of n, probably to avoid a possible division by zero, and it adds 1 to N/(1+n), probably to avoid approaching log(0). I think these extra +1s are a bit too much of a technical detail to mention in the TF-IDF paragraph in the book, but I changed the paragraph to present the proper IDF term log(N/n) rather than 1/log(n), which is not listed in the Wikipedia page (I remember trying to make things extra simple, but I probably went too far, as this variant is not listed in the Wikipedia page). Here is the updated paragraph: """ [...] This is often done using a technique called _Term-Frequency_ × _Inverse-Document-Frequency_ (TF-IDF). There are many variants, but a common one consists in computing the ratio of training instances in which the word appears, and multiplying the word count by the log of the inverse of that ratio. 
For example, let's imagine that the words `"and"`, `"basketball"`, and `"more"` appear respectively in 90%, 10%, and 50% of all text instances in the training set: in this case, the final vector will be `[1*log(1/0.9), 0*log(1/0.1), 2*log(1/0.5)]`, which is approximately equal to `[0.1, 0.0, 1.4]`. The `TextVectorization` layer will have an option to perform TF-IDF. """ Thanks again!	Haesun Park	Dec 12, 2019	Aug 14, 2020
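The arithmetic in the updated paragraph can be reproduced with a few lines of plain Python. This is a minimal sketch of the two formulas quoted in the entry above, not the `TextVectorization` implementation: the function names are hypothetical, and the document counts passed to the second variant are made-up illustration values.

```python
import math

def tfidf_book(count, doc_ratio):
    # Variant in the updated paragraph: count * log(1 / ratio of instances containing the word).
    return count * math.log(1.0 / doc_ratio)

def tfidf_text_vectorization(count, n_docs, n_docs_with_term):
    # Variant quoted in the note above: f * log(1 + N / (1 + n)).
    return count * math.log(1.0 + n_docs / (1.0 + n_docs_with_term))

counts = [1, 0, 2]        # occurrences of "and", "basketball", "more"
ratios = [0.9, 0.1, 0.5]  # fraction of training instances containing each word
vector = [tfidf_book(c, r) for c, r in zip(counts, ratios)]
print([round(v, 1) for v in vector])

# Illustration values only: 2 occurrences, 100 documents, 49 of which contain the term.
print(tfidf_text_vectorization(2, 100, 49))
```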
Printed	Page 441 In a tip box	(3rd release) The `load()` function doesn't shuffle shards by default (`shuffle_files=False`). Please check https://www.tensorflow.org/datasets/api_docs/python/tfds/load The test set can be shuffled too, if it uses multiple shards. Please check https://github.com/tensorflow/datasets/blob/845e4d0e1dfa73060ab2f6cfdf7ba342434e4def/tensorflow_datasets/image/celeba.py#L148 Note from the Author or Editor: Thanks for your feedback. When I run tfds.load(...) with an old version of TFDS (1.2.0), I get the following warning: WARNING:absl:Warning: Setting shuffle_files=True because split=TRAIN and shuffle_files=None. This behavior will be deprecated on 2019-08-06, at which point shuffle_files=False will be the default for all splits. So it seems that the logic changed since I wrote that chapter. I updated the tip to this: The `load()` function can shuffle the files it downloads: just set `shuffle_files=True`. However, this may be insufficient, so it's best to shuffle the training data some more.	Haesun Park	Dec 12, 2019	Mar 13, 2020
Printed	Page 451 First paragraph	The sentence "across all the previous layers' feature maps" should be "across all the previous layer's feature maps" Note from the Author or Editor: Good catch! Indeed, this typo really changes the meaning of the sentence. It should have been: across all the previous layer's feature maps To avoid any ambiguity, I rephrased this as: across all the feature maps of the previous layer Thank you very much! :)	Houman Kamali	Dec 06, 2020
Printed	Page 453 Eq. 14-1	(3rd release) x_{i', j', k'} \cdot w_{u, v, k', k} should be x_{i', j', k'} \times w_{u, v, k', k} Thanks. Note from the Author or Editor: Indeed, it would make it a bit clearer that this is a multiplication, not a dot product. And more consistent with the right part of the equation. Fixed, thanks! :)	Haesun Park	Dec 12, 2019	Mar 13, 2020
Printed	Page 458 1st paragraph	(3rd release) "(but there is still 75% invariance)" should be "(but there is still 50% invariance)". Thanks. Note from the Author or Editor: Nice catch, indeed, 50% of the output pixels remain unchanged, and 50% change. Fixed, thanks!	Haesun Park	Dec 12, 2019	Mar 13, 2020
Printed	Page 462 1st bullet	(3rd release) I suggest that "no stride" is replaced with "stride 1" to prevent misunderstanding. Thanks. Note from the Author or Editor: Indeed, it's clearer. Fixed, thank you.	Haesun Park	Dec 12, 2019	Mar 13, 2020
PDF	Page 466 3rd to last paragraph	The AlexNet hyper-parameters for local response normalization do not seem to match up to what's mentioned in the paper. In Section 3.3 of the paper the hyper-parameters are set at k=2, r=5 (which is called n in the paper), alpha=0.0001, and beta=0.75 but in the textbook they're set at k=1, r=2, alpha=0.00002, and beta=0.75. Note from the Author or Editor: Good catch, thanks! Mmh, I wonder where I got these wrong numbers from, I certainly didn't invent them. I suspect I was looking at a specific AlexNet implementation. Or maybe I just needed more coffee... Anyway, thanks again, this is fixed now.	Amrit Purshotam	Jun 10, 2020	Aug 14, 2020
Printed	Page 491 Last paragraph in mAP box	(3rd release) COCO makes no distinction between AP and mAP. But I suggest that "(noted AP@[.50:.95] or AP@[.50:0.05:.95])" is replaced with "(noted mAP@[.50:.95] or mAP@[.50:0.05:.95])" to match the sentence "Yes, that's a mean mean average". :) Thanks. Note from the Author or Editor: Indeed, they seem to use both AP@ or mAP@. Changed, thanks!	Haesun Park	Dec 12, 2019	Mar 13, 2020
Printed	Page 501 Last sentence	I suggest replacing "more complex than in Figure 15-4 suggests" with "more complex than what Figure 15-4 suggests". Note from the Author or Editor: Good catch, thanks.	Ian Beauregard	Oct 09, 2020
Printed	Page 510 footnote 2	filter_size=1 should be kernel_size=1 Note from the Author or Editor: Good catch, thanks. Indeed, it should be kernel_size, not filter_size.	Wolfram Helwig	Jan 15, 2020	Mar 13, 2020
Printed	Page 527 code after second paragraph	It should be tokenizer.fit_on_texts(shakespeare_text) instead of tokenizer.fit_on_texts([shakespeare_text]) in order that the subsequent call dataset_size = tokenizer.document_count # total number of characters really returns the number of characters in shakespeare_text (1115394). Otherwise (with square brackets), data_size will be equal to the number of submitted documents (1 in this case). Note from the Author or Editor: Good catch, thanks a lot! Indeed, the code should be: tokenizer.fit_on_texts(shakespeare_text) instead of: tokenizer.fit_on_texts([shakespeare_text]) I fixed the code in the book (note that the code in the Jupyter notebook was correct). Thanks again!	Christoph Brauer	Nov 30, 2019	Mar 13, 2020
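The difference reported above comes down to what gets iterated. The sketch below is a plain-Python analogy, assuming (as the entry describes) that the tokenizer counts one document per item of the iterable it receives; `count_documents` is a hypothetical stand-in and not part of Keras.

```python
def count_documents(texts):
    # One "document" is counted per item yielded by texts.
    return sum(1 for _ in texts)

text = "First Citizen"
print(count_documents(text))    # iterating a string yields its characters
print(count_documents([text]))  # a one-element list yields a single document
```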
Printed	Page 530 First paragraph	I suggest replacing the first full sentence of the page with : "Then we can batch the windows and separate the inputs (the first 100 characters) from the targets (the last 100 characters)." At present, the sentence reads "... from the target (the last character)." Note from the Author or Editor: Oh wow, great catch! Indeed, the current text does not match the code example. :/ Thanks a lot!	Ian Beauregard	Oct 15, 2020
Printed	Page 533 1st paragraph	"Window" should be "windows" in the sentence "... and the following batch would not continue each of these window where it left off". Note from the Author or Editor: Good catch, thanks.	Ian Beauregard	Oct 15, 2020
Printed	Page 535 2nd paragraph	"Start-of-sequence (SSS)" should probably be "start-of-sequence (SOS)", considering the code block that follows. Note from the Author or Editor: Good catch, that was a typo, it should be SoS, not SSS. Thanks!	Ian Beauregard	Oct 16, 2020
Printed	Page 551 11th line from the bottom	(2nd release) "(i.e., h_(f)) rather than h_(t-1))" should be "(i.e., h_(f) rather than h_(t-1))" Thanks Note from the Author or Editor: Good catch, thanks. However, it's h(t), not h(f): (i.e., h_(t) rather than h_(t-1))	Haesun Park	Feb 05, 2020	Mar 13, 2020
Printed	Page 556 6th line from the bottom	(2nd release) The "Attention Is All You Need: The Transformer Architecture" section uses both "positional encoding" and "positional embedding". The paragraph that starts with "The positional embedding are simply dense vectors..." explains the component in Figure 16-8, so I suggest changing it to "positional encoding". TensorFlow has a positional_embedding layer, but I think that positional encoding is the more common term. How about using just one of the two terms? :) Thanks Note from the Author or Editor: Good point, thanks. After checking the original paper, it seems that they consistently use the term "Positional Encoding", except when they talk about "Trainable Positional Embeddings". So I replaced every occurrence of the word "Positional Embedding" with "Positional Encoding", including in the code example on page 558.	Haesun Park	Feb 05, 2020	Mar 13, 2020
Printed	Page 557 Last paragraph	"... and represented at the bottom of Figure 16-9 (transposed)..." I think "bottom" should be replaced with "top". Note: In my copy of the book, this mistake was corrected in the caption of Figure 16-9, but not in the body of the text. Note from the Author or Editor: Good catch, thanks.	Ian Beauregard	Oct 16, 2020
Printed	Page 558 1st paragraph	The word "bottom" should be replaced with "top" in "... the vertical dashed line at the bottom left of Figure 16-9..." Note from the Author or Editor: Good catch, thanks.	Ian Beauregard	Oct 17, 2020
Printed	Page 560 2nd line	(2nd release) "d_values is the number of each value" should be "d_values is the number of dimensions of each value". Thanks Note from the Author or Editor: Good catch, thanks! Fixed.	Haesun Park	Feb 05, 2020	Mar 13, 2020
Printed	Page 562 picture at the top (figure 16-10)	In the picture it looks like the linear transformation happens before the copying for each scaled dot-product attention head. This would mean that every head gets the same input, which would be useless. Rather, the inputs should be copied first, then transformed differently for each head. This is also what the equivalent figure in the current version of ‘Attention Is All You Need’ shows. Note from the Author or Editor: Thanks for your feedback, this is a great observation. I think the reason why the paper originally had a figure which showed a Linear step followed by a Split step (for the value V, the key K and the query Q), is that it was probably the way they implemented the algorithm. In the updated diagram, they now hide this implementation detail to focus more on what the algorithm does, conceptually. Let me explain what I mean using NumPy. Suppose you want to apply two different linear transformations A and B to the same inputs X: import numpy as np X = np.array([[10., 20.], [30., 40.]]) A = np.array([[2., 3., 4.], [5., 6., 7.]]) B = np.array([[8., 9., 10.], [11., 12., 13.]]) One approach is to compute this: R1 = X @ A R2 = X @ B Recall that @ represents matrix multiplication. In this example, this gives the following results: >>> R1 array([[120., 150., 180.], [260., 330., 400.]]) >>> R2 array([[300., 330., 360.], [680., 750., 820.]]) Now, another approach is to concatenate A and B horizontally into a new matrix M, then compute X @ M: M = np.concatenate([A, B], axis=1) R = X @ M Notice that R is just the horizontal concatenation of R1 and R2: >>> R array([[120., 150., 180., 300., 330., 360.], [260., 330., 400., 680., 750., 820.]]) So all we need to do to get R1 and R2 is to split R appropriately: R1 = R[:, 0:3] R2 = R[:, 3:6] You can see that this approach gives the same result as earlier. 
One advantage of this approach is that it requires a single big matrix multiplication, rather than multiple small ones, so it is faster, especially on a GPU. Moreover, the concatenation step is not needed in practice, since we don't need to have multiple transformation matrices in the first place: a single big matrix will do (it's a single trainable variable instead of multiple ones). However, this is an implementation detail, so it's probably best left out of the book (just as the authors of the paper judged that it was best left out of the paper). I'll get the latest version of the diagram for the next release of my book. Thanks again for your feedback!	Richard Möhn	Dec 16, 2019	Mar 13, 2020
Printed	Page 588 Equation 17-3 and following paragraph	Equation 17-3 (p. 588) has variable K, but the following paragraph doesn't define K and instead defines n, which is not in the equation Note from the Author or Editor: Great catch, thanks. Indeed, the K should be an n in this equation, as well as in equation 17-4. Fixed!	Patrick Coulombe	Jan 26, 2020	Mar 13, 2020
PDF	Page 602 2nd paragraph	In the second paragraph the book says: "For example, when growing the generator’s outputs from 4 × 4 to 8 × 8 (see Figure 17-19), an upsampling layer (using nearest neighbor filtering) is added to the existing convolutional layer, so it outputs 8 × 8 feature maps, which are then fed to the new convolutional layer (which uses "same" padding and strides of 1, so its outputs are also 8 × 8). This new layer is followed by a new output convolutional layer: this is a regular convolutional layer with kernel size 1 that projects the outputs down to the desired number of color channels (e.g., 3)." It seems that you are talking about an Upsampling layer, a conv layer with same padding and kernel size not equal to 1 and a final conv layer with kernel size 1. However, I can't see any conv block before output conv layer (with kernel size 1). Did I miss something? Can you clarify this issue? Thank you very much. Note from the Author or Editor: Thanks for your feedback, I'm sorry this section wasn't clear enough. This paragraph describes what is added in the right side of Figure 17-19 compared to the left side. This includes the Upsampling layer, plus the two new Convolutional layers (with dashed borders), and the components needed to perform the "fade-in" operation (i.e., the alpha operation, the (1-alpha) operation, and the + operation). The 4 other layers are just the same as the ones on the left part of the figure: this includes the Noise layer, the Dense layer, the Conv 1 layer and the original Output Conv Layer (the one with a solid border). If the transition was abrupt, without any fade-in mechanism, then we would just remove the original output layer (called "Out conv" with solid border) instantly, and just add the new layers directly: the Upsampling layer and the two new convolutional layers (with dashed borders), and there would be no need for the fade-in operations. 
Another thing that might have confused you is the fact that the original convolutional layer now outputs 8x8 feature maps. This is not because it was changed in any way, it's just because it now receives 8x8 inputs instead of 4x4 inputs. It really is exactly the same "Out conv" layer as on the left side of the figure. I hope this helps! I'll see what I can do to make this clearer in the book. Thanks again for your feedback.	Hadi	Sep 10, 2020	Sep 18, 2020
Printed	Page 627 Equation 18-1 (1st release)	If I'm not mistaken it should be \sum_{s'} not \sum_{s} Note from the Author or Editor: Oh, great catch, thanks a lot, this was a typo. Equation 18-1 should sum over s', not over s. FYI, there's also a typo in equation 18-3: it should say "for all (s, a)", not "for all (s' a)".	Julien Theron	Apr 28, 2020	Aug 14, 2020
Printed	Page 628 Eq. 18-3	(2nd release) "for all (s' a)" should be "for all (s', a)". Thanks Note from the Author or Editor: Great catch! Actually, it should be "for all (s, a)" Thanks!	Haesun Park	Feb 05, 2020	Mar 13, 2020
Printed	Page 636 code snippet (training_step)	Even though the code runs without a problem, the algorithm won't be properly trained because the loss is computed incorrectly. The loss function in this case (mean_squared_error) expects two lists of lists: one being the Q_values list (which is correct) and the other the target_Q_values (here is the problem). For a quick fix to test, you could do something like this: target_Q_values = [[el] for el in target_Q_values] Now if you compare the two (I tested with 10,000 iterations), you should see a great difference. Note from the Author or Editor: Thanks a lot for your feedback, that's a great catch. Indeed, target_Q_values should be a column vector. I added the following line just after the definition of target_Q_values, to convert it from a 1D array to a column vector: target_Q_values = target_Q_values.reshape(-1, 1) I fixed the book and the notebook, and I added a comment about this in the notebook.	Lukas Schmidt	Dec 17, 2019	Mar 13, 2020
Printed	Page 640 Last line	In "a transition (s, r, s')", I believe 'r' should be replaced with 'a'. Note from the Author or Editor: Good catch, thanks.	Ian Beauregard	Oct 26, 2020
Printed	Page 646 5th paragraph	(2nd release) VideoWrapper is not yet implemented. :) Thanks Note from the Author or Editor: Thanks for your feedback. Indeed, apparently the VideoWrapper was removed. I removed it from the book.	Haesun Park	Feb 05, 2020	Mar 13, 2020
Printed	Page 648 footnote 20	(2nd release) "Pink is actually a mix of blue and red" should be "Pink is actually a mix of white and red". Thanks Note from the Author or Editor: Thanks for your feedback. You're right, I should have written "purple" instead of "pink". Fixed! :)	Haesun Park	Feb 05, 2020	Mar 13, 2020
Printed	Page 655 3rd paragraph	(2nd release) "add_method()" should be "add_batch()". Thanks. Note from the Author or Editor: Good catch, thanks! I meant to say "the add_batch() method" but I wrote "the add_method() method". I'm guessing it was 2am. ;-) Fixed!	Haesun Park	Feb 05, 2020	Mar 13, 2020
Printed	Page 663 Title of the last item	Unlike the other items in the list, in the last item, i.e., the Proximal Policy Optimization, the abbreviation comes before the link. Note from the Author or Editor: Good point, thanks. Fixed, it looks nicer now. :)	Athanasios Kyritsis	Apr 18, 2020	Aug 14, 2020
Printed	Page 692 12th line from the bottom	(3rd release) 12th line from the bottom: "Click Metric, click None to uncheck all locations" should be "Click Metric, click None to uncheck all metrics". 3rd line from the bottom: "Then click the Location drop-down menu, click None to uncheck all metrics" should be "Then click the Location drop-down menu, click None to uncheck all locations". Thanks Note from the Author or Editor: Great catches, thanks! Fixed.	Haesun Park	Mar 03, 2020	Mar 13, 2020
Printed	Page 693 3rd paragraph	(3rd release) I don't understand what's the meaning of "e.g., you can create handy widgets using special comments in your code". Is it https://colab.research.google.com/notebooks/widgets.ipynb? Please let me know about the special comments. Thanks. Note from the Author or Editor: Thanks for your feedback. I meant "handy forms". Check out https://homl.info/colabforms I changed the text in parentheses to: (e.g., you can create handy forms using special comments in your code) And "create handy forms" points to https://homl.info/colabforms	Haesun Park	Mar 03, 2020	Mar 13, 2020
Printed	Page 724 Third from last line	Bold face I printed as I Just a minor issue — thanks for the great book! Note from the Author or Editor: Good catch, thanks! The asciidoc code was: –I~_m_~ I changed it to: –I~_m_~ It should be better. :)	Sebastian Huber	Mar 22, 2020	Aug 14, 2020
Printed	Page 724 Solution to Exercise 7	I think both matrices appended to matrix A' should be -I_m. Note from the Author or Editor: Great catch. Indeed, you are right, it should be -I_m both at the top and bottom. Thank you!	Ian Beauregard	Aug 18, 2020	Sep 18, 2020
	Page 725 No. 5	(2nd Release) In no. 5 solution, 'log' should be 'log_2'. Thanks! Note from the Author or Editor: Thanks for your feedback. I'll fix this now. As a side note, it does not actually change the result in this case, since log_2(x) is proportional to log(x). Specifically: log_2(x) = log(x) / log(2) Therefore 10 * log_2(10m) / log_2(m) is actually equal to 10 * log(10m) / log(m). In fact, since log_m(x) = log(x) / log(m), the answer simplifies to: 10 * log_m(10m) Just being pedantic, don't mind me! ;-)	Haesun Park	May 14, 2022
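The author's side note, that the base of the logarithm cancels in the quotient, is easy to confirm numerically. This is a small sketch with a hypothetical helper `ratio` and an arbitrary illustration value for m.

```python
import math

def ratio(m, log_fn):
    # 10 * log(10m) / log(m); the base of the logarithm cancels in the quotient.
    return 10 * log_fn(10 * m) / log_fn(m)

m = 1_000_000  # arbitrary illustration value
print(ratio(m, math.log2), ratio(m, math.log), ratio(m, math.log10))
```

All three calls print the same value up to floating-point rounding, which also equals 10 * log_m(10m).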
	Page 730 Last line	It should be A ⊕ B = (A ⋁ B) ⋀ (¬ A ⋁ ¬ B) instead of A ⊕ B = (A ⋁ B) ⋀ (¬ A ⋁ ⋀ B), what you can check by applying the distributive property of ⋀ over ⋁ in the right side of the first equation. Note from the Author or Editor: Thanks for your feedback. I fixed this typo.	Anonymous	Jun 23, 2021
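The corrected identity can also be checked by brute force over the four truth assignments. A minimal sketch, where `xor_via_cnf` is a hypothetical name for the right-hand side of the corrected equation:

```python
from itertools import product

def xor_via_cnf(a, b):
    # Right-hand side of the corrected identity: (A or B) and (not A or not B).
    return (a or b) and ((not a) or (not b))

for a, b in product([False, True], repeat=2):
    # a != b is exclusive OR for booleans.
    print(a, b, a != b, xor_via_cnf(a, b))
```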
Printed	Page 731 Solution to exercise 3	Maybe I am wrong about this, but I think that there is a mistake in stating "a Logistic Regression classifier will converge to a good solution" on a dataset that is not linearly separable. Logistic regression is linear in the sense that the decision boundary is linear (which is also stated on p. 147). So I don't think that it necessarily finds a good solution on such a dataset. Or am I missing something? Thanks for the great book though - I've learned so much from reading it! :) Note from the Author or Editor: Thanks for your feedback. Sorry, you're right, I meant to say a "reasonably good linear decision boundary", not a solution which finds a non-linear decision boundary, as that's impossible, as you rightly point out. Let me explain: suppose you have a linearly separable dataset, except for a single outlier which is "on the wrong side". A Perceptron will just break down, and not converge at all. A Logistic Regression classifier will "do the right thing" and converge despite the outlier. The linear decision boundary it will converge to will often be good enough, but of course this really depends on the dataset and the task. I'll clarify this paragraph. Thanks again!	Mona Rahn	Mar 27, 2020	Aug 14, 2020
Printed	Page 745 Answer to question 5	In the sentence "Another benefit is that the alignment scores makes the model...", the word "makes" should be "make". Note from the Author or Editor: Good catch, thanks!	Anonymous	Oct 18, 2020
Printed	Page 753 4th line	(3rd release) ParameterServerStrategy performs data parallelism, but the text says it is "useful to train huge model that don't fit in GPU RAM". Isn't that a description of model parallelism? Thanks Note from the Author or Editor: Good catch, thanks. Indeed, I must have lost my train of thought back then, as it really looks like I switched to model parallelism in the very last sentence. :/ Here's a better answer: """ However, it can be useful in some situations, especially when you can take advantage of the asynchronous updates, for example to reduce I/O bottlenecks. This depends on many factors, including hardware, network topology, number of servers, model size, and more, so your mileage may vary. """	Haesun Park	Apr 03, 2020	Sep 18, 2020
Printed	Page 762 Equations C-1 and C-4	As someone already pointed out, the right-hand side of Equation C-4 should be multiplied by -1. Specifically, if you start from Equation C-1 and plug the results from Equations C-3 therein, what you will get is Equation C-4, but with the right-hand side multiplied by -1. It is however correct to say that the current form of the function written at Equation C-4 should be minimized. Consequently, the correct form (current form multiplied by -1) should be maximized. Indeed, in the dual form of the SVM problem, we should first find w and b that minimize the Generalized Lagrangian, with fixed alpha (as was done with the operations leading the Equation C-4). But then, we should find alpha that MAXIMIZES (rather than minimizes) the Generalized Lagrangian (evaluated at w* and b* as found previously). If you look at Equation C-1, you can see that the second term on the right-hand side is always negative if the constraints are respected. So there is no minimum with respect to alpha. Note from the Author or Editor: Thanks a lot for your feedback and for the detailed explanation. I wrongly thought it wasn't an error the first time this was reported, because I figured that minimizing -L was equivalent to maximizing +L, but of course when plugging the results from equation C-3 into the Generalized Lagrangian from equation C-1, we get reversed signs compared to what I had in equation C-4. My sincere apologies to whoever was misled by this error. I've now fixed equation C-4 to invert the signs, I replaced "minimizes" with "maximizes" and I also specified that this is also subject to \sum_{i=1}^m \alpha^{(i)} t^{(i)} = 0. I trust it's all good now. :) Thanks again!	Ian Beauregard	Aug 16, 2020	Sep 18, 2020
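For reference, the corrected dual problem described in the entry above (signs inverted, maximized rather than minimized, with the extra equality constraint) matches the standard hard-margin SVM dual. The block below is a sketch in common textbook notation, assuming the symbols used in the entry (\alpha^{(i)}, t^{(i)}) and writing the instances as \mathbf{x}^{(i)}; the book's exact typesetting of Equation C-4 may differ.

```latex
\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad
  & \sum_{i=1}^{m} \alpha^{(i)}
    - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
      \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)}
      {\mathbf{x}^{(i)}}^{\top} \mathbf{x}^{(j)} \\
\text{subject to} \quad
  & \alpha^{(i)} \ge 0 \quad \text{for } i = 1, \dots, m, \\
  & \sum_{i=1}^{m} \alpha^{(i)} t^{(i)} = 0
\end{aligned}
```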
Printed, PDF	Page 763 equation C-4	The primal problem is to minimize Equation C-1, but a negative sign is missing on page 763 to derive equation C-4. Since our initial target is to minimize the Lagrangian, now we should maximize C-4. At the same time, the second equation in C-3 is a constraint for the dual problem. What is more, the equations of the third bullet times a^(i) are also constraints for the dual problem. The equation in chapter 5 is also incorrect. I have a very small request: when you use a symbol, please define it before use. For example, n_s is not defined on page 763. It should be the number of support vectors found in the problem. Note from the Author or Editor: Thanks for your excellent feedback, I really appreciate it! > The primal problem is to minimize Equation C-1, but a negative sign is missing on page 763 to derive equation C-4. Since our initial target is to minimize the Lagrangian, now we should maximize C-4. Unless I overlooked something, I think the sign is correct in equation C-4: in the sentence following this equation, I mentioned that the goal is to minimize the loss, not maximize it. We could reverse the sign and try to maximize the equation instead, but it's really equivalent. > At the same time, the second equation in C-3 is a constraint for the dual problem. What is more, the equations of the third bullet times a^(i) are also constraints for the dual problem. Good catch, thanks a lot. I need to add "and \sum_{i=1}^m \alpha^{(i)} t^{(i)} = 0" at the end of equation C-4. > The equation in chapter 5 is also incorrect. Yes, I'll add the missing constraint there as well. > I have a very small request: when you use a symbol, please define it before use. For example, n_s is not defined on page 763. It should be the number of support vectors found in the problem. Indeed, I try to always define the symbols I use, but apparently I missed this one. Please tell me if you find any other missing definition. 
Thanks again! :)	Anonymous	Jul 18, 2020	Aug 14, 2020
Printed	Page 769 Sentence below Figure D-2	In the autodiff appendix, the sentence below Figure D-2 should say “To compute df/dy”, not df/dx. Note from the Author or Editor: Thanks for your feedback. Indeed, this was an error I fixed in March 2020, so hopefully the latest releases should be okay now: it should read "To compute df/dy(3,4)...", not "df/dx".	Kenny Song	Apr 20, 2020	Aug 14, 2020
Printed	Page 793 10th line from the bottom	(3rd release) "the output of the addition operation" should be "the output of the power operation". Thanks. Note from the Author or Editor: Great catch, thanks.	Haesun Park	Apr 03, 2020	Aug 14, 2020
	Page 920 Appendix A, Chapter 6, Exercise 5	In Chapter 6, under Computational Complexity: "Comparing all features on all samples at each node results in a training complexity of O(n × m log2(m))". But in Appendix A, Chapter 6 (Decision Trees), question 5: "The computational complexity of training a Decision Tree is O(n × m log(m))". Note from the Author or Editor: Thanks for your feedback. I just fixed this. Side note: O(n × m log(m)) is actually equivalent to O(n × m log_2(m)) since log_2(x) is proportional to log(x). Indeed, log_2(x) = log(x) / log(2).	LBJ6666	Mar 29, 2022
ePub, Mobi,	Page 1212 text	Current Copy takes care of load balancing and scaling for you. It take JSON requests containing the input data (e.g., of a district) Suggested "you. It take JSON requests containing" should be "you. It takes JSON requests containing" Note from the Author or Editor: Good catch, thanks. Fixed!	Anonymous	Jan 22, 2020	Mar 13, 2020
Other Digital Version	2294-2340 Chapter 3 Multiclass Classification paragraph 2 and section on SGDClassifier for multiclass classification	There is some conflicting information between the Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow book and the Scikit-Learn documentation. In chapter 3 under Multiclass Classification, the author states twice that the stochastic gradient descent classifier (SGDClassifier) can handle multiclass classification problems directly, without training multiple binary classifiers using One-vs-Rest/All. This is stated in the second paragraph as well as one or two pages later. The documentation for the SGDClassifier in Scikit-Learn directly contradicts this. It states, “SGDClassifier supports multi-class classification by combining multiple binary classifiers in a “one versus all” (OVA) scheme” (https://scikit-learn.org/stable/modules/sgd.html). Also, the statement about Logistic Regression being only a binary classifier seems to contradict the Scikit-Learn documentation as well. Using the multinomial option, the LR model can learn a true multinomial distribution for multiclass problems (https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression). Either the book or the Scikit-Learn documentation seems to be incorrect. Note from the Author or Editor: Great feedback, thanks a lot! Regarding the LogisticRegression class, the default value for the multi_class argument changed after the 2nd edition was published (in version 0.22) from 'ovr' to 'auto': so indeed, the new default multiclass behavior is to learn a true multinomial distribution (the old behavior was to train multiple binary classifiers and to use the OvR strategy). I'll update the book for future releases. Regarding the SGDClassifier class, however, it really seems to be a mistake on my part. :( I tried to search for the origin of my error, perhaps a previous version used a different approach, but it seems that the SGDClassifier behavior has been the same since at least Scikit-Learn 0.17. I'm really sorry about this, I'll update the book now for future releases. Thanks again for your contribution.	Ryan Boch	Jan 08, 2020	Mar 13, 2020
ePub	Page 8499 Chapter 12, custom metrics	In the code, the variable defined as "precision" should be named "p" to be consistent with the code that follows, i.e., >>> p = keras.metrics.Precision() ... etc. Then when you call >>> p.result(), it will work. Note from the Author or Editor: Great catch! For clarity, I decided to name the variable `precision` everywhere. Thanks for your feedback!	Mohammed El-Beltagy	Oct 27, 2019	Nov 22, 2019
ePub, Mobi,	Page 11387 Ch 15	In Exercise 10 in chapter 15, there is a bad URL that leads to a 404: “Download the Bach chorales dataset and unzip it.” The link goes to https://homl.info/bach, which is not found. Note from the Author or Editor: Thanks for your feedback, I fixed the broken URL, it works now.	Anonymous	Jan 22, 2020	Mar 13, 2020