Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

Errata for Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow



The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.




Version Location Description Submitted By Date Submitted Date Corrected
Other Digital Version
Chapter 14
Table 14.2

Missing max pooling layer between C7 and F8. 13*13*256≠4096 See the table at https://engmrk.com/alexnet-implementation-using-keras/

Note from the Author or Editor:
Great catch, thanks! Indeed, a Max Pooling layer was missing just after the last convolutional layer. The new table looks like this (in AsciiDoc format):
|=======
| Layer | Type | Maps | Size | Kernel size | Stride | Padding | Activation
| Out | Fully connected | – | 1,000 | – | – | – | Softmax
| F10 | Fully connected | – | 4,096 | – | – | – | ReLU
| F9 | Fully connected | – | 4,096 | – | – | – | ReLU
| S8 | Max pooling | 256 | 6 × 6 | 3 × 3 | 2 | `valid` | –
| C7 | Convolution | 256 | 13 × 13 | 3 × 3 | 1 | `same` | ReLU
| C6 | Convolution | 384 | 13 × 13 | 3 × 3 | 1 | `same` | ReLU
| C5 | Convolution | 384 | 13 × 13 | 3 × 3 | 1 | `same` | ReLU
| S4 | Max pooling | 256 | 13 × 13 | 3 × 3 | 2 | `valid` | –
| C3 | Convolution | 256 | 27 × 27 | 5 × 5 | 1 | `same` | ReLU
| S2 | Max pooling | 96 | 27 × 27 | 3 × 3 | 2 | `valid` | –
| C1 | Convolution | 96 | 55 × 55 | 11 × 11 | 4 | `valid` | ReLU
| In | Input | 3 (RGB) | 227 × 227 | – | – | – | –
|=======
As you can see, I added the missing max pooling layer S8. Note that I had to rename layer F8 to F9, and layer F9 to F10, including in the sentence right after the table.
Side note: if you want to use the Keras implementation at https://engmrk.com/alexnet-implementation-using-keras/, you should fix a few errors first:
* Kernel size of 2nd conv layer is 5x5, not 11x11
* Pool size is 3x3 in all max pool layers, not 2x2
* All conv layers should use SAME padding.
* AlexNet has 3 dense layers (including the output layer), not 4.
Also, I recommend using tf.keras when TF is the desired backend, instead of multi-backend Keras (i.e., you should use "from tensorflow import keras" instead of "import keras"). I wrote this corrected version: https://gist.github.com/ageron/a38c67add35ba8dfcf19bc0fa12e47f0
If you want the exact same model as the original one, you will need to add the Local Response Normalization layers, and also split the model in two as explained in the paper (to run each part on a different GPU). But of course more recent models perform better, so this is purely academic! :)
One last thing: you mention that 13*13*256≠4096. With the additional max pooling layer, we now have 6*6*256 inputs going into the first fully connected layer. You might notice that 6*6*256=9216, not 4096. That's okay: 4096 is the number of units in the layer, not the number of inputs. Thanks again for your help!
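The sizes in the corrected table can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `conv_output_size` is made up for this example, and it assumes the usual conventions that `valid` padding yields floor((n - k) / s) + 1 outputs and `same` padding yields ceil(n / s) outputs.

```python
import math

def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer (hypothetical helper)."""
    if padding == "valid":
        return (size - kernel) // stride + 1
    if padding == "same":
        return math.ceil(size / stride)
    raise ValueError(f"unknown padding: {padding!r}")

# (name, maps, kernel size, stride, padding), read bottom-up from the table
layers = [
    ("C1", 96, 11, 4, "valid"),
    ("S2", 96, 3, 2, "valid"),
    ("C3", 256, 5, 1, "same"),
    ("S4", 256, 3, 2, "valid"),
    ("C5", 384, 3, 1, "same"),
    ("C6", 384, 3, 1, "same"),
    ("C7", 256, 3, 1, "same"),
    ("S8", 256, 3, 2, "valid"),
]

size = 227  # input images are 227 x 227
sizes = {}
for name, maps, kernel, stride, padding in layers:
    size = conv_output_size(size, kernel, stride, padding)
    sizes[name] = (size, maps)
    print(f"{name}: {maps} maps of {size} x {size}")

# Number of inputs feeding the first fully connected layer (F9)
flattened_inputs = sizes["S8"][0] ** 2 * sizes["S8"][1]
print(flattened_inputs)  # 9216
```

Running it reproduces the Size column (55, 27, 27, 13, 13, 13, 13, 6) and the 6*6*256 = 9216 inputs mentioned in the note.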

Mohammed El-Beltagy  Nov 05, 2019  Nov 22, 2019
Mobi
Page ch. 14
TensorFlow Implementation (code)

outputs = tf.nn.conv2d(images, filters, strides=1, padding="same")
should be changed to
outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")

Note from the Author or Editor:
Good catch! Indeed, the `tf.nn.conv2d()` function accepts only uppercase `padding` values. `keras.layers.Conv2D` supports both uppercase and lowercase arguments, and Francois Chollet told me that the lowercase values are preferred, so I updated the whole book. I didn't realize that `tf.nn.conv2d()` was different. Thanks!

Mohammed El-Beltagy  Oct 27, 2019  Nov 22, 2019
ePub
Page ch. 14
TensorFlow Implementation, third bullet point

"stride length of 2" should be replaced by "stride length of 1" to be consistent with above code.

Note from the Author or Editor:
Great catch, it should indeed be "stride length of 1". Thanks!

Mohammed El-Beltagy  Oct 27, 2019  Nov 22, 2019
ePub
Page ch. 14
TensorFlow Implementation

For the code involving "load_sample_image", Pillow must be installed (python3 -m pip install Pillow); otherwise we will get an error. This could be added as a footnote, or in the "Create the Workspace" section in Chapter 2.

Note from the Author or Editor:
Indeed, the Pillow package is required by the `load_sample_image()` function. I added a note. Thanks!

Mohammed El-Beltagy  Oct 27, 2019  Nov 22, 2019
Safari Books Online
?
Section: Computing Gradients Using Autodiff

Super minor typo: just replace "you must call the tape’s jabobian() method" with "you must call the tape’s jacobian() method".

Thierry Herrmann  Sep 30, 2019  Oct 11, 2019
Safari Books Online
"Changes in the Second Edition," Numbered List Point 1

'covolutional' should be 'convolutional' (missing an 'n'). (I couldn't find page numbers in the Safari Books Online iPad app.)

Note from the Author or Editor:
Good catch, thanks. Fixed.

Leif Eric Fredheim  Jan 07, 2020  Mar 13, 2020
Other Digital Version
ch. 7
Code snippet before Extra-Trees section

"The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier: bag_clf = BaggingClassifier( DecisionTreeClassifier(splitter="random", max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)" splitter="random" makes this BaggingClassifier not equivalent to RandomForestClassifier since splits in RandomForestClassifier are not random, but best splits made on random subsets of features. The following snippet fixes the issue: bag_clf = BaggingClassifier( DecisionTreeClassifier(splitter="best", max_features="auto", max_leaf_nodes=16), n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1) With these parameters (and set random state) the predictions made by BaggingClassifier in 07_ensemble_learning_and_random_forests.ipynb will be identical to the predictions of RandomForestClassifier: >>> np.sum(y_pred == y_pred_rf) / len(y_pred) 1.0

Note from the Author or Editor:
Thanks for your feedback, great analysis. I updated the code example to be:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="auto", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
I left out splitter="best" since it is the default value (and the line overflow would require changing the page layout, which I try to avoid when possible).

Slava Ilin  Jan 12, 2020  Mar 13, 2020
Safari Books Online
ch. 3
Tip under Figure 3-6

The tip ends by noting that the PR curve "could be closer to the top-left corner". Assuming you're referring to Figure 3-5, does this mean the top-right corner? That curve, of course, hits the top-left corner. In either case, it's still not entirely clear to me *why* the ROC is more affected by skewed data. Perhaps this tip could be expanded.

Peter Drake  Feb 26, 2020  Mar 13, 2020
Other Digital Version
Ch 3 (code)
Cell 24

When I run all of the cells up through from sklearn.metrics import precision_score, recall_score precision_score(y_train_5, y_train_pred) in the Jupyter notebook from GitHub (running on Colab), I get 0.837..., not the 0.729... shown in the (Safari) book. I believe the problem occurs at least as early as cell 22 (the confusion matrix two cells earlier), which gives: array([[53892, 687], [ 1891, 3530]]) rather than the 53057, 1522, 1325, 4096 shown in the book. This makes cell 25, 4096 / (4096 + 1522) rather mysterious, as the numbers 4096 and 1522 now seem to come out of nowhere.

Note from the Author or Editor:
Thanks for your feedback. Indeed, making the code perfectly reproducible for several years turns out to be quite a challenge! Every time a new version of Scikit-Learn (or NumPy, Keras, TensorFlow, Matplotlib, Pandas) is released, I have to check all the notebooks to ensure they still produce the same output. The most common source of changes is when the default value of some hyperparameter is modified. For example, if the default number of iterations changes, then all the results change. I managed to keep up with this up to now by explicitly setting some of the hyperparameter values to their old default value (or in some cases, to their new default value, when they were announced in advance). You'll see some comments about this in the notebooks. Unfortunately, sometimes the algorithms themselves get tweaked slightly, and there's really nothing I can do about that. I was fortunate enough to be mostly spared by this problem for the 1st edition, but my luck ran out:
* Scikit-Learn 0.21 fixed some bug in SGDClassifier (and many other models), so models now produce slightly different results (see https://scikit-learn.org/0.21/whats_new.html#id6). This happened a couple months after I had finished writing the book, and it was off to press.
* As if this wasn't enough, TensorFlow 2.1 completely changed the way it generates random numbers, compared to TensorFlow 2.0. So pretty much all TensorFlow models give slightly different results now, and there's no going back.
The only way to reproduce the exact results from the book is to revert to previous versions of Scikit-Learn and TensorFlow. However, I don't recommend this solution. It's preferable to just accept the fact that there will be (hopefully small) differences between the text and the results you get. In the short term, I'll add warnings to the Jupyter notebooks to explain that the results might differ slightly from the book (and explain why). Then when I have time, I'll run all the notebooks using the latest version of all libraries, and I'll update all the code examples in the book that need to be changed. Oh wow... This book is so much work... sigh... ;-) Thanks again for your help.
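To see where 0.729... and 0.837... come from, both precision values can be recomputed directly from the two confusion matrices quoted in this report. This is a small sketch in plain Python, assuming the Scikit-Learn layout [[TN, FP], [FN, TP]]; the helper names are chosen for this example only.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)

# Confusion matrices laid out as [[TN, FP], [FN, TP]], values quoted above
book = [[53057, 1522], [1325, 4096]]
colab = [[53892, 687], [1891, 3530]]

for label, ((tn, fp), (fn, tp)) in [("book", book), ("colab", colab)]:
    print(label, round(precision(tp, fp), 3), round(recall(tp, fn), 3))
```

The "book" matrix gives the 4096 / (4096 + 1522) from cell 25, while the newer matrix gives 3530 / (3530 + 687).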

Peter Drake  Feb 26, 2020  Mar 13, 2020
Safari Books Online
ch 10
Second bullet under "Creating the model using the sequential API"

You say that if Flatten "receives input data X, it computes X.reshape(-1, 1)". If applied to an individual data point (e.g., a Fashion MNIST image), wouldn't this turn the image into a column vector? Don't we want (1, -1) or, better yet, (-1,), to turn it into a row? This situation gets even more complicated if X is an entire input set, which is of shape (60000, 28, 28) in the Fashion MNIST example. We'd like it to end up (60000, 784), right? I can't see how (-1, 1) would do that.

Note from the Author or Editor:
Thanks a lot for your feedback. Indeed, this is an error. I should have written: "receives input data X, it computes X.reshape(-1, 28*28)". Fixed, thanks again!
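The effect of the corrected expression on shapes can be illustrated without NumPy. The sketch below is a plain-Python stand-in for X.reshape(-1, 28*28) on a batch of 28 × 28 images; `flatten_batch` is a hypothetical helper, not part of Keras.

```python
def flatten_batch(batch):
    """Turn a batch of 2D images (nested lists) into one flat row per image."""
    return [[pixel for row in image for pixel in row] for image in batch]

# Two dummy 28 x 28 "images", one filled with zeros and one with ones
batch = [[[value] * 28 for _ in range(28)] for value in (0, 1)]
flat = flatten_batch(batch)

print(len(flat), len(flat[0]))  # 2 784
```

The batch dimension is preserved and each image becomes a row of 784 values, which is what (60000, 28, 28) to (60000, 784) means for the full training set.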

Peter Drake  Mar 17, 2020  Aug 14, 2020
Safari Books Online
Ch12
Below the walkthrough of custom loop

Below the walkthrough of the custom loop, it says "If you set the optimizer’s clipnorm or clipvalue hyperparameter, it will take care of this for you." I'm not sure whether "this" refers to the custom loop or to the clipping. A little more explanation would help here.

Note from the Author or Editor:
Thanks for your feedback. Indeed, this sentence was not very clear. I replaced it with this sentence: """ If you want to apply Gradient Clipping (see Chapter 11), just set the optimizer's `clipnorm` or `clipvalue` hyperparameter. """ This works both when using model.fit() and when writing a custom loop. If you need any other transformation of the gradients when writing a custom loop, just modify the gradients before calling apply_gradients(). Thanks again.
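For readers unsure what the two kinds of clipping do, here is a minimal sketch on a single gradient vector, written in plain Python. It is not the Keras implementation: the function names are invented for this example, and the gradient values are arbitrary.

```python
import math

def clip_by_value(grads, clipvalue):
    """Clip every component independently to [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in grads]

def clip_by_norm(grads, clipnorm):
    """Rescale the whole vector if its L2 norm exceeds clipnorm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clipnorm:
        return list(grads)
    return [g * clipnorm / norm for g in grads]

grads = [0.9, 100.0]  # arbitrary example gradient
print(clip_by_value(grads, 1.0))  # [0.9, 1.0]
print(clip_by_norm(grads, 1.0))   # same direction, norm 1
```

Clipping by value can change the direction of the gradient vector, whereas clipping by norm preserves it.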

Chih  Apr 10, 2020  Aug 14, 2020
Safari Books Online
Ch13
Putting Everything Together

Figure 13-2 says that repeat() is called right after list_files(). But in the code block, repeat() is called after shuffle(). I know the effect of calling repeat() on a shuffled dataset is mentioned earlier in that chapter. Does the difference between the figure and the code matter?

Note from the Author or Editor:
Thanks for your feedback. Indeed, there's a mismatch between Figure 13-2 and the code. It's a bit more common to place the repeat() step after the shuffle() step (as in the code). I'm not sure why I placed it in the wrong position (note that shuffle() and map() are also reversed, ooooh dear). I'll fix the figure to match the code. Note that there is a small difference between repeat().shuffle(...) and shuffle(...).repeat(). This is best explained with an example:
>>> import tensorflow as tf
>>> [i.numpy() for i in tf.data.Dataset.range(4).repeat(2).shuffle(4)]
[2, 0, 3, 2, 3, 1, 1, 0]
>>> [i.numpy() for i in tf.data.Dataset.range(4).shuffle(4).repeat(2)]
[0, 2, 3, 1, 0, 1, 3, 2]
Notice that in the first case, the number 2 is repeated twice before the number 1 appears. In the second case, the first 4 elements will always include 0, 1, 2, 3. Thanks again!
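The same behavior can be explored without TensorFlow. The following is a simplified simulation of a shuffle buffer in plain Python; it is an assumption-laden sketch (the function `buffered_shuffle` is made up here and does not reproduce tf.data's exact random sequence), but it shows why the order of the two steps matters.

```python
import itertools
import random

def buffered_shuffle(items, buffer_size, rng):
    """Simplified shuffle buffer: emit a random buffered element, then refill."""
    iterator = iter(items)
    buffer = list(itertools.islice(iterator, buffer_size))
    for item in iterator:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = item
    rng.shuffle(buffer)  # drain what is left in random order
    yield from buffer

rng = random.Random(42)
data = range(4)

# Like repeat(2).shuffle(4): both epochs are concatenated first, then shuffled
repeat_then_shuffle = list(buffered_shuffle(list(data) * 2, 4, rng))

# Like shuffle(4).repeat(2): each epoch is shuffled on its own
shuffle_then_repeat = [x for _ in range(2) for x in buffered_shuffle(data, 4, rng)]

print(repeat_then_shuffle)
print(shuffle_then_repeat)
```

In the second list, each half is always a permutation of 0 to 3, so epoch boundaries are preserved; in the first, elements from different epochs can mix.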

Anonymous  Apr 14, 2020  Aug 14, 2020
Safari Books Online
Ch16
Last line of the paragraph below Figure 16-9

The line says "the model would not be able to distinguish positions p = 25 and p = 35 (marked by a cross)." I think it should be p = 22 and p = 35

Note from the Author or Editor:
Good catch, thanks.

Anonymous  May 04, 2020  Aug 14, 2020
Printed
Page xvii
3rd paragraph

The 3rd paragraph currently ends with the following: --in particular (hNumPy, pandas, and Matplotlib. There are two characters that are out of place "(h". It should be rewritten in one of the following two ways: (in particular, NumPy, Pandas, and Matplotlib). or --in particular, NumPy, pandas, and Matplotlib.

Note from the Author or Editor:
Thanks for your feedback. That's strange, I don't see this issue in my copy of the book (1st release of the 2nd edition). The source code (in AsciiDoc) for this paragraph is: """ This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries—in particular, http://numpy.org/[NumPy], http://pandas.pydata.org/[pandas], and http://matplotlib.org/[Matplotlib]. """ In printed copies, this should render as: """ This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries—in particular, NumPy (http://numpy.org/), pandas (http://pandas.pydata.org/), and Matplotlib (http://matplotlib.org/). """ This is exactly what I'm seeing in my printed copy. In electronic versions, you should see this: """ This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries—in particular, NumPy, pandas, and Matplotlib. """ In the "Version of product where error was found", you selected "Printed", but the text you are seeing looks like it's from the electronic version. Could you please confirm the version of the product (printed, ePub, etc.), and also specify which release you have? The release number can be found on the page immediately before the table of contents. Thank you. **EDIT** Apparently this typo was introduced during the production phase of one of the earlier releases, but it was quickly fixed. Sorry for the inconvenience.

Steve Anderson  Sep 17, 2020 
Safari Books Online
ch 10
In the paragraph just before Figure 10-9.

"so" seems a typo in the first sentence : If each instance can belong only so a single class, out of 3 or more possible classes...

Note from the Author or Editor:
Nice catch, I just fixed this typo, thanks a lot.

Ami Ka  Apr 10, 2019  Sep 05, 2019
Safari Books Online
ch 10
under "COMPILING THE MODEL"

It seems "sigmoid_crossentropy" is mistakenly used instead of "binary_crossentropy" in this sentence: If we were doing binary classification (with one or more binary labels), then we would use the "sigmoid" (i.e., logistic) activation function in the output layer instead of the "softmax" activation function, and we would use the "sigmoid_crossentropy" loss.

Note from the Author or Editor:
Good catch, thanks a lot, I just fixed this.

Ami Ka  Apr 11, 2019  Sep 05, 2019
Safari Books Online
ch 11
before Unsupervised Pretraining

In a parenthetical, "(which may be due to shear luck)": "shear luck" should be "sheer luck".

Note from the Author or Editor:
Indeed, it should be sheer instead of shear, thanks!

Ami Ka  Apr 28, 2019  Sep 05, 2019
Safari Books Online
ch 11
Avoiding Overfitting Through Regularization>Learning Rate Scheduling>Power scheduling

Probably "k" in the formula should be replaced by "s". Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/k)^c. The initial learning rate η0, the power c (typically set to 1) and the steps s are hyperparameters. The learning rate drops at each step, and after s steps it is down to η0 / 2. After s more steps, it is down to η0 / 3. Then down to η0 / 4, then η0 / 5, and so on. As you can see, this schedule first drops quickly, then more and more slowly. Of course, this requires tuning η0, s (and possibly c).

Note from the Author or Editor:
Great catch, thanks! Indeed, it should be η(t) = η0 / (1 + t/s)^c
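The corrected formula is easy to check numerically. In this small sketch, the default values for η0 and s are arbitrary choices for illustration, not values from the book.

```python
def power_schedule(t, eta0=0.01, s=20, c=1):
    """Learning rate at iteration t: eta0 / (1 + t/s)**c."""
    return eta0 / (1 + t / s) ** c

# With c = 1, the rate is eta0/2 after s steps, eta0/3 after 2s steps, and so on
for t in (0, 20, 40, 60):
    print(t, power_schedule(t))
```

With c = 1 this reproduces the sequence η0, η0 / 2, η0 / 3, η0 / 4 described in the text, one drop every s steps.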

Ami Ka  May 01, 2019  Sep 05, 2019
Safari Books Online
ch 11
Dropout>Note

"However, it you double it, inference time will also be doubled." should be "However, if you double it, inference time will also be doubled."

Note from the Author or Editor:
Thanks a lot, indeed, it's a typo. I just fixed it: should be "if you double" rather than "it you double".

Ami Ka  May 07, 2019  Sep 05, 2019
Safari Books Online
ch 14
Convolutional Layer>TensorFlow Implementation>padding

Where "same" padding is explained (in parentheses): ...In this case, the number of output neurons is equal to the number of input neurons divided by the stride, rounded up (in this example, 13 / 5 = 2.6, rounded up to 3). The example mentioned in the parentheses doesn't give any number for the input size, and the stride is 1. Probably you meant the next example, in Figure 14-7.

Note from the Author or Editor:
Great catch, thanks! I changed the bullet point like this: If set to "same", the convolutional layer uses zero padding if necessary. The output size is set to the number of input neurons divided by the stride, rounded up. For example, if the input size is 13 and the stride is 5 (see Figure 14-7), then the output size is 3 (i.e., 13 / 5 = 2.6, rounded up to 3). Then zeros are added as evenly as possible around the inputs, as needed. When `strides=1`, the layer's outputs will have the same spatial dimensions (width and height) as its inputs, hence the name _same_. Cheers, Aurélien
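The rule in the revised bullet point can be sketched in a few lines of Python. This assumes the convention used by TensorFlow for "same" padding (output size ceil(n / s), with any odd extra zero placed at the end); the kernel sizes below are assumptions chosen for illustration, since the errata entry does not state them.

```python
import math

def same_padding(input_size, kernel_size, stride):
    """Return (output size, zeros before, zeros after) for "same" padding."""
    output_size = math.ceil(input_size / stride)
    total = max((output_size - 1) * stride + kernel_size - input_size, 0)
    before = total // 2
    after = total - before
    return output_size, before, after

print(same_padding(13, 6, 5))  # (3, 1, 2): 13 / 5 = 2.6, rounded up to 3
print(same_padding(13, 3, 1))  # (13, 1, 1): with stride 1, output size = input size
```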

Ami Ka  Jun 28, 2019  Sep 05, 2019
Safari Books Online
Ch10
Above Figure 10-15

Inputs A and B, shape attributes are wrong (should be 6, 5 not 5, 6)

Note from the Author or Editor:
Great catch, thanks! The problem was in the previous sentence, it was: "For example, suppose we want to send five features through the deep path (features 0 to 4), and six features through the wide path (features 2 to 7):" but the words "deep" and "wide" should have been reversed: "For example, suppose we want to send five features through the wide path (features 0 to 4), and six features through the deep path (features 2 to 7):" Thanks again, Aurélien

MNK  Jul 03, 2019  Sep 05, 2019
Safari Books Online
Ch 16
Figure 16-9.

"Sine/cosine positional embedding matrix (transposed, bottom) and a focus on two values of i (top)". I think "bottom" and "top" are switched.

Note from the Author or Editor:
Great catch, indeed they were. I just fixed this, thanks a lot!

Christopher Akiki  Jul 04, 2019  Sep 05, 2019
ePub
Page Ch14
CNN to tackle Fashion MNIST

padding='same'), instead of: padding='same',),

Note from the Author or Editor:
Good catch, not sure why there were extra commas there, I'm guessing I changed the order of the arguments. Side-note: as you may know, having a comma before the closing parenthesis is actually valid Python code (ugly code, but valid). It's even required for tuples with a single element, such as (42,). I also use this in lists, tuples or argument lists spanning multiple lines, such as this:
a = (
    "apples",
    "cherries",
    "bananas",
)
This makes it easier to move lines around without getting syntax errors. But ...padding='same',) really does not make much sense. Cheers, Aurélien
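A quick way to convince yourself of the trailing-comma rules described in this note (the variable names are arbitrary):

```python
single = (42,)      # the trailing comma makes this a one-element tuple
not_a_tuple = (42)  # without it, the parentheses are just grouping

fruits = (
    "apples",
    "cherries",
    "bananas",  # trailing comma: lines can be reordered without syntax errors
)

print(type(single).__name__, type(not_a_tuple).__name__, len(fruits))  # tuple int 3
```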

MNK  Jul 05, 2019  Sep 05, 2019
Safari Books Online
Chapter 9
Paragraph before Figure 9.1

" This is where clustering algorithms step in: many of them can easily detect the top-left cluster. It is also quite easy to see with our own eyes, but it is not so obvious that the lower-right cluster is composed of two distinct sub-clusters." This description DOES NOT MATCH THE FIGURE. The TOP-RIGHT CLUSTER has two distinct sub-groups and the LOWER-LEFT CLUSTER easily stands out by itself. So as written, the text has a VERY confusing lack of correspondence with the figure.

Note from the Author or Editor:
Great catch, thanks a lot! Indeed, it should say "lower-left cluster" and "upper-right cluster", respectively. Here's the full correct sentence: This is where clustering algorithms step in: many of them can easily detect the lower-left cluster. It is also quite easy to see with our own eyes, but it is not so obvious that the upper-right cluster is composed of two distinct sub-clusters. Thanks again! Aurélien

Jim Lewis  Aug 11, 2019  Sep 05, 2019
Safari Books Online
1
First line.

First sentence reads... "When most people hear 'Machine Learning,' they picture a robot: a dependable butler or a deadly Terminator, depending on who you ask." It's not "...who you ask," it's "... whom you ask." Should use proper English, at least in the very first sentence of the book. You would not say "You ask he," you'd say "You ask him."

Note from the Author or Editor:
Thanks for your feedback. As you might know, I am French, so please forgive my English mistakes. The he/him rule is very helpful. It's interesting that no one pointed out this error to me before, even though it's in the very first sentence! :) I think it goes to show that people are getting used to this mistake, to the point that many people on the Web seem to argue that "whom" now sounds too formal. Perhaps in a few decades it will no longer be considered a mistake. That said, of course, I've fixed the book now, thanks again!

Anonymous  Mar 21, 2020  Aug 14, 2020
Safari Books Online
1
Chapter 3 - Threshold test

The following code is used to describe the effect of threshold adjustments on the recall. >>> threshold = 8000 >>> y_some_digit_pred = (y_scores > threshold) >>> y_some_digit_pred array([ True]) The result should be array([False]), as indicated on the GitHub project: https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb An output of 'array([ True])' would indicate that adjusting the threshold had no impact on the recall.

Note from the Author or Editor:
Great catch! Indeed, this was a copy/paste error, thanks for spotting it, I just fixed the book, the fix will be in the next release. I wrote a script that verifies that all the code examples in the book are present in the notebook, but right now it does not look at the outputs, I'll fix that. Thanks again! Aurélien
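The logic behind the expected output can be shown with plain Python. The score below is a hypothetical stand-in for the classifier's decision score (the actual value is not quoted in this entry), and `predict_with_threshold` is a name made up for the example.

```python
y_scores = [2164.22]  # hypothetical decision score for a single instance

def predict_with_threshold(scores, threshold):
    """Predict positive only when the score exceeds the threshold."""
    return [score > threshold for score in scores]

print(predict_with_threshold(y_scores, 0))     # [True]
print(predict_with_threshold(y_scores, 8000))  # [False]
```

Any score below 8000 flips from True to False when the threshold is raised, which is why the corrected output is array([False]) and why raising the threshold reduces recall.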

Hussein Khalil  Mar 25, 2019  Sep 05, 2019
Safari Books Online
3
Chapter 3. Classification / Confusion Matrix / Equation 3-1. Precision

Sorry about my language. In Chapter 3. Classification / Confusion Matrix / Equation 3-1. Precision and Equation 3-2. Recall and Equation 3-3. F1 I do not see the division sign. Can you check all equations?

Note from the Author or Editor:
Thanks for your feedback. I'm guessing you are reading the book on the Safari Platform using the Chrome browser. Unfortunately, Chrome stopped supporting MathML, so the equations don't display properly. O'Reilly is working on fixing this, and I asked them to add a message to warn users. In the meantime you can work around this issue by using another browser: Firefox or Safari. Thanks for your understanding. 10/18/2019: the issue is now fixed in Chrome.

Alexander Morozov  Oct 16, 2019  Oct 18, 2019
PDF
Page 14
First paragraph - First line

an additional "ag" next to "is" : "Reinforcement Learning isag a very" -> "Reinforcement Learning is a very"

Note from the Author or Editor:
Good catch, thanks. I fixed this typo, it should be fine now in the electronic versions, and it will be correct in the 2nd release of the book (printed in October).

Safouane Chergui  Oct 07, 2019  Oct 18, 2019
PDF
Page 30
Bullet pt listing in "Underfitting the Training Data" section

The list of methods to counter underfitting is in plain text, while the analogous list with regard to overfitting in the previous section was highlighted in a warning/caution frame; might want to adjust.

Note from the Author or Editor:
Thanks, good point. I'll change the underfitting section to use a warning frame.

Hieronim Kubica  May 31, 2019  Sep 05, 2019
PDF
Page 47
End of virtualenv box

This is an error of omission. If we are going to be using Jupyter in a virtual environment, then we must also set up Jupyter to use the libraries associated with said environment. This requires the following two steps:
$ python3 -m pip install -U ipykernel
$ python3 -m ipykernel install --user --name=my_env
After that, when starting Jupyter, you can select "my_env" and start working in that environment.

Note from the Author or Editor:
Thanks Mohammed, great catch! Since the ipykernel package is installed automatically along with jupyter, the first command is not required, but the second is important (at least if you plan to have more than one virtualenv, which is the whole point). I updated the book like this:
--------------------------------------------
$ python3 -m pip install -U jupyter matplotlib numpy pandas scipy scikit-learn
Collecting jupyter
  Downloading https://[...]/jupyter-1.0.0-py2.py3-none-any.whl
Collecting matplotlib
[...]
If you created a virtualenv, you need to register it to Jupyter and give it a name:
$ python3 -m ipykernel install --user --name=python3
Now you can fire up Jupyter by typing the following command:
$ jupyter notebook
[...] Serving notebooks from local directory: [...]/ml
[...] The Jupyter Notebook is running at:
[...] http://localhost:8888/?token=60995e108e44ac8d8865a[...]
[...] or http://127.0.0.1:8889/?token=60995e108e44ac8d8865a[...]
[...] Use Control-C to stop this server and shut down all kernels [...]
--------------------------------------------
Notice that I removed this section:
--------------------------------------------
To check your installation, try to import every module like this:
$ python3 -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"
There should be no output and no error.
--------------------------------------------
This is because I didn't want the layout of the book to be affected too much, and this paragraph is not necessary since users will notice if there are errors in the previous steps. Again, thanks a lot for your great feedback!

Mohammed El Beltagy  Oct 15, 2019  Nov 22, 2019
Printed
Page 86
Last line

Just a tiny detail here. There is an "import" command missing before the last instruction of the page. NumPy was not loaded yet.

Note from the Author or Editor:
Good catch, thanks. In later chapters I did not repeat all the imports, because I thought it was redundant (after a while, I assume the reader understands what np stands for and how to import it), but in the earlier chapters, it's useful to spell everything out. Fixed. :)

Bruno Machado  Apr 02, 2020  Aug 14, 2020
Printed
Page 138
2nd paragraph

It says: "... the dashed line in the righthand plot in Figure 4-18 (with alpha = 10^-7) looks quadratic, almost linear." Actually, it does not look quadratic (maybe cubic?). Also, it is quite disputable that it looks "almost linear".

Note from the Author or Editor:
Indeed, good catch! You just made me realize that this figure changed slightly between the first edition and the second edition of the book, probably because of slight tweaks in Scikit-Learn's algorithms. Here is what the figure looks like in the first edition: https://snipboard.io/fBgiRw.jpg I've fixed the sentence to say "looks roughly cubic". Thanks again!

Ian Beauregard  Aug 13, 2020  Aug 14, 2020
Printed
Page 143
Eq 4-13

(3rd release) In Eq 4-13, the bottom line of p143, and Eq 4-19, x^T \theta^{(k)} is used. But to match the order of theta and x in other places, I suggest (\theta^{(k)})^T x or \theta^T x. Thanks

Note from the Author or Editor:
Thanks for your suggestion, I fixed the 3 instances you pointed out. FYI, I hesitated between "x^T theta" and "theta^T x" because the first linear equation in chapter 1 is written y = theta0 x0 + theta1 x1 + ..., which naturally translates to y = theta^T x. It would be weird to write y = x0 theta0 + x1 theta1 + ... However, when dealing with matrices, one typically writes y = X W: here, X has to appear first (and there's no transpose), because each row of X already corresponds to a transposed feature vector. I remember being confused the first time I saw this, so I wanted to quickly transition from theta-first to X-first. However, I was not careful enough, so I ended up having a confusing mixture of both! Oops... I think you're right that consistently using theta-first before we really tackle matrices is probably better.
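The two conventions discussed in this note give the same numbers, as a tiny plain-Python sketch shows. The values of theta and X are arbitrary, and `dot` is a helper written for this example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

theta = [0.5, 2.0, -1.0]   # arbitrary parameter vector
x = [1.0, 3.0, 4.0]        # one instance (x0 = 1 is the bias feature)
X = [x, [1.0, 0.0, 2.0]]   # each row of X is already a transposed instance

# For a single instance, theta^T x and x^T theta are the same scalar
assert dot(theta, x) == dot(x, theta)

# For a whole matrix, X comes first: one prediction per row
print([dot(row, theta) for row in X])  # [2.5, -1.5]
```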

Haesun Park  Mar 03, 2020  Aug 14, 2020
Printed
Page 158
Last sentence

The book says : "The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials." I think it should say high-degree and low-degree TERMS instead of polynomials.

Note from the Author or Editor:
Good catch, thanks. I changed that sentence to: """ The hyperparameter `coef0` controls how much the model is influenced by high-degree terms versus low-degree terms. """
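One way to see why "terms" is the right word: assuming the polynomial kernel has the form (γ·u + coef0)^d with u = aᵀb, which is how Scikit-Learn documents it, the binomial expansion gives one term per power of u. The sketch below lists the weight of each term; `poly_term_weights` is a hypothetical helper and the parameter values are arbitrary.

```python
from math import comb

def poly_term_weights(degree, gamma, coef0):
    """Weight of each power of u in the expansion of (gamma*u + coef0)**degree."""
    return {k: comb(degree, k) * gamma**k * coef0 ** (degree - k)
            for k in range(degree + 1)}

print(poly_term_weights(3, 1, 0))   # {0: 0, 1: 0, 2: 0, 3: 1}
print(poly_term_weights(3, 1, 1))   # {0: 1, 1: 3, 2: 3, 3: 1}
print(poly_term_weights(3, 1, 10))  # {0: 1000, 1: 300, 2: 30, 3: 1}
```

With coef0 = 0 only the highest-degree term survives, and changing coef0 shifts the relative weight between the low-degree and high-degree terms.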

Ian Beauregard  Aug 15, 2020  Sep 18, 2020
Printed
Page 161
1st paragraph, above the figure

In chapter 5, pages 160 and 161, it says: So γ acts like a regularization hyperparameter: if your model is overfitting, you should reduce it, and if it is underfitting, you should increase it (similar to the C hyperparameter). As far as I know, to avoid overfitting, we must apply limitations to the method (increasing regularization) and vice-versa. It is also stated in the solution of exercise 9 in chapter 4.

Note from the Author or Editor:
Thanks for your feedback. By "regularization hyperparameter", I just meant that it is a hyperparameter that lets you control regularization. Perhaps for more clarity I should have said that it is a "reverse regularization hyperparameter", since reducing it increases regularization. I'll update the book.
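The direction of the effect can be checked on the Gaussian RBF function itself, assuming the usual form exp(-γ‖x - ℓ‖²). This is a one-dimensional sketch with arbitrary values, not the SVC implementation.

```python
import math

def gaussian_rbf(x, landmark, gamma):
    """Similarity between x and a landmark; equals 1 at the landmark itself."""
    return math.exp(-gamma * (x - landmark) ** 2)

for gamma in (0.1, 5.0):
    similarities = [round(gaussian_rbf(x, 0.0, gamma), 3) for x in (0.0, 0.5, 1.0, 2.0)]
    print(gamma, similarities)
```

A larger γ makes the bell narrower, so each instance influences a smaller region and the decision boundary can become more irregular. That is why reducing γ acts like increasing regularization.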

Sajjad  Jan 22, 2020  Mar 13, 2020
PDF
Page 165
Under Equation 5-2

The following sentence: "Figure 5-12 shows the decision function that corresponds to the model in the LEFT in Figure 5-4" should be: "Figure 5-12 shows the decision function that corresponds to the model in the RIGHT in Figure 5-4". This can be confirmed in the corresponding Jupyter notebook (https://github.com/ageron/handson-ml/blob/master/05_support_vector_machines.ipynb, Input #10 and #31), both of which use the same variable name "svm_clf2".

Note from the Author or Editor:
Good catch, thanks. Indeed, it should be "right" instead of "left".

Nathan Young  Jun 15, 2020  Aug 14, 2020
Printed, Safari Books Online
Page 173
First sentence at the top of the page, right underneath Equation 5-13.

After presenting Equation 5-13 (labelled: "Linear SVM classifier cost function"), the paragraph reads as follows: "The first sum in the cost function will push the model to have a small weight vector w, leading to a larger margin. The second sum computes the total of all margin violations." In the equation, there is only one summation. I believe what is meant to be said is that the first "term" of the cost function is responsible for the margin, and the second "term" of the cost function (which is the summation) is responsible for minimizing margin violations. When you refer to them as "first sum" and "second sum" it makes one think there should be two summations in the equation. Thank you!

Note from the Author or Editor:
Thanks for your feedback. I think I wrote "first sum" and "second sum" because in my mind the first term (1/2 w^T w) is actually a summation, since it is equal to 1/2 * (w_1^2 + w_2^2 + w_3^2 + ... + w_n^2). It's half of the sum of squares of the elements of w. But I agree that it's really not clear right now, so I'll write "first term" and "second term" instead, thanks again!
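To make the "first term" versus "second term" distinction concrete, here is a minimal sketch of the hinge-loss form of the linear SVM cost, computed in plain Python. The function name `linear_svm_cost` and the toy data are assumptions made for illustration; targets are assumed to be encoded as +1 / -1.

```python
def linear_svm_cost(w, b, X, t, C):
    """Return the two terms of the linear SVM cost separately (illustrative helper)."""
    # First term: 1/2 * w^T w, i.e. half the sum of the squared weights.
    margin_term = 0.5 * sum(w_j ** 2 for w_j in w)
    # Second term: C times the total of all margin violations (hinge loss).
    violation_term = C * sum(
        max(0.0, 1.0 - t_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b))
        for x_i, t_i in zip(X, t)
    )
    return margin_term, violation_term


w = [2.0, -1.0]
b = 0.5
X = [[1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
t = [1, -1, -1]  # assumed +1 / -1 encoding of the two classes
margin_term, violation_term = linear_svm_cost(w, b, X, t, C=1.0)
print(margin_term, violation_term)
```

Only the second instance sits inside the margin in this toy example, so it is the only one contributing to the second term.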

AJ  Nov 09, 2020 
Printed
Page 197
1st paragraph in "Random Forests". 2nd sentence

The sentence reads "Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the RandomForest classifier class, [..]" The word instead is used twice in the same sentence. It should probably read "Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForest classifier class, [..]"

Note from the Author or Editor:
Good catch, thanks. Instead of two insteads, I prefer a single one instead. ;-)

Ricardo Blasco  Nov 19, 2020 
Printed
Page 203
4th Paragraph

In this paragraph, it says: "Let's go through a simple regression example, using ..... (of course, Gradient Boosting also works great with regression tasks)." Instead of "regression tasks" (in the parentheses), it should probably say "classification tasks". Thanks!

Note from the Author or Editor:
Good catch, thanks, that's what I meant. Fixed. :)

AJ  Nov 19, 2020 
PDF
Page 211
First paragraph

brew is deprecated and its github repo recommends DESlib as an alternative (https://github.com/scikit-learn-contrib/DESlib)

Note from the Author or Editor:
Thanks for your feedback, indeed brew is deprecated and DESlib looks like a great replacement. I updated the book, hopefully the change will make it to the 2nd release (printed this week), or else it will be the 3rd release.

Safouane Chergui  Oct 13, 2019  Oct 11, 2019
PDF
Page 245
second black dot

There should be a division sign "/" between D(x(i))2 and sum_{j=1}^{m} D(x(j))2 in the K-Means++ initialization algorithm, i.e., the probability should read D(x^{(i)})^2 / \sum_{j=1}^{m} D(x^{(j)})^2.

Note from the Author or Editor:
Great catch, thanks. This was a latexmath rendering issue, I just fixed it.
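The role of the missing division sign can be seen in a short sketch of the K-Means++ selection probabilities: each instance is picked as the next centroid with probability D(x(i))² divided by the sum of D(x(j))² over all instances, where D is the distance to the closest centroid already chosen. The helper name and the toy points below are assumptions for illustration only.

```python
def kmeanspp_probabilities(points, centroids):
    """Probability of picking each point as the next centroid (illustrative helper)."""
    def squared_distance_to_closest(point):
        return min(
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        )

    d2 = [squared_distance_to_closest(point) for point in points]
    total = sum(d2)
    # The division normalizes the squared distances into a probability distribution.
    return [value / total for value in d2]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
centroids = [(0.0, 0.0)]  # one centroid already chosen
probs = kmeanspp_probabilities(points, centroids)
print(probs)
```

Without the division, the values would just be raw squared distances and would not sum to 1.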

Hao  May 20, 2019  Sep 05, 2019
Other Digital Version
251-252
first paragraph

Chapter 9 "Using Clustering for Preprocessing" talks about clustering as an efficient approach to dimensionality reduction. With the example chosen, without performing a preclustering on the training data, each data has 64 features. If we perform a preclustering (via a pipeline) with 50 clusters, this is effectively a dimensionality reduction as 50<64. But at the end of the section, if we eventually keep k = 99, can we still speak of a dimensionality reduction? However, I recognize that the accuracy gets better.

Note from the Author or Editor:
Thanks for your feedback. Indeed, you're absolutely right: it's not dimensionality reduction anymore if we keep k=99 while the original dimensionality was 64. :/ This section definitely deserved a bit of clarification, so I changed the introduction from: """ Clustering can be an efficient approach to dimensionality reduction, in particular as a preprocessing step before a supervised learning algorithm. """ to: """ Clustering can be an efficient preprocessing step before a supervised learning algorithm. """ Then later in the section, right after the sentence "How about that? We reduced the error rate by almost 30% (from about 3.1% to about 2.2%)!", I added the following sentence: """ The clustering step reduces the dataset's dimensionality (from 64 to 50 dimensions), but the performance boost comes mostly from the fact that the transformed dataset is closer to being linearly separable than the original dataset, and therefore it is much easier to tackle with Logistic Regression. """ And I removed "But" in "But we chose the number of clusters k arbitrarily". Hopefully this will be much clearer. Thanks again for your helpful feedback.

Olivier Lourme  Jun 25, 2020  Aug 14, 2020
Printed
Page 285
Ch 10, page 285, last phrase

In the book it is said that a Perceptron with two inputs and three outputs with a step function is a multioutput classifier. I think this Perceptron is a multilabel classifier, since each output is binary rather than a number.

Note from the Author or Editor:
Good catch! Indeed, the sentence should be: """ This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multilabel classifier. """ Thanks a lot!

Chakib BELAFDIL  Sep 04, 2020  Sep 18, 2020
Printed
Page 286
Equation 10-2

In this equation, the argument to phi is written as XW + b. Here the product will have m rows (one for each instance) and n_out columns (one for each AN in the output layer). It seems to me that the addition in this expression can only be correctly understood in the context of (something like) NumPy broadcasting rules which will operate on b so that it's the same shape as the result of the XW product. Since this isn't written as part of a code snippet, I suggest adding something to the third bullet of the explanation of the equation to make it clear what's going on. Somewhat similarly, the application of phi to a matrix with shape (m, n_out) to get another (m, n_out) matrix is pretty clear in the context of NumPy code, but less clear here. Maybe something like "Here \phi is being applied to each element separately." could be a good addition to the 4th bullet? Thanks for a terrific book!

Note from the Author or Editor:
Thanks for your feedback. I thought I explained broadcasting earlier in the book, but I couldn't find where. The only mentions I found are in Appendix A (in the solution to exercise 10.6) and in chapter 16 (when discussing Positional Encoding). So I added a footnote when introducing the bias vector *b*. I wrote this: "In mathematics, the sum of a matrix (*XW*) and a vector (*b*) is undefined. However, in Data Science, we allow "broadcasting": we add the vector to every row in the matrix." Thanks again!
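As a rough illustration of both points raised here (adding b to every row of XW, and applying the activation to each element separately), the following sketch spells out the computation with plain Python lists instead of relying on broadcasting. The function name `dense_layer`, the step activation, and the numbers are assumptions chosen for the example (two inputs, three outputs).

```python
def dense_layer(X, W, b, phi):
    """Compute phi(XW + b) row by row, without relying on broadcasting."""
    outputs = []
    for x in X:  # one row per instance
        row = []
        for j in range(len(b)):  # one column per output neuron
            # Same bias b[j] is added for every instance: this is what broadcasting does.
            z = sum(x_i * W[i][j] for i, x_i in enumerate(x)) + b[j]
            row.append(phi(z))  # phi is applied to each element separately
        outputs.append(row)
    return outputs


def step(z):
    return 1 if z >= 0 else 0


X = [[1.0, 2.0], [3.0, -1.0]]             # m = 2 instances, 2 inputs
W = [[1.0, 0.0, -1.0], [0.5, -1.0, 1.0]]  # 2 inputs, 3 outputs
b = [-1.0, 0.5, 0.0]                      # one bias per output
H = dense_layer(X, W, b, step)
print(H)
```

The result has shape (m, n_out), with one binary output per neuron for each instance.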

Ken Basye  Mar 12, 2020  Aug 14, 2020
Printed, Safari Books Online
Page 302
Last paragraph on page

Instead of "If we were doing binary classification (with one or more binary labels)" This should be "If we were doing multi-label classification (with one or more binary labels)"

Note from the Author or Editor:
Thanks for your feedback. Indeed, this could have been clearer. I changed the sentence to: "If we were doing binary classification or multilabel binary classification"

Hamel Husain  Apr 03, 2020  Aug 14, 2020
Printed
Page 304
3rd paragraph

(2nd release) "... set the `sample_weight` argument (it supersedes `class_weight`)." Actually, tf.keras uses `sample_weight` x `class_weight`. Please check https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training_utils.py#L1035 Thanks.

Note from the Author or Editor:
Great catch! I replaced "(it supersedes `class_weight`)" with "(if both `class_weight` and `sample_weight` are provided, Keras multiplies them)". Thanks for your help!
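The arithmetic behind "Keras multiplies them" can be sketched in a few lines. This is only an illustration of the multiplication described above, not Keras code: the variable names and values are assumptions.

```python
# Hypothetical per-class and per-instance weights for a 4-instance batch.
class_weight = {0: 1.0, 1: 3.0}
y = [0, 1, 1, 0]
sample_weight = [1.0, 0.5, 2.0, 4.0]

# Effective weight of each instance when both are provided:
# its own sample weight times the weight of its class.
effective_weight = [sw * class_weight[label] for sw, label in zip(sample_weight, y)]
print(effective_weight)
```

So a per-instance weight does not override the class weight; an instance of class 1 with sample weight 2.0 ends up weighted 6.0 here.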

Haesun Park  Nov 15, 2019  Nov 22, 2019
Printed
Page 306
Using the model to make predictions.

We scaled the training/validation set features by dividing by 255.0. To obtain accurate performance metrics on the test set, we should also apply the same pre-processing step. current code: model.evaluate(X_test, y_test) ... X_new = X_test[:3] y_proba = model.predict(X_new)

Note from the Author or Editor:
Good catch, thanks! In the Jupyter notebook, the test set is properly scaled, but for some reason I did not include that line in the book. On page 298, just after scaling the training set and the validation set, I just added the following line in the book: X_test = X_test / 255.0

Francisco Javier Perez Leon  Jan 14, 2020  Mar 13, 2020
Printed
Page 306
1st paragraph

The book says "you should be able to reach close to 89% validation accuracy" if you continue training. However, on page 304, before the tip, the book says that the validation accuracy already reached 89.26% after 30 epochs. Training for 30 more epochs, I got 89.42% accuracy.

Note from the Author or Editor:
Good point, thanks. I replaced 89% with 89.4%.

Ian Beauregard  Sep 14, 2020  Sep 18, 2020
Printed
Page 314
Next to the last paragraph

In the code example for saving a model to HDF5 file, the first line should contain `keras.models.Sequential` instead of `keras.layers.Sequential`.

Note from the Author or Editor:
Great catch! Indeed, I meant to write `keras.models.Sequential` instead of `keras.layers.Sequential`. Thanks!

Dmitry Kabanov  Nov 14, 2019  Nov 22, 2019
Printed
Page 325
Question 10

I suggest replacing "98% precision" with "98% accuracy".

Note from the Author or Editor:
Good catch, thanks. I meant accuracy, not precision.

Ian Beauregard  Sep 17, 2020 
Printed
Page 328
Exercise 2

A closing parenthesis is missing before the OR operator on the last line.

Note from the Author or Editor:
Good catch, thanks. This should indeed have been: A xor B = (A and not B) or (not A and B) Replacing "xor", "and" and "not" with the appropriate symbols. Fixed! :)
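The corrected identity is easy to check exhaustively. The sketch below (the function name is an assumption) builds the full truth table and compares it with Python's own inequality test on booleans, which behaves as XOR.

```python
from itertools import product


def xor_via_and_or_not(a, b):
    # A xor B = (A and not B) or (not A and B)
    return (a and not b) or (not a and b)


truth_table = {
    (a, b): xor_via_and_or_not(a, b)
    for a, b in product([False, True], repeat=2)
}
for (a, b), result in truth_table.items():
    print(a, b, result)
```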

Ian Beauregard  Sep 11, 2020  Sep 18, 2020
Printed
Page 329-330
Last sentence

(2nd release) "... plotting the error, and finding the point where the error shoots up)." I think 'loss' would be better than 'error' here, because the Learning Rate section uses 'loss' to explain how to find the learning rate. Thanks.

Note from the Author or Editor:
Good point, I replaced "error" with "loss" in this sentence.

Haesun Park  Nov 15, 2019  Nov 22, 2019
Printed
Page 329 and 731
Exercise 6 and solution to Exercise 6

"Weight vector" should be replaced by "weight matrix" on both pages 329 and 731. On page 731, the first sentence following the colon should probably get its own item in the list (letter 'a') and on the last item in the list, Y should be boldfaced (now printed as Y*).

Note from the Author or Editor:
Great catches! Yes, I should have written "weight matrix" instead of "weight vector" on pages 329 and 731. I fixed the first formatting issue in February, it should be fine in the latest releases of the book. I just fixed the second issue (the Y in the last bullet point should be a boldface Y, not Y*). Thanks a lot.

Ian Beauregard  Sep 16, 2020 
Printed
Page 338
1st line under 1st code block

(3rd release) "LeakyRelu(alpha=0.2)" should be "LeakyReLU(alpha=0.2)". Thanks.

Note from the Author or Editor:
Good catch, thanks. It should indeed read LeakyReLU(alpha=0.2).

Haesun Park  Apr 03, 2020  Aug 14, 2020
Printed
Page 341
2nd line over the note.

(3rd release) In "TFLite's optimizer does this automatically", I suggest changing 'optimizer' to 'converter', because we usually say "TFLite converter" (as in Ch. 19), and this avoids confusion with Keras optimizers. Thanks.

Note from the Author or Editor:
Good point, it's clearer with "TFLite's converter". Thanks!

Haesun Park  Apr 03, 2020  Aug 14, 2020
Printed
Page 344
2nd to last paragraph

"... but the `fit()` method sets to it to 1" should be "... but the `fit()` method sets it to 1."

Note from the Author or Editor:
Good catch, thanks. Indeed, it should have been "...but the `fit()` method sets it to 1".

Ian Beauregard  Sep 23, 2020 
PDF
Page 347
last paragraph

"you clone model A’s architecture with clone.model()" => clone_model() instead of clone.model()

Note from the Author or Editor:
Good catch, thanks! I fixed this typo, it should be good in the next reprint.

Safouane Chergui  Oct 19, 2019  Nov 22, 2019
Printed
Page 347
last paragraph

The original last line of the last paragraph reads: "To do this, you clone model A’s architecture with clone.model(), then copy its weights (since clone_model() does not clone the weights):" But there is no such function as clone.model(); it should be clone_model().

Note from the Author or Editor:
Great catch, thanks. Indeed, it should be clone_model(), not clone.model().

Dhruba Ray  Jul 12, 2020  Aug 14, 2020
Printed
Page 354
Figure 11-6

I suggest adding a negative sign before η∇_1 and η∇_2.

Note from the Author or Editor:
Oh yikes, you're absolutely right! Thanks, I'm updating the figure now.

Ian Beauregard  Sep 24, 2020 
Printed
Page 356
Eq. 11-8

(2nd release) T should be t in the 3rd and 4th equations, because the next sentence is ".. t represents the iteration number (starting at 1).". Thanks.

Note from the Author or Editor:
Great catch! Indeed it should be a lowercase italic _t_. Thanks!

Haesun Park  Nov 15, 2019  Nov 22, 2019
Printed
Page 357
Below AdaMax

(2nd release) ".. the gradients in s (with a greater weight for more recent weights)." I'm not sure, but did it mean 'recent gradients'? Thanks

Note from the Author or Editor:
Great catch, I meant to write "recent gradients", not "recent weights". Thanks!

Haesun Park  Nov 15, 2019  Nov 22, 2019
Printed
Page 368
code example

In the code example on page 367 you create a sequential keras model called "model". On page 368 you call this model directly on the test set as follows: model(X_test_scaled, training=True) Perhaps I missed something, but I don't remember any explanation about what happens when you call a sequential model directly on a test set. (I'm assuming the model has been compiled and fit in the meanwhile, but that this code was omitted for brevity.) I would expect to see a method call, such as: model.evaluate(X_test_scaled, training=True) I expect this is just a typo (omitting the method)? If this is indeed the intended code, could you clarify what it means to call such a model directly? Thanks for the great book!

Note from the Author or Editor:
Thanks for your feedback. That's a great question. A Keras model can be used like a regular Keras layer (in Chapter 12, we see how this makes it possible to easily compose models containing other models). Just like any layer, you can thus pass any NumPy array or TF tensor to a model directly, using the model like a function (you can do this with any layer, as we saw in the Functional API): X = tf.constant([...]) # or np.array([...]) model(X) # returns a TensorFlow tensor model.predict(X) # returns a NumPy array model(X) is similar to model.predict(X) except it returns a TF tensor rather than a NumPy array. Another difference is that model(X) can be used in the Functional API, while model.predict(X) cannot. For example: input_A = keras.layers.Input(...) output_A = model(input_A) enclosing_model = keras.Model(inputs=[input_A, ...], outputs=[output_A, ...]) Lastly, the `training` argument is only available when using model(X), such as in model(X, training=True). This argument is not available when calling model.predict(X). To clarify this, I replaced the following sentences: """ We just make 100 predictions over the test set, setting `training=True` to ensure that the `Dropout` layer is active, and stack the predictions. Since dropout is active, all the predictions will be different. Recall that `predict()` returns a matrix with one row per instance and one column per class. """ with these: """ Note that `model(X)` is similar to `model.predict(X)` except it returns a tensor rather than a NumPy array, and it supports the `training` argument. In this code example, setting `training=True` ensures that the `Dropout` layer remains active, so all predictions will be a bit different. We just make 100 predictions over the test set, and we stack them. Each call to the model returns a matrix with one row per instance and one column per class. """ Thanks again!

Willem  Apr 09, 2020  Aug 14, 2020
Printed, Safari Books Online
Page 373
None

I have the printed book, and at the end of chapter 11 there are no questions 9 and 10. However, both GitHub and Appendix A refer to questions 9 and 10. There are also no questions 9 and 10 on the website.

Note from the Author or Editor:
Thanks for your feedback. Indeed, I fixed the appendix A to say that the solution to question 8 is available on github (there are no questions 9 and 10). I also pushed the solution to this exercise on github.

Yagel  Dec 18, 2019  Mar 13, 2020
Printed
Page 379
Under 'Using TensorFlow like NumPy'

(2nd release) "A tensor is usually a multidimensional array (exactly like a NumPy ndarray), but it can also hold a scalar (a simple value, such as 42)". This seems to imply that NumPy can't hold a scalar, but as you may know, NumPy has array scalars too:
```python
import numpy as np

s = np.array(3)
print(s, type(s))
```
3 <class 'numpy.ndarray'>
Thanks.

Note from the Author or Editor:
Good point, people could indeed interpret this as meaning that NumPy does not support scalar. I rewrote the sentence like this: A tensor is very similar to a NumPy `ndarray`: it is usually a multidimensional array, but it can also hold a scalar (a simple value, such as `42`).

Haesun Park  Nov 15, 2019  Nov 22, 2019
Printed
Page 381
Last sentence in box

"Here is as simple example" should be "Here is a simple example."

Note from the Author or Editor:
Good catch, thanks.

Ian Beauregard  Sep 28, 2020 
Printed
Page 383
RaggedTensor block

(3rd release) In "Represent static list of lists of tensors", what does "static list" mean? As you know, a RaggedTensor is a tensor that behaves like a nested variable-length list. Also, the text says "every tensor has the same shape and data type", but the lists in a RaggedTensor can have different shapes. Thanks.

Note from the Author or Editor:
Thanks for your question. By "static" I meant "immutable". I updated the book to remove the word "static", as most data structures are immutable anyway (except for Queues and TensorArrays).

Haesun Park  Apr 03, 2020  Aug 14, 2020
Printed
Page 386
1st bullet

(2nd release) In last sentence, other possible values are "sum" and "none" instead of None. Please check https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/losses/loss_reduction.py#L57 Thanks.

Note from the Author or Editor:
Good catch, indeed it should be: Other possible values are `"sum"` and `"none"`. Instead of: Other possible values are are `"sum"` and `None`. Thanks!

Haesun Park  Nov 15, 2019  Nov 22, 2019
Printed
Page 387
"# return value is just tf.nn.softplus(z)"

```
# The softplus function as defined is technically not equivalent to tf.nn.softplus.
# The former is not numerically stable whereas the latter is.
# Please refer to https://stackoverflow.com/questions/44230635/avoid-overflow-with-softplus-function-in-python
# for details on numerically stable softplus, as well as the code below.
# (Note: the code is mine; the stackoverflow answer is not.)
import numpy as np
import tensorflow as tf

softplus_numpy = lambda a: np.log(np.exp(a)+1.0)
softplus_tensorflow = lambda a: tf.math.log(tf.exp(a)+1.0)
softplus_numpy_numerically_stable = lambda a: np.maximum(a,0)+softplus_numpy(-np.abs(a))
softplus_tensorflow_numerically_stable = lambda a: tf.maximum(a,0)+softplus_tensorflow(-tf.abs(a))

a = 10.0**(np.arange(9)-4)
a = np.array([a, -a])

print(a,'\n')
print(softplus_numpy(a),'\n')
print(softplus_numpy_numerically_stable(a),'\n')
print(softplus_tensorflow(a),'\n')
print(softplus_tensorflow_numerically_stable(a),'\n')
print(tf.nn.softplus(a)-softplus_numpy_numerically_stable(a),'\n')
print(tf.nn.softplus(a)-softplus_tensorflow_numerically_stable(a),'\n')
```
output:
```
[[ 1.e-04 1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03 1.e+04]
 [-1.e-04 -1.e-03 -1.e-02 -1.e-01 -1.e+00 -1.e+01 -1.e+02 -1.e+03 -1.e+04]]

[[6.93197182e-01 6.93647306e-01 6.98159681e-01 7.44396660e-01 1.31326169e+00
  1.00000454e+01 1.00000000e+02 inf inf]
 [6.93097182e-01 6.92647306e-01 6.88159681e-01 6.44396660e-01 3.13261688e-01
  4.53988992e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00]]

[[6.93197182e-01 6.93647306e-01 6.98159681e-01 7.44396660e-01 1.31326169e+00
  1.00000454e+01 1.00000000e+02 1.00000000e+03 1.00000000e+04]
 [6.93097182e-01 6.92647306e-01 6.88159681e-01 6.44396660e-01 3.13261688e-01
  4.53988992e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00]]

tf.Tensor(
[[6.93197182e-01 6.93647306e-01 6.98159681e-01 7.44396660e-01 1.31326169e+00
  1.00000454e+01 1.00000000e+02 inf inf]
 [6.93097182e-01 6.92647306e-01 6.88159681e-01 6.44396660e-01 3.13261688e-01
  4.53988992e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00]], shape=(2, 9), dtype=float64)

tf.Tensor(
[[6.93197182e-01 6.93647306e-01 6.98159681e-01 7.44396660e-01 1.31326169e+00
  1.00000454e+01 1.00000000e+02 1.00000000e+03 1.00000000e+04]
 [6.93097182e-01 6.92647306e-01 6.88159681e-01 6.44396660e-01 3.13261688e-01
  4.53988992e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00]], shape=(2, 9), dtype=float64)

tf.Tensor(
[[ 1.11022302e-16 0.00000000e+00 -1.11022302e-16 1.11022302e-16 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  -5.88857305e-18 3.72007598e-44 0.00000000e+00 0.00000000e+00]], shape=(2, 9), dtype=float64)

tf.Tensor(
[[ 1.11022302e-16 0.00000000e+00 -1.11022302e-16 1.11022302e-16 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  -5.88857305e-18 3.72007598e-44 0.00000000e+00 0.00000000e+00]], shape=(2, 9), dtype=float64)

/Users/mlhull5148/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: overflow encountered in exp
  if __name__ == '__main__':
```

Note from the Author or Editor:
Thanks for your great feedback and working code. Indeed, this numerical instability is definitely worth noting in the book. I replaced: def my_softplus(z): # return value is just tf.nn.softplus(z) return tf.math.log(tf.exp(z) + 1.0) With: def my_softplus(z): # note: tf.nn.softplus(z) better handles large inputs return tf.math.log(tf.exp(z) + 1.0) Thanks again! Aurelien
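For readers without TensorFlow at hand, the same instability can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration, not the book's code: the function names are made up, and it uses `math.log1p` in place of `log(1 + ...)`. One difference from the NumPy output above: plain Python's `math.exp` raises `OverflowError` on large inputs instead of returning `inf`.

```python
import math


def softplus_naive(z):
    # Direct translation of log(exp(z) + 1): overflows for large z.
    return math.log(math.exp(z) + 1.0)


def softplus_stable(z):
    # max(z, 0) + log(1 + exp(-|z|)): the exponent is never positive,
    # so exp() cannot overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


print(softplus_naive(1.0), softplus_stable(1.0))
print(softplus_stable(1000.0))

try:
    softplus_naive(1000.0)
except OverflowError as error:
    print("naive version failed:", error)
```

Both versions agree on moderate inputs; only the rewritten form survives large ones.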

Chris Coffee  Aug 18, 2020  Sep 18, 2020
Printed
Page 392
Footnote 8

(2nd release) keras.activations.get() is available in multi-backend keras. Please check https://github.com/keras-team/keras/blob/master/keras/activations.py#L211 "You could use use `keras.activations.Activation` instead" should be "You could use use `keras.layers.Activation` instead". Thanks

Note from the Author or Editor:
Great catch, indeed I meant to write `keras.layers.Activation` instead of `keras.activations.Activation`, thanks!

Haesun Park  Nov 15, 2019  Nov 22, 2019
Printed
Page 421
code at top of the page

In the code for def csv_reader_dataset(...): dataset.shuffle(..) and .repeat(...) should be before .map(...) and .interleave(...), respectively, to agree with Figure 13.2 and avoid having to preprocess the whole shuffle buffer before being able to produce a batch.

Note from the Author or Editor:
Nice catch! I'll make the code consistent with the figure. :) Note however that it makes little difference in terms of performance: the map() function does not actually transform the whole dataset before passing the data on to the next step. Instead, it transforms just what is needed for the next steps, on the fly. You can think of each step as a queue which waits until a consumer (i.e., the next step) tries to pull elements out of it before it pulls elements from the previous queue. So it's easier to understand when you start from the end of the pipeline: looking at the code, the prefetch() method pulls from the batch() method, which pulls from the shuffle() method, which pulls from the map() method, and so on. The batch() method only pulls as many elements as are required to fill the batch, so whether the map() method or the shuffle() method is first makes little difference: in both cases, if the batch size is 32, then only 32 items will be shuffled by the shuffle() method and preprocessed by the map() method (except when pulling the first batch, which requires first filling up the shuffle buffer, in both cases).
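The pull-based behaviour described above can be imitated with Python generators. This is only a toy model of a lazy pipeline, not tf.data: `counting_map`, `shuffle`, `batch`, and `mapped_per_batch` are hypothetical helpers, and the buffer size of 100 is an arbitrary assumption. The counter records how many items have been preprocessed after each batch is pulled.

```python
import itertools
import random


def counting_map(fn, upstream, counter):
    """Lazily apply fn, counting how many items were actually transformed."""
    for item in upstream:
        counter["mapped"] += 1
        yield fn(item)


def shuffle(upstream, buffer_size, rng):
    """Fill a buffer, then emit a random element and refill from upstream."""
    upstream = iter(upstream)
    buffer = list(itertools.islice(upstream, buffer_size))
    while buffer:
        idx = rng.randrange(len(buffer))
        item = buffer[idx]
        try:
            buffer[idx] = next(upstream)  # pull exactly one replacement
        except StopIteration:
            buffer.pop(idx)
        yield item


def batch(upstream, batch_size):
    """Pull only as many elements as are needed to fill each batch."""
    upstream = iter(upstream)
    while True:
        chunk = list(itertools.islice(upstream, batch_size))
        if not chunk:
            return
        yield chunk


def mapped_per_batch(map_first, n_batches=3, batch_size=32, buffer_size=100):
    counter = {"mapped": 0}
    rng = random.Random(42)
    source = iter(range(10_000))
    preprocess = lambda x: x * 2
    if map_first:
        stream = shuffle(counting_map(preprocess, source, counter), buffer_size, rng)
    else:
        stream = counting_map(preprocess, shuffle(source, buffer_size, rng), counter)
    batches = batch(stream, batch_size)
    counts = []
    for _ in range(n_batches):
        next(batches)
        counts.append(counter["mapped"])  # cumulative count after each batch
    return counts


map_then_shuffle = mapped_per_batch(map_first=True)
shuffle_then_map = mapped_per_batch(map_first=False)
print(map_then_shuffle, shuffle_then_map)
```

In this toy model the only difference between the two orderings is whether the initial buffer fill is preprocessed up front; after the first batch, each further batch triggers exactly 32 more preprocessing calls either way.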

Wolfram Helwig  Jan 21, 2020  Mar 13, 2020
Printed
Page 439
1st paragraph

(3rd Release) "the final vector will be [1/log(200), 0/log(10), 2/log(100)]" is not correct. The TextVectorization class uses `log(1 + total_num_of_docs / (1 + num_of_docs_which_contain_word))` to compute IDF. Please check https://github.com/tensorflow/tensorflow/blob/da5765ebad2e1d3c25d11ee45aceef0b60da499f/tensorflow/python/keras/layers/preprocessing/text_vectorization.py#L770 Thanks.

Note from the Author or Editor:
Thanks a lot for your feedback. There are so many variants of TF-IDF. The TextVectorization class uses f * log(1 + N/(1+n)), where:
* f is the number of occurrences of the term in the document
* N is the total number of documents
* n is the number of documents where the term occurs.
This variant of TF-IDF is not listed in the TF-IDF Wikipedia page: https://en.wikipedia.org/wiki/Tf%E2%80%93idf The Term-Frequency part (f) is standard (it's called the "raw count" in the Wikipedia page), however the Inverse-Document-Frequency part (log(1+N/(1+n))) is not. It is close to log(N/n), which is the "default" IDF, but it uses 1+n instead of n, probably to avoid a possible division by zero, and it adds 1 to N/(1+n), probably to avoid approaching log(0). I think these extra +1s are a bit too much of a technical detail to mention in the TF-IDF paragraph in the book, but I changed the paragraph to present the proper IDF term log(N/n) rather than 1/log(n) (I remember trying to make things extra simple, but I probably went too far, as this variant is not listed in the Wikipedia page). Here is the updated paragraph: """ [...] This is often done using a technique called _Term-Frequency_ × _Inverse-Document-Frequency_ (TF-IDF). There are many variants, but a common one consists in computing the ratio of training instances in which the word appears, and multiplying the word count by the log of the inverse of that ratio. For example, let's imagine that the words `"and"`, `"basketball"`, and `"more"` appear respectively in 90%, 10%, and 50% of all text instances in the training set: in this case, the final vector will be `[1*log(1/0.9), 0*log(1/0.1), 2*log(1/0.5)]`, which is approximately equal to `[0.1, 0.0, 1.4]`. The `TextVectorization` layer will have an option to perform TF-IDF. """ Thanks again!
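The numbers in the updated paragraph can be reproduced with a few lines of standard-library Python. The second helper implements the f * log(1 + N/(1+n)) variant exactly as quoted in the note above; its name and the sample values passed to it are assumptions, and the actual layer may have changed since.

```python
import math

# Word counts for "and", "basketball", "more" in one instance, and the
# fraction of training instances in which each word appears.
counts = [1, 0, 2]
doc_ratio = [0.9, 0.1, 0.5]

# "Default" variant: count * log(1 / ratio), i.e. f * log(N / n).
tfidf = [f * math.log(1 / r) for f, r in zip(counts, doc_ratio)]
rounded = [round(value, 1) for value in tfidf]
print(rounded)


def smoothed_tfidf(f, N, n):
    """Variant quoted in the note: f * log(1 + N / (1 + n)) (illustrative helper)."""
    return f * math.log(1 + N / (1 + n))


# Hypothetical corpus: 100 documents, term present in 49 of them, 2 occurrences here.
print(smoothed_tfidf(2, 100, 49))
```

The extra +1 in the denominator keeps the second variant defined even for a term that appears in no document (n = 0).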

Haesun Park  Dec 12, 2019  Aug 14, 2020
Printed
Page 441
In a tip box

(3rd release) The `load()` function doesn't shuffle shards by default (`shuffle_files=False`). Please check https://www.tensorflow.org/datasets/api_docs/python/tfds/load The test set can be shuffled too, if it uses multiple shards. Please check https://github.com/tensorflow/datasets/blob/845e4d0e1dfa73060ab2f6cfdf7ba342434e4def/tensorflow_datasets/image/celeba.py#L148

Note from the Author or Editor:
Thanks for your feedback. When I run tfds.load(...) with an old version of TFDS (1.2.0), I get the following warning: WARNING:absl:Warning: Setting shuffle_files=True because split=TRAIN and shuffle_files=None. This behavior will be deprecated on 2019-08-06, at which point shuffle_files=False will be the default for all splits. So it seems that the logic changed since I wrote that chapter. I updated the tip to this: The `load()` function can shuffle the files it downloads: just set `shuffle_files=True`. However, this may be insufficient, so it's best to shuffle the training data some more.

Haesun Park  Dec 12, 2019  Mar 13, 2020
Printed
Page 453
Eq. 14-1

(3rd release) x_{i', j', k'} \cdot w_{u, v, k', k} should be x_{i', j', k'} \times w_{u, v, k', k} Thanks.

Note from the Author or Editor:
Indeed, it would make it a bit clearer that this is a multiplication, not a dot product. And more consistent with the right part of the equation. Fixed, thanks! :)

Haesun Park  Dec 12, 2019  Mar 13, 2020
Printed
Page 458
1st paragraph

(3rd release) "(but there is still 75% invariance)" should be "(but there is still 50% invariance)". Thanks.

Note from the Author or Editor:
Nice catch, indeed, 50% of the output pixels remain unchanged, and 50% change. Fixed, thanks!

Haesun Park  Dec 12, 2019  Mar 13, 2020
Printed
Page 462
1st bullet

(3rd release) I suggest that "no stride" is replaced with "stride 1" to prevent misunderstanding. Thanks.

Note from the Author or Editor:
Indeed, it's clearer. Fixed, thank you.

Haesun Park  Dec 12, 2019  Mar 13, 2020
PDF
Page 466
3rd to last paragraph

The AlexNet hyper-parameters for local response normalization do not seem to match up to what's mentioned in the paper. In Section 3.3 of the paper the hyper-parameters are set at k=2, r=5 (which is called n in the paper), alpha=0.0001, and beta=0.75 but in the textbook they're set at k=1, r=2, alpha=0.00002, and beta=0.75.

Note from the Author or Editor:
Good catch, thanks! Mmh, I wonder where I got these wrong numbers from, I certainly didn't invent them. I suspect I was looking at a specific AlexNet implementation. Or maybe I just needed more coffee... Anyway, thanks again, this is fixed now.

Amrit Purshotam  Jun 10, 2020  Aug 14, 2020
Printed
Page 491
Last paragraph in mAP box

(3rd release) COCO makes no distinction between AP and mAP. But I suggest that "(noted AP@[.50:.95] or AP@[.50:0.05:.95])" is replaced with "(noted mAP@[.50:.95] or mAP@[.50:0.05:.95])" to match the sentence "Yes, that's a mean mean average". :) Thanks.

Note from the Author or Editor:
Indeed, they seem to use both AP@ or mAP@. Changed, thanks!

Haesun Park  Dec 12, 2019  Mar 13, 2020
Printed
Page 501
Last sentence

I suggest replacing "more complex than in Figure 15-4 suggests" with "more complex than what Figure 15-4 suggests".

Note from the Author or Editor:
Good catch, thanks.

Ian Beauregard  Oct 09, 2020 
Printed
Page 510
footnote 2

filter_size=1 should be kernel_size=1

Note from the Author or Editor:
Good catch, thanks. Indeed, it should be kernel_size, not filter_size.

Wolfram Helwig  Jan 15, 2020  Mar 13, 2020
Printed
Page 527
code after second paragraph

It should be tokenizer.fit_on_texts(shakespeare_text) instead of tokenizer.fit_on_texts([shakespeare_text]) so that the subsequent call dataset_size = tokenizer.document_count # total number of characters really returns the number of characters in shakespeare_text (1115394). Otherwise (with square brackets), dataset_size will be equal to the number of submitted documents (1 in this case).

Note from the Author or Editor:
Good catch, thanks a lot! Indeed, the code should be: tokenizer.fit_on_texts(shakespeare_text) instead of: tokenizer.fit_on_texts([shakespeare_text]) I fixed the code in the book (note that the code in the Jupyter notebook was correct). Thanks again!
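Why the square brackets matter can be illustrated without Keras. The sketch below assumes, as a simplification, that the tokenizer counts one document per item it gets when iterating over its argument; `count_documents` is a hypothetical stand-in, not the Keras implementation.

```python
def count_documents(texts):
    """Count one 'document' per item yielded by iterating over texts."""
    document_count = 0
    for _ in texts:
        document_count += 1
    return document_count


sample_text = "First Citizen"

# Iterating over a string yields its characters, one per "document".
as_string = count_documents(sample_text)
# Iterating over a one-element list yields the whole text as a single document.
as_list = count_documents([sample_text])
print(as_string, as_list)
```

Under that assumption, passing the bare string makes the document count equal to the number of characters, while wrapping it in a list always gives 1.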

Christoph Brauer  Nov 30, 2019  Mar 13, 2020
Printed
Page 530
First paragraph

I suggest replacing the first full sentence of the page with : "Then we can batch the windows and separate the inputs (the first 100 characters) from the targets (the last 100 characters)." At present, the sentence reads "... from the target (the last character)."

Note from the Author or Editor:
Oh wow, great catch! Indeed, the current text does not match the code example. :/ Thanks a lot!
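The corrected sentence can be checked on a tiny example. The sketch below is not the book's tf.data code: it uses plain string slicing and an assumed window of 5 steps instead of 100 to keep the output readable. Each window has one more character than the number of steps; the inputs are all characters but the last, and the targets are all characters but the first.

```python
text = "To be, or not to be"
n_steps = 5                  # the book uses 100; 5 keeps the example short
window_length = n_steps + 1  # one extra character for the shifted targets

# Every window of window_length consecutive characters.
windows = [
    text[i:i + window_length]
    for i in range(len(text) - window_length + 1)
]

# Inputs: first n_steps characters. Targets: last n_steps characters.
pairs = [(window[:-1], window[1:]) for window in windows]
print(pairs[0])
```

The targets are therefore a full sequence shifted by one character, not a single final character.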

Ian Beauregard  Oct 15, 2020 
Printed
Page 533
1st paragraph

"Window" should be "windows" in the sentence "... and the following batch would not continue each of these window where it left off".

Note from the Author or Editor:
Good catch, thanks.

Ian Beauregard  Oct 15, 2020 
Printed
Page 535
2nd paragraph

"Start-of-sequence (SSS)" should probably be "start-of-sequence (SOS)", considering the code block that follows.

Note from the Author or Editor:
Good catch, that was a typo, it should be SoS, not SSS. Thanks!

Ian Beauregard  Oct 16, 2020 
Printed
Page 551
11th line from the bottom

(2nd release) "(i.e., h_(f)) rather than h_(t-1))" should be "(i.e., h_(f) rather than h_(t-1))" Thanks

Note from the Author or Editor:
Good catch, thanks. However, it's h(t), not h(f): (i.e., h_(t) rather than h_(t-1))

Haesun Park  Feb 05, 2020  Mar 13, 2020
Printed
Page 556
6th line from the bottom

(2nd release) The "Attention Is All You Need: The Transformer Architecture" section uses both "positional encoding" and "positional embedding". The paragraph that starts with "The positional embedding are simply dense vectors..." explains the component in Figure 16-8, so I suggest changing it to "positional encoding". TensorFlow has a positional_embedding layer, but I think "positional encoding" is the more common term. How about using just one of the two terms? :) Thanks

Note from the Author or Editor:
Good point, thanks. After checking the original paper, it seems that they consistently use the term "Positional Encoding", except when they talk about "Trainable Positional Embeddings". So I replaced every occurrence of the word "Positional Embedding" with "Positional Encoding", including in the code example on page 558.

Haesun Park  Feb 05, 2020  Mar 13, 2020
Printed
Page 557
Last paragraph

"... and represented at the bottom of Figure 16-9 (transposed)..." I think "bottom" should be replaced with "top". Note: In my copy of the book, this mistake was corrected in the caption of Figure 16-9, but not in the body of the text.

Note from the Author or Editor:
Good catch, thanks.

Ian Beauregard  Oct 16, 2020 
Printed
Page 558
1st paragraph

The word "bottom" should be replaced with "top" in "... the vertical dashed line at the bottom left of Figure 16-9..."

Note from the Author or Editor:
Good catch, thanks.

Ian Beauregard  Oct 17, 2020 
Printed
Page 560
2nd line

(2nd release) "d_values is the number of each value" should be "d_values is the number of dimensions of each value". Thanks

Note from the Author or Editor:
Good catch, thanks! Fixed.

Haesun Park  Feb 05, 2020  Mar 13, 2020
Printed
Page 562
picture at the top (figure 16-10)

In the picture it looks like the linear transformation happens before the copying for each scaled dot-product attention head. This would mean that every head gets the same input, which would be useless. Rather, the inputs should be copied first, then transformed differently for each head. This is also what the equivalent figure in the current version of ‘Attention Is All You Need’ shows.

Note from the Author or Editor:
Thanks for your feedback, this is a great observation. I think the reason why the paper originally had a figure which showed a Linear step followed by a Split step (for the value V, the key K and the query Q) is that it was probably the way they implemented the algorithm. In the updated diagram, they now hide this implementation detail to focus more on what the algorithm does, conceptually. Let me explain what I mean using NumPy. Suppose you want to apply two different linear transformations A and B to the same inputs X:

    import numpy as np
    X = np.array([[10., 20.], [30., 40.]])
    A = np.array([[2., 3., 4.], [5., 6., 7.]])
    B = np.array([[8., 9., 10.], [11., 12., 13.]])

One approach is to compute this:

    R1 = X @ A
    R2 = X @ B

Recall that @ represents matrix multiplication. In this example, this gives the following results:

    >>> R1
    array([[120., 150., 180.], [260., 330., 400.]])
    >>> R2
    array([[300., 330., 360.], [680., 750., 820.]])

Now, another approach is to concatenate A and B horizontally into a new matrix M, then compute X @ M:

    M = np.concatenate([A, B], axis=1)
    R = X @ M

Notice that R is just the horizontal concatenation of R1 and R2:

    >>> R
    array([[120., 150., 180., 300., 330., 360.], [260., 330., 400., 680., 750., 820.]])

So all we need to do to get R1 and R2 is to split R appropriately:

    R1 = R[:, 0:3]
    R2 = R[:, 3:6]

You can see that this approach gives the same result as earlier. One advantage of this approach is that it requires a single big matrix multiplication, rather than multiple small ones, so it is faster, especially on a GPU. Moreover, the concatenation step is not needed in practice, since we don't need to have multiple transformation matrices in the first place: a single big matrix will do (it's a single trainable variable instead of multiple ones). However, this is an implementation detail, so it's probably best left out of the book (just as the authors of the paper judged that it was best left out of the paper).
I'll get the latest version of the diagram for the next release of my book. Thanks again for your feedback!
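
The same equivalence holds for any number of heads. The sketch below is only an illustration: the sizes, the random weights and names such as W_heads and W_big are assumptions, not code from the book or the paper. It checks that "copy, then transform per head" and "one big linear layer, then split" produce identical per-head projections.

```python
import numpy as np

rng = np.random.default_rng(42)
batch, d_model, n_heads, d_head = 4, 8, 3, 5   # illustrative sizes

X = rng.normal(size=(batch, d_model))
# One (hypothetical) projection matrix per attention head.
W_heads = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]

# Approach 1: copy the inputs, then transform them differently for each head.
per_head = [X @ W for W in W_heads]

# Approach 2: a single big linear layer, then split the result into heads.
W_big = np.concatenate(W_heads, axis=1)        # shape (d_model, n_heads * d_head)
split = np.split(X @ W_big, n_heads, axis=1)

same = all(np.allclose(a, b) for a, b in zip(per_head, split))
print(same)
```

In a real layer only W_big would exist as a trainable variable; the per-head matrices are just its column blocks.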

Richard Möhn  Dec 16, 2019  Mar 13, 2020
Printed
Page 588
Equation 17-3 and following paragraph

Equation 17-3 (p. 588) has variable K, but the following paragraph doesn't define K and instead defines n, which is not in the equation

Note from the Author or Editor:
Great catch, thanks. Indeed, the K should be an n in this equation, as well as in equation 17-4. Fixed!

Patrick Coulombe  Jan 26, 2020  Mar 13, 2020
PDF
Page 602
2nd paragraph

In the second paragraph the book says: "For example, when growing the generator’s outputs from 4 × 4 to 8 × 8 (see Figure 17-19), an upsampling layer (using nearest neighbor filtering) is added to the existing convolutional layer, so it outputs 8 × 8 feature maps, which are then fed to the new convolutional layer (which uses "same" padding and strides of 1, so its outputs are also 8 × 8). This new layer is followed by a new output convolutional layer: this is a regular convolutional layer with kernel size 1 that projects the outputs down to the desired number of color channels (e.g., 3)." It seems that you are talking about an Upsampling layer, a conv layer with same padding and a kernel size not equal to 1, and a final conv layer with kernel size 1. However, I can't see any conv block before the output conv layer (with kernel size 1). Did I miss something? Can you clarify this issue? Thank you very much.

Note from the Author or Editor:
Thanks for your feedback, I'm sorry this section wasn't clear enough. This paragraph describes what is added in the right side of Figure 17-19 compared to the left side. This includes the Upsampling layer, plus the two new Convolutional layers (with dashed borders), and the components needed to perform the "fade-in" operation (i.e., the *alpha operation, the *(1-alpha) operation, and the + operation). The 4 other layers are just the same as the ones on the left part of the figure: this includes the Noise layer, the Dense layer, the Conv 1 layer and the original Output Conv Layer (the one with a solid border). If the transition were abrupt, without any fade-in mechanism, then we would just remove the original output layer (called "Out conv", with a solid border) instantly and add the new layers directly: the Upsampling layer and the two new convolutional layers (with dashed borders), and there would be no need for the fade-in operations. Another thing that might have confused you is the fact that the original output convolutional layer now outputs 8x8 feature maps. This is not because it was changed in any way, it's just because it now receives 8x8 inputs instead of 4x4 inputs. It really is exactly the same "Out conv" layer as on the left side of the figure. I hope this helps! I'll see what I can do to make this clearer in the book. Thanks again for your feedback.
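
The fade-in described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's implementation: the channel counts, the random kernels, the helper names, and the use of tanh as a stand-in for the new 3 × 3 "same" convolution are all hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbor upsampling of a (height, width, channels) array."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def conv1x1(x, kernel):
    """A 1 x 1 convolution is a per-pixel linear map over the channels."""
    return x @ kernel

rng = np.random.default_rng(0)
features_4x4 = rng.normal(size=(4, 4, 16))   # output of the existing conv layer
old_out_kernel = rng.normal(size=(16, 3))    # original output conv (solid border)
new_out_kernel = rng.normal(size=(16, 3))    # new output conv (dashed border)

upsampled = upsample_nearest(features_4x4)   # now 8 x 8 x 16
# Stand-in for the new "same"-padded conv layer: anything that keeps the 8 x 8 size.
new_features = np.tanh(upsampled)

def faded_output(alpha):
    # The unchanged old output layer simply receives 8 x 8 inputs now.
    old_branch = conv1x1(upsampled, old_out_kernel)
    new_branch = conv1x1(new_features, new_out_kernel)
    return alpha * new_branch + (1 - alpha) * old_branch

print(faded_output(0.0).shape)
```

With alpha = 0 the output comes entirely from the original output layer, and with alpha = 1 entirely from the new layers, so increasing alpha from 0 to 1 fades the new layers in.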

Hadi  Sep 10, 2020  Sep 18, 2020
Printed
Page 627
Equation 18-1 (1st release)

If I'm not mistaken it should be \sum_{s'} not \sum_{s}

Note from the Author or Editor:
Oh, great catch, thanks a lot, this was a typo. Equation 18-1 should sum over s', not over s. FYI, there's also a typo in equation 18-3: it should say "for all (s, a)", not "for all (s' a)".

Julien Theron  Apr 28, 2020  Aug 14, 2020
Printed
Page 628
Eq. 18-3

(2nd release) "for all (s' a)" should be "for all (s', a)". Thanks

Note from the Author or Editor:
Great catch! Actually, it should be "for all (s, a)" Thanks!
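
To make the two corrections concrete, here is a small Q-Value Iteration sketch on a made-up MDP. The transition probabilities T, the rewards R and the discount factor are invented for illustration. The sum runs over the next state s' (the last axis), and the update is applied for all (s, a) pairs.

```python
import numpy as np

# A tiny made-up MDP with 2 states and 2 actions.
# T[s, a, s2] = probability of reaching state s2 from state s with action a.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.5, 0.5]]])
# R[s, a, s2] = reward received for that transition.
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [5.0, -1.0]]])
gamma = 0.9  # discount factor

Q = np.zeros((2, 2))
for _ in range(500):
    V = Q.max(axis=1)                        # max over a' of Q_k(s', a')
    # Sum over the next state s' (axis 2), for all (s, a) at once.
    Q = (T * (R + gamma * V)).sum(axis=2)

print(Q)
```

Summing over s instead of s' would collapse the wrong axis: the result would no longer be indexed by the current state.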

Haesun Park  Feb 05, 2020  Mar 13, 2020
Printed
Page 636
code snippet (training_step)

Even though the code runs without a problem, the algorithm won't be trained properly because the loss is computed incorrectly. The loss function in this case (mean_squared_error) expects two lists of lists: one is the Q_values list (which is correct) and the other is target_Q_values (this is where the problem lies). For a quick fix to test, you could do something like this: target_Q_values = [[el] for el in target_Q_values] Now if you compare the two (I tested with 10,000 iterations), you should see a big difference.

Note from the Author or Editor:
Thanks a lot for your feedback, that's a great catch. Indeed, target_Q_values should be a column vector. I added the following line just after the definition of target_Q_values, to convert it from a 1D array to a column vector: target_Q_values = target_Q_values.reshape(-1, 1) I fixed the book and the notebook, and I added a comment about this in the notebook.
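
The shape problem can be reproduced with plain NumPy, which follows the same broadcasting rules. This is a sketch with made-up values rather than the book's training code: subtracting a 1D array of shape (3,) from a column vector of shape (3, 1) silently yields a 3 × 3 matrix, so the error is averaged over every pair of experiences instead of element by element.

```python
import numpy as np

Q_values = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1): one Q-value per experience
targets_1d = np.array([1.0, 2.0, 3.0])       # shape (3,): same values, wrong shape

# Broadcasting (3,) against (3, 1) produces a (3, 3) matrix of differences.
wrong_diff = targets_1d - Q_values
wrong_mse = np.mean(wrong_diff ** 2)

# Reshaping the targets to a column vector gives element-wise differences.
targets_col = targets_1d.reshape(-1, 1)
right_mse = np.mean((targets_col - Q_values) ** 2)

print(wrong_diff.shape, wrong_mse, right_mse)
```

Although the targets equal the predictions here, the mis-shaped version reports a nonzero loss, while the reshaped version reports zero.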

Lukas Schmidt  Dec 17, 2019  Mar 13, 2020
Printed
Page 640
Last line

In "a transition (s, r, s')", I believe 'r' should be replaced with 'a'.

Note from the Author or Editor:
Good catch, thanks.

Ian Beauregard  Oct 26, 2020 
Printed
Page 646
5th paragraph

(2nd release) VideoWrapper is not yet implemented. :) Thanks

Note from the Author or Editor:
Thanks for your feedback. Indeed, apparently the VideoWrapper was removed. I removed it from the book.

Haesun Park  Feb 05, 2020  Mar 13, 2020
Printed
Page 648
footnote 20

(2nd release) "Pink is actually a mix of blue and red" should be "Pink is actually a mix of white and red". Thanks

Note from the Author or Editor:
Thanks for your feedback. You're right, I should have written "purple" instead of "pink". Fixed! :)

Haesun Park  Feb 05, 2020  Mar 13, 2020
Printed
Page 655
3rd paragraph

(2nd release) "add_method()" should be "add_batch()". Thanks.

Note from the Author or Editor:
Good catch, thanks! I meant to say "the add_batch() method" but I wrote "the add_method() method". I'm guessing it was 2am. ;-) Fixed!

Haesun Park  Feb 05, 2020  Mar 13, 2020
Printed
Page 663
Title of the last item

Unlike the other items in the list, in the last item, i.e., the Proximal Policy Optimization, the abbreviation comes before the link.

Note from the Author or Editor:
Good point, thanks. Fixed, it looks nicer now. :)

Athanasios Kyritsis  Apr 18, 2020  Aug 14, 2020
Printed
Page 692
12th line from the bottom

(3rd release) 12th line from the bottom: "Click Metric, click None to uncheck all locations" should be "Click Metric, click None to uncheck all metrics". 3rd line from the bottom: "Then click the Location drop-down menu, click None to uncheck all metrics" should be "Then click the Location drop-down menu, click None to uncheck all locations". Thanks

Note from the Author or Editor:
Great catches, thanks! Fixed.

Haesun Park  Mar 03, 2020  Mar 13, 2020
Printed
Page 693
3rd paragraph

(3rd release) I don't understand what's the meaning of "e.g., you can create handy widgets using special comments in your code". Is it https://colab.research.google.com/notebooks/widgets.ipynb? Please let me know about the special comments. Thanks.

Note from the Author or Editor:
Thanks for your feedback. I meant "handy forms". Check out https://homl.info/colabforms I changed the text in parentheses to: (e.g., you can create handy forms using special comments in your code) And "create handy forms" points to https://homl.info/colabforms
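
As an illustration of such special comments, the cell below uses Colab's #@title and #@param annotations. The variable names and default values are made up. In Colab, the cell is rendered with a form next to the code; anywhere else, the annotations are ordinary comments and the cell runs as plain Python.

```python
#@title Training settings (hypothetical form)
learning_rate = 0.01  #@param {type:"number"}
n_epochs = 10  #@param {type:"slider", min:1, max:100, step:1}
optimizer_name = "adam"  #@param ["adam", "sgd", "rmsprop"]
use_augmentation = True  #@param {type:"boolean"}

print(learning_rate, n_epochs, optimizer_name, use_augmentation)
```

Changing a value in the form rewrites the corresponding assignment in the cell.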

Haesun Park  Mar 03, 2020  Mar 13, 2020
Printed
Page 724
Third from last line

Bold face I printed as *I* Just a minor issue — thanks for the great book!

Note from the Author or Editor:
Good catch, thanks! The asciidoc code was: &#x2013;*I*~_m_~ I changed it to: *&#x2013;I*~_m_~ It should be better. :)

Sebastian Huber  Mar 22, 2020  Aug 14, 2020
Printed
Page 724
Solution to Exercise 7

I think both matrices appended to matrix A' should be -I_m.

Note from the Author or Editor:
Great catch. Indeed, you are right, it should be -I_m both at the top and bottom. Thank you!

Ian Beauregard  Aug 18, 2020  Sep 18, 2020
Printed
Page 731
Solution to exercise 3

Maybe I am wrong about this, but I think that there is a mistake in stating "a Logistic Regression classifier will converge to a good solution" on a dataset that is not linearly separable. Logistic regression is linear in the sense that the decision boundary is linear (which is also stated on p. 147). So I don't think that it necessarily finds a good solution on such a dataset. Or am I missing something? Thanks for the great book though - I've learned so much from reading it! :)

Note from the Author or Editor:
Thanks for your feedback. Sorry, you're right, I meant to say a "reasonably good linear decision boundary", not a solution which finds a non-linear decision boundary, as that's impossible, as you rightly point out. Let me explain: suppose you have a linearly separable dataset, except for a single outlier which is "on the wrong side". A Perceptron will just break down, and not converge at all. A Logistic Regression classifier will "do the right thing" and converge despite the outlier. The linear decision boundary it will converge to will often be good enough, but of course this really depends on the dataset and the task. I'll clarify this paragraph. Thanks again!
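
The outlier scenario can be checked with a short NumPy sketch. The 1D dataset, learning rate and iteration counts are assumptions picked for illustration, and both models are written by hand rather than taken from Scikit-Learn. The Perceptron keeps making mistakes in every epoch because no threshold separates the data, while gradient descent on the Logistic Regression log loss settles at a stable solution.

```python
import numpy as np

# 1D toy data: linearly separable except for a single outlier at x = 5.
x = np.array([-4., -3., -2., 2., 3., 4., 5.])
y = np.array([0., 0., 0., 1., 1., 1., 0.])
X = np.column_stack([x, np.ones_like(x)])    # add a bias column

# Perceptron: update the weights on every misclassified instance.
w_p = np.zeros(2)
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        pred = 1.0 if xi @ w_p > 0 else 0.0
        if pred != yi:
            w_p += (yi - pred) * xi
            mistakes += 1
# An epoch without mistakes would require a separating threshold, which does not exist.

# Logistic Regression trained with batch gradient descent on the log loss.
w_l = np.zeros(2)
for step in range(5000):
    p = 1.0 / (1.0 + np.exp(-(X @ w_l)))
    grad = X.T @ (p - y) / len(y)
    w_l -= 0.1 * grad

print(mistakes, np.linalg.norm(grad), w_l)
```

Whether the resulting linear boundary is good enough still depends on the dataset and the task.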

Mona Rahn  Mar 27, 2020  Aug 14, 2020
Printed
Page 745
Answer to question 5

In the sentence "Another benefit is that the alignment scores makes the model...", the word "makes" should be "make".

Note from the Author or Editor:
Good catch, thanks!

Anonymous  Oct 18, 2020 
Printed
Page 753
4th line

(3rd release) ParameterServerStrategy performs data parallelism, but the text says it is "useful to train huge model that don't fit in GPU RAM". Isn't that a description of model parallelism? Thanks

Note from the Author or Editor:
Good catch, thanks. Indeed, I must have lost my train of thought back then, as it really looks like I switched to model parallelism in the very last sentence. :/ Here's a better answer: """ However, it can be useful in some situations, especially when you can take advantage of the asynchronous updates, for example to reduce I/O bottlenecks. This depends on many factors, including hardware, network topology, number of servers, model size, and more, so your mileage may vary. """

Haesun Park  Apr 03, 2020  Sep 18, 2020
Printed
Page 762
Equations C-1 and C-4

As someone already pointed out, the right-hand side of Equation C-4 should be multiplied by -1. Specifically, if you start from Equation C-1 and plug the results from Equation C-3 into it, what you get is Equation C-4, but with the right-hand side multiplied by -1. It is, however, correct to say that the current form of the function written in Equation C-4 should be minimized. Consequently, the correct form (the current form multiplied by -1) should be maximized. Indeed, in the dual form of the SVM problem, we should first find the w and b that minimize the Generalized Lagrangian, with alpha fixed (as was done with the operations leading to Equation C-4). But then, we should find the alpha that MAXIMIZES (rather than minimizes) the Generalized Lagrangian (evaluated at w* and b* as found previously). If you look at Equation C-1, you can see that the second term on the right-hand side is always negative if the constraints are respected. So there is no minimum with respect to alpha.

Note from the Author or Editor:
Thanks a lot for your feedback and for the detailed explanation. I wrongly thought it wasn't an error the first time this was reported, because I figured that minimizing -L was equivalent to maximizing +L, but of course when plugging the results from equation C-3 into the Generalized Lagrangian from equation C-1, we get reversed signs compared to what I had in equation C-4. My sincere apologies to whoever was misled by this error. I've now fixed equation C-4 to invert the signs, I replaced "minimizes" with "maximizes" and I also specified that this is also subject to \sum_{i=1}^m \alpha^{(i)} t^{(i)} = 0. I trust it's all good now. :) Thanks again!
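
For readers who want to check the signs, the substitution can be written out as follows. This is a sketch that assumes Equation C-1 has the standard hard-margin form of the Generalized Lagrangian and that Equation C-3 states the usual stationarity conditions; the notation is chosen to match the discussion above and may differ slightly from the book.

```latex
% Assumed form of Equation C-1 (Generalized Lagrangian, hard margin):
\mathcal{L}(\mathbf{w}, b, \alpha)
  = \frac{1}{2}\mathbf{w}^\top\mathbf{w}
  - \sum_{i=1}^{m} \alpha^{(i)} \left( t^{(i)} \left( \mathbf{w}^\top \mathbf{x}^{(i)} + b \right) - 1 \right),
  \qquad \alpha^{(i)} \ge 0

% Assumed content of Equation C-3 (stationary point with respect to w and b):
\hat{\mathbf{w}} = \sum_{i=1}^{m} \alpha^{(i)} t^{(i)} \mathbf{x}^{(i)},
  \qquad \sum_{i=1}^{m} \alpha^{(i)} t^{(i)} = 0

% Substituting the stationary point into the Lagrangian:
\mathcal{L}(\hat{\mathbf{w}}, \hat{b}, \alpha)
  = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} {\mathbf{x}^{(i)}}^\top \mathbf{x}^{(j)}
  - \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} {\mathbf{x}^{(i)}}^\top \mathbf{x}^{(j)}
  - \hat{b} \sum_{i=1}^{m} \alpha^{(i)} t^{(i)}
  + \sum_{i=1}^{m} \alpha^{(i)}

  = \sum_{i=1}^{m} \alpha^{(i)}
  - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} {\mathbf{x}^{(i)}}^\top \mathbf{x}^{(j)}
```

The last expression is the one to maximize with respect to alpha, subject to \alpha^{(i)} \ge 0 and \sum_{i=1}^m \alpha^{(i)} t^{(i)} = 0. Minimizing its negative is equivalent, which is the form the equation had before the fix.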

Ian Beauregard  Aug 16, 2020  Sep 18, 2020
Printed, PDF
Page 763
equation C-4

The primal problem is to minimize Equation C-1, but a negative sign is missing on page 763 to derive equation C-4. Since our initial target is to minimize the Lagrangian, now we should maximize C-4. At the same time, the second equation in C-3 is a constraint for the dual problem. What is more, the equations of the third bullet times a^(i) are also constraints for the dual problem. The equation in chapter 5 is also incorrect. I have a very small request: when you use a symbol, please define it before use. For example, n_s is not defined on page 763. It should be the number of support vectors found in the problem.

Note from the Author or Editor:
Thanks for your excellent feedback, I really appreciate it!

> The primal problem is to minimize Equation C-1, but a negative sign is missing on page 763 to derive equation C-4. Since our initial target is to minimize the Lagrangian, now we should maximize C-4.

Unless I overlooked something, I think the sign is correct in equation C-4: in the sentence following this equation, I mentioned that the goal is to minimize the loss, not maximize it. We could reverse the sign and try to maximize the equation instead, but it's really equivalent.

> At the same time, the second equation in C-3 is a constraint for the dual problem. What is more, the equations of the third bullet times a^(i) are also constraints for the dual problem.

Good catch, thanks a lot. I need to add "and \sum_{i=1}^m \alpha^{(i)} t^{(i)} = 0" at the end of equation C-4.

> The equation in chapter 5 is also incorrect.

Yes, I'll add the missing constraint there as well.

> I have a very small request: when you use a symbol, please define it before use. For example, n_s is not defined on page 763. It should be the number of support vectors found in the problem.

Indeed, I try to always define the symbols I use, but apparently I missed this one. Please tell me if you find any other missing definition. Thanks again! :)

Anonymous  Jul 18, 2020  Aug 14, 2020
Printed
Page 769
Sentence below Figure D-2,

In the autodiff appendix, the sentence below figure D-2 should say “To compute df/dy”, not df/dx.

Note from the Author or Editor:
Thanks for your feedback. Indeed, this was an error I fixed in March 2020, so hopefully the latest releases should be okay now: it should read "To compute df/dy(3,4)...", not "df/dx".
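
Why a second pass is needed for df/dy can be seen with a minimal dual-number sketch. The Dual class is a hypothetical helper written for this note, and the function f(x, y) = x²y + y + 2 is assumed to be the example under discussion. Each pass through the function perturbs one input only, so one pass yields the partial derivative with respect to x and another yields the one with respect to y.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """A dual number real + eps * e, where e**2 == 0."""
    real: float
    eps: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: the e**2 term vanishes.
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

    __rmul__ = __mul__

def f(x, y):
    return x * x * y + y + 2   # assumed example function

df_dx = f(Dual(3.0, 1.0), Dual(4.0)).eps   # first pass: perturb x only
df_dy = f(Dual(3.0), Dual(4.0, 1.0)).eps   # second pass: perturb y only
print(df_dx, df_dy)   # 24.0 10.0
```

With n inputs, this forward-mode approach needs n passes, which is the motivation for reverse-mode autodiff.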

Kenny Song  Apr 20, 2020  Aug 14, 2020
Printed
Page 793
10th line from the bottom

(3rd release) "the output of the addition operation" should be "the output of the power operation". Thanks.

Note from the Author or Editor:
Great catch, thanks.

Haesun Park  Apr 03, 2020  Aug 14, 2020
ePub, Mobi, Safari Books Online
Page 1212
text

Current copy: "takes care of load balancing and scaling for you. It take JSON requests containing the input data (e.g., of a district)". Suggested: "you. It take JSON requests containing" should be "you. It takes JSON requests containing".

Note from the Author or Editor:
Good catch, thanks. Fixed!

Anonymous  Jan 22, 2020  Mar 13, 2020
Other Digital Version
2294-2340
Chapter 3, Multiclass Classification, paragraph 2, and the section on SGDClassifier for multiclass classification

There is some conflicting information between the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow and the Scikit-Learn documentation. In chapter 3, under Multiclass Classification, the author states twice that the stochastic gradient descent classifier (SGDClassifier) can handle multiclass classification problems directly, without training multiple binary classifiers using One-vs-Rest/All. This is stated in the second paragraph as well as one or two pages later. The Scikit-Learn documentation for SGDClassifier directly contradicts this. It states, “SGDClassifier supports multi-class classification by combining multiple binary classifiers in a “one versus all” (OVA) scheme” (https://scikit-learn.org/stable/modules/sgd.html). Also, the statement about Logistic Regression being only a binary classifier seems to contradict the Scikit-Learn documentation as well. Using the multinomial option, the LR model can learn a true multinomial distribution for multiclass problems (https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression). Either the book or the Scikit-Learn documentation seems to be incorrect.

Note from the Author or Editor:
Great feedback, thanks a lot! Regarding the LogisticRegression class, the default value for the multi_class argument changed after the 2nd edition was published (in version 0.22) from 'ovr' to 'auto': so indeed, the new default multi-class behavior is to learn a true multinomial distribution (the old behavior was to train multiple binary classifiers and to use the OvR strategy). I'll update the book for future releases. Regarding the SGDClassifier class, however, it really seems to be a mistake on my part. :( I tried to search for the origin of my error, perhaps a previous version used a different approach, but it seems that the SGDClassifier behavior has been the same since at least Scikit-Learn 0.17. I'm really sorry about this, I'll update the book now for future releases. Thanks again for your contribution.

Ryan Boch  Jan 08, 2020  Mar 13, 2020
ePub
Page 8499
Chapter 12, custom metrics

Where "precision" is defined in the code, the variable should be named "p" to be consistent with the code that follows, i.e., >>> p = keras.metrics.Precision() ... etc. Then the later call >>> p.result() will work.

Note from the Author or Editor:
Great catch! For clarity, I decided to name the variable `precision` everywhere. Thanks for your feedback!

Mohammed El-Beltagy  Oct 27, 2019  Nov 22, 2019
ePub, Mobi, Safari Books Online
Page 11387
Ch 15

In Exercise 10 in Chapter 15, there is a bad URL that leads to a 404: "Download the Bach chorales dataset and unzip it." The link goes to https://homl.info/bach, which is not found.

Note from the Author or Editor:
Thanks for your feedback, I fixed the broken URL, it works now.

Anonymous  Jan 22, 2020  Mar 13, 2020