Version | Location | Description | Submitted By | Date submitted | Date corrected

Page ?
Chapter 2, Running the Experiment, 4th paragraph 
The asymptote of the optimal action is given as:
1/n + (1−ε)/n = (2−ε)/n
Perhaps I am missing something, but I think the asymptotic behavior should be:
ε/n + (1−ε) = (ε(1−n) + n)/n
which I got from Algorithm 2-1:
ε proportion of the time, we pick the optimum solution with a probability of 1/n.
During the remaining (1−ε) proportion of the time, we always pick the optimum solution (in the long-running asymptote).
These two equations [(2−ε)/n or (ε(1−n) + n)/n] will give very different results. For example, if n = 10, the result as ε → 0 (perfect exploitation) for the given equation is 1/5, while my equation always achieves 1, a perfect selection of the optimal solution, irrespective of the number of actions possible.
Also, a more minor detail: in line 7 of Algorithm 2-1, I think the denominator should be N(a) (lowercase a, not capital A), to denote that we are dividing by the number of times the action was taken, not the total number of actions taken so far.
Note from the Author or Editor: Page 30, Algorithm 2-1:
In step 7, replace "N(A)" with "N(a)" (lowercase a).
Page 32, last paragraph which begins with: "Looking toward the end of the experiment..."
The equation near the bottom should be replaced. The text should read: "The asymptote of the optimal action is ε/n + (1−ε), where ..."
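As a quick numerical check of the corrected asymptote (a minimal sketch, not from the book; the function and variable names are illustrative):

    # Asymptotic probability that epsilon-greedy picks the optimal action,
    # once the value estimates have converged: explore with probability
    # epsilon (uniform over n actions), exploit with probability 1 - epsilon.
    def asymptote(epsilon, n):
        return epsilon / n + (1 - epsilon)

    print(asymptote(0.1, 10))  # 0.91
    print(asymptote(0.0, 10))  # 1.0: perfect exploitation always finds the optimum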

Kenji Oman 
Dec 22, 2020 
Jan 13, 2023 

Page 48
Figure 2-8 
The "Right" and "Left" labels in Figure 2-8 need to be reversed.
Note from the Author or Editor: Page 48, Figure 2-8.
The "Right" and "Left" labels on all of the four images are the wrong way around. "Left" should be on the top, "Right" should be on the bottom.

Andrew 
Mar 29, 2021 
Jan 13, 2023 

Page 15
Chapter 1 > Fundamental Concepts in Reinforcement Learning > The First RL Algorithm > Prediction error 
The sentence “Knowledge of the previous state and the prediction error helps alter the weights. Multiplying these together, the result is δx(s)=[0,1]. Adding this to the current weights yields w=[1,0].”
I think the result of this formula `δx(s)` should be [0,−1] instead of [0,1], since the prior sentence says, “The value of Equation 1-2, δ, is equal to −1”. Considering the state x(s) = [0,1], multiplying δx(s) would yield [0,−1]. Then it would make sense that adding [0,−1] to the prior weights w = [1,1] yields the new weights w = [1,0].
Note from the Author or Editor: Page 15, the sentence that currently reads "Multiplying these together, the result is δx(s)=[0,1]."
Should be: "Multiplying these together, the result is δx(s)=[0,−1]."
Note the minus sign at the end.
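For concreteness, the corrected arithmetic using the values quoted above (a minimal sketch, not code from the book):

    import numpy as np

    delta = -1                # prediction error delta, from Equation 1-2
    x_s = np.array([0, 1])    # feature vector of the previous state, x(s)
    w = np.array([1, 1])      # current weights

    update = delta * x_s      # [0, -1], the corrected result
    w = w + update            # [1, 0], matching the book's final weights
    print(update, w)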

Nhan Tran 
Dec 28, 2022 
Jan 13, 2023 
Printed 
Page 29
Equation 2-3 
Equation 2-3 should be: "r = r + α(r′ − r)"; note that the ′ should be on the first r.
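As a sanity check of the corrected update (a minimal sketch; assuming r is the running reward estimate and r′ the newly observed reward, names illustrative):

    # Move the running estimate r a fraction alpha toward the newly
    # observed reward r_prime, per the corrected Equation 2-3.
    def update(r, r_prime, alpha=0.1):
        return r + alpha * (r_prime - r)

    r = 0.0
    for r_prime in [1.0, 1.0, 1.0]:
        r = update(r, r_prime)
    print(r)  # ~0.271, creeping toward the observed reward of 1.0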

Phil Winder 
Jan 02, 2023 
Jan 13, 2023 

Page 54
Algorithm 2-4 
The algorithm exits the loop when Δ is less than or equal to θ, but Δ is always calculated as:
Δ ← max(Δ, (anything))
Δ will never get smaller than its initial value. If that initial value is greater than θ, the algorithm will never exit its loop.
Note from the Author or Editor: Page 54, Algorithm 2-4:
1. From the end of step 2, remove ", Δ ← 0"
2. Insert a new step between 3 and 4 (call it 3a, so that the references in the text remain correct). Insert: "3a. Δ ← 0" and indent it to align with the word "loop" on line 4.
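The fix matters because Δ tracks the largest value change within a single sweep, so it must be reset at the top of every sweep. A minimal sketch of the corrected loop structure (the toy chain environment and backup are illustrative, not the book's code):

    import numpy as np

    # Toy setup so the sketch runs: a 5-state chain where each backup
    # averages the neighbours' values plus a reward of 1.
    n_states, gamma, theta = 5, 0.9, 1e-6

    def backup(s, V):
        left, right = max(s - 1, 0), min(s + 1, n_states - 1)
        return 1.0 + gamma * 0.5 * (V[left] + V[right])

    V = np.zeros(n_states)
    while True:
        delta = 0.0                            # step 3a: reset every sweep
        for s in range(n_states):
            v = V[s]
            V[s] = backup(s, V)
            delta = max(delta, abs(v - V[s]))
        if delta <= theta:                     # termination is now reachable
            break
    print(V)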

Patrick Doyle 
Jun 04, 2022 
Jan 13, 2023 

Page 62, 63, 64
(page 62) Equation 3-5, (page 63) Algorithm 3-1, (page 64) 2nd paragraph, 2nd line 
In the Q-learning formula, the argmax should be just max.
Note from the Author or Editor: (page 62) Equation 3-5, (page 63) Algorithm 3-1, (page 64) 2nd paragraph, 2nd line:
Replace "argmax" with "max"
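The distinction: max returns the highest next-state action value (a scalar used in the backup target), while argmax returns the index of that action. A minimal sketch of the corrected update (variable names assumed, not from the book):

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # The target uses the *value* of the best next action (max),
        # not its index (argmax).
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])

    Q = np.zeros((3, 2))    # toy table: 3 states, 2 actions
    q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
    print(Q)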

Manuel 
Mar 15, 2021 
Jan 13, 2023 

Page 65
Algorithm 3-2 
Step 6 states:
Choose a from s using π, breaking ties randomly
Since this is inside the loop, the value of "a" updated at the end of the loop is obliterated by choosing a new value for "a".
Note from the Author or Editor: Page 65, Algorithm 3-2:
1. Change step 4 to say "s, a ← Initialize s from the environment and choose a using π"
2. Remove step 6 entirely
3. Update all subsequent numbers to be contiguous
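The corrected control flow, sketched below, chooses a once before the loop and thereafter carries the next action forward, which is what makes SARSA on-policy (the toy chain environment and ε-greedy policy are illustrative, not the book's code):

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 5, 2
    alpha, gamma, epsilon = 0.1, 0.99, 0.1
    Q = np.zeros((n_states, n_actions))

    def step(s, a):
        # action 1 moves right, action 0 moves left; reward 1 at the far end
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s_next, float(s_next == n_states - 1), s_next == n_states - 1

    def policy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        best = np.flatnonzero(Q[s] == Q[s].max())
        return int(rng.choice(best))    # break exact ties randomly

    for _ in range(200):
        s, done = 0, False
        a = policy(s)                   # corrected step 4: choose a once, here
        while not done:                 # old step 6 is gone: a is not re-chosen
            s_next, r, done = step(s, a)
            a_next = policy(s_next)
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next       # carry a_next forward to the next step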

Patrick Doyle 
Jun 05, 2022 
Jan 13, 2023 

Page 123
Algorithm 5-1, step 7 
Missing ln when calculating the gradient of π; it should have been:
θ ← θ + αγ^t G ∇ln π(a ∣ s, θ)
Note from the Author or Editor: Page 123, Algorithm 5-1:
Add "ln" to step 7, so that the equation reads as: "θ ← θ + αγ^t G ∇ln π(a ∣ s, θ)"
Page 125, Algorithm 5-2:
Add "ln" to step 9, so that the equation reads as: "θ ← θ + αγ^t δ ∇ln π(a ∣ s, θ)"
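The ln matters because the policy-gradient theorem is expressed through the score function ∇ln π, which for a softmax policy has a simple closed form. A minimal sketch of the corrected step 7 update (the linear softmax policy and all names are assumptions, not the book's code):

    import numpy as np

    def softmax(h):
        e = np.exp(h - h.max())
        return e / e.sum()

    def grad_ln_pi(theta, a):
        # Score function of a softmax policy with one preference per action:
        # d/dtheta ln pi(a | theta) = one_hot(a) - pi
        return np.eye(len(theta))[a] - softmax(theta)

    theta = np.zeros(3)        # toy policy over 3 actions
    alpha, gamma = 0.1, 0.99
    t, a, G = 0, 1, 1.0        # one REINFORCE update at time t with return G
    theta += alpha * gamma**t * G * grad_ln_pi(theta, a)
    print(theta)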

Anonymous 
Aug 08, 2022 
Jan 13, 2023 

Page 129
Algorithm 5-3 
1. The variable t isn't being updated after each step.
2. At step 6, there's no need to break ties randomly, since we aren't dealing with a deterministic action-value function, but with a stochastic policy that outputs probabilities.
3. At step 8, V(s, θ) should have been V(s, w) (the weights "w" belong to the critic model V, while the weights "θ" belong to the actor model π, as denoted in step 1).
Similar errors appear on page 134 (Algorithm 5-4) at the corresponding steps (6 and 8).
Note from the Author or Editor: On page 129, Algorithm 5-3:
1. Add a 13th step to update t: "t ← t + 1", indented to align with line 12
2. Step 6: Remove ", breaking ties randomly" from the text
3. Step 8: change "V(s, θ)" to "V(s, w)"
On page 134, Algorithm 5-4:
1. Add a 15th step to update t: "t ← t + 1", indented to align with line 12
2. Step 6: Remove ", breaking ties randomly" from the text
3. Step 8: at the end of the line, change "V(s, θ)" to "V(s, w)"
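For orientation, a sketch of how the three fixes fit together in a one-step actor-critic loop (the tabular critic, softmax actor, and toy chain environment are all illustrative stand-ins, not the book's code):

    import numpy as np

    n_states, n_actions = 4, 2
    w = np.zeros(n_states)                       # critic weights: V(s, w)
    theta = np.zeros((n_states, n_actions))      # actor weights: pi(a | s, theta)
    alpha_w, alpha_t, gamma = 0.1, 0.01, 0.99
    rng = np.random.default_rng(0)

    def env_step(s, a):
        # toy chain: action 1 moves right, reward 1 on reaching the far end
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s_next, float(s_next == n_states - 1), s_next == n_states - 1

    s, t, done = 0, 0, False
    while not done:
        pi = np.exp(theta[s]) / np.exp(theta[s]).sum()
        a = int(rng.choice(n_actions, p=pi))     # step 6: sample, no tie-breaking
        s_next, r, done = env_step(s, a)
        delta = r + (0.0 if done else gamma * w[s_next]) - w[s]   # step 8: V(s, w)
        w[s] += alpha_w * delta
        theta[s] += alpha_t * gamma**t * delta * (np.eye(n_actions)[a] - pi)
        s = s_next
        t += 1                                   # the missing step: t ← t + 1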

Anonymous 
Aug 09, 2022 
Jan 13, 2023 