Lemma 7.5.2.4. Let λ_t ≥ 0 be a step size satisfying

$eq2202.jpg$

Then the Q-learning algorithm in Equation (7.22) converges almost surely to the optimal Q-function.
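The convergence claim can be checked numerically on a toy problem. The following is a minimal sketch, not the construction used in the proof: it runs tabular Q-learning on a hypothetical two-state, two-action MDP and compares the result with the optimal Q-function obtained by value iteration. The update rule is the standard tabular one and is assumed here to match Equation (7.22). The step size λ_t = 1/n_t(s, a), where n_t(s, a) counts visits to the pair (s, a), is an assumption chosen because it satisfies the usual Robbins-Monro requirements (the step sizes sum to infinity while their squares have a finite sum). The transition matrix `P`, rewards `R`, discount factor `gamma`, and all function names are illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=1000):
    """Reference solution: optimal Q-function via dynamic programming."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        # P has shape (S, A, S); Q.max(axis=1) has shape (S,)
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q

def q_learning(P, R, gamma, steps=100_000, seed=0):
    """Tabular Q-learning under a uniformly random (non-optimal) behavior policy."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        a = rng.integers(n_actions)               # explore every action
        s_next = rng.choice(n_states, p=P[s, a])  # sample the next state
        visits[s, a] += 1
        lam = 1.0 / visits[s, a]                  # assumed step size 1/n(s, a)
        target = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += lam * (target - Q[s, a])
        s = s_next
    return Q

# Hypothetical MDP: P[s, a, s'] are transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.5

Q_star = value_iteration(P, R, gamma)
Q_hat = q_learning(P, R, gamma)
max_error = np.max(np.abs(Q_hat - Q_star))
```

The behavior policy is deliberately non-optimal: actions are drawn uniformly at random, which keeps every state-action pair visited infinitely often. With a step size that decays this slowly the iterates approach the reference solution only gradually, so `max_error` should be small but not zero after a finite number of steps.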

7.5.3 Connection to Differential Dynamic Programming

Q-learning is a technique for computing an optimal policy for a controlled Markov chain from observations of the system operating under a non-optimal policy. Many results have been obtained for models with finite state and action spaces. More recently, [111] established connections between Q-learning and nonlinear control of continuous-time models with general pay-off functions. The authors show that the Hamiltonian appearing in nonlinear control theory is essentially the same as the Q-function that is the object of interest in ...

