Lemma 7.5.2.4. Let $\lambda_t \ge 0$ be a step size satisfying

$$\sum_{t} \lambda_t = \infty, \qquad \sum_{t} \lambda_t^2 < \infty.$$

The Q-learning algorithm in Equation (7.22) converges almost surely to the value.
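
The role of the step-size conditions can be illustrated with a small numerical sketch. Equation (7.22) is not reproduced in this section, so the code below assumes the standard tabular Q-learning update, applied to a hypothetical two-state, two-action chain invented for illustration. The step size is taken as $\lambda = n^{-0.6}$, where $n$ counts visits to the current state-action pair; this is one common choice that is not summable but is square-summable, and the per-pair counter (rather than a global time index) is also an assumption. Actions are drawn uniformly at random, so the system is observed under a non-optimal policy.

```python
import random

# Hypothetical toy chain (not from the book): 2 states, 2 actions.
# Action 0 keeps the current state, action 1 switches to the other state.
# The mean reward is 1 when the next state is 1, and 0 otherwise.
# With discount 0.5 the optimal Q-values work out to the table below.
Q_STAR = [[1.0, 2.0], [2.0, 1.0]]  # indexed as Q_STAR[state][action]


def q_learning(num_steps=50_000, gamma=0.5, exponent=0.6,
               noise_std=0.1, seed=0):
    """Tabular Q-learning with step size n**(-exponent).

    For 0.5 < exponent <= 1 the step sizes sum to infinity while
    their squares have a finite sum.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]
    visits = [[0, 0], [0, 0]]
    state = 0
    for _ in range(num_steps):
        # Behaviour policy: uniformly random, hence non-optimal.
        action = rng.randrange(2)
        next_state = state if action == 0 else 1 - state
        # Noisy reward observation around the mean reward.
        reward = float(next_state == 1) + rng.gauss(0.0, noise_std)

        # Step size based on the visit count of this (state, action) pair.
        visits[state][action] += 1
        step = 1.0 / visits[state][action] ** exponent

        # Standard Q-learning update (assumed form of the book's update).
        target = reward + gamma * max(q[next_state])
        q[state][action] += step * (target - q[state][action])
        state = next_state
    return q


if __name__ == "__main__":
    estimate = q_learning()
    for s in range(2):
        for a in range(2):
            print(f"Q({s},{a}) = {estimate[s][a]:.3f}  "
                  f"(optimal {Q_STAR[s][a]:.1f})")
```

Under these assumptions the estimates should settle close to the optimal table. A constant step size would instead leave a residual fluctuation driven by the reward noise, which is what the square-summability condition rules out.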

7.5.3 Connection to Differential Dynamic Programming

Q-learning is a technique for computing an optimal policy for a controlled Markov chain from observations of the system operated under a non-optimal policy. Many interesting results have been obtained for models with finite state and action spaces. Recently, [111] established connections between Q-learning and nonlinear control of continuous-time models with general pay-off functions. The authors show that the Hamiltonian appearing in nonlinear control theory is essentially the same as the Q-function that is the object of interest in ...
