Q-learning
Now we come to the real meat of our system: the Q-value function. It gives the expected cumulative reward for taking a given action, a, in a given state, s. We are, of course, interested in finding the optimal Q-value function. This means that not only do we have a given pair (s, a), but we also have trainable parameters, the weights and biases of our DQN, which we update as we train our network. These parameters allow us to define an optimal policy, that is, a function that maps any given state to the action the agent should take among those available. This yields an optimal Q-value function, one that theoretically tells our agent what the best course of action is at any step. A bad football analogy might be the Q-value function ...
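To make the relationship between a Q-value function and the policy derived from it concrete, here is a minimal sketch. It is not the DQN described above: it swaps the network's weights and biases for a plain lookup table (tabular Q-learning), and the five-state chain environment, the constants, and the function names `step`, `train`, and `greedy_policy` are all hypothetical, chosen only for illustration.

```python
import random

# Hypothetical toy environment: a chain of states 0..4, where state 4 is terminal.
N_STATES = 5
ACTIONS = (0, 1)   # 0 = step left, 1 = step right
GAMMA = 0.9        # discount factor (assumed value)
ALPHA = 0.5        # learning rate (assumed value)


def step(state, action):
    """Move along the chain; reward 1.0 only on reaching the last state."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, seed=0):
    """Learn a table q[state][action] with the Q-learning update rule."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore at random; Q-learning is off-policy, so the table
            # still moves toward the optimal Q-values.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Target: immediate reward plus discounted best future value.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q, state):
    """The policy implied by q: pick the action with the highest Q-value."""
    return max(ACTIONS, key=lambda a: q[state][a])


if __name__ == "__main__":
    q_table = train()
    for s in range(N_STATES - 1):
        print(s, [round(v, 3) for v in q_table[s]], greedy_policy(q_table, s))
```

In a DQN, the table `q` would be replaced by a network that takes a state and outputs one Q-value per action, and the update line would become a gradient step on the network's parameters; the greedy policy is read off the Q-values in the same way.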