September 2018
Intermediate to advanced
296 pages
9h 10m
English
Once the larger supervised learning policy network is trained, we further improve the model by having the RL policy network play against a previous version of itself. The weights of the network are updated using a method called policy gradients, which is a variant of gradient descent for vanilla neural networks. Formally speaking, the gradient update rule for the weights of our RL policy network can be expressed as follows:

Here,
are the weights of the RL policy network, , and is the expected reward at ...
Read now
Unlock full access