A brief recap of RL
At the start, a policy is initialized randomly and used to interact with the environment, either for a fixed number of steps or for entire trajectories, to collect data. At each interaction, the state visited, the action taken, and the reward obtained are recorded. Together, these records describe how the agent's actions affected the environment. Then, to improve the policy, the backpropagation algorithm computes the gradient of the loss function with respect to each weight of the network; the loss is chosen so that minimizing it moves the predictions toward a better estimate. These gradients are then applied with a stochastic gradient descent optimizer. This process (gathering data from the environment and optimizing the neural network ...
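The collect-then-update cycle described in this recap can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the book: the four-cell corridor task, the names `collect_trajectory`, `update_policy`, and `train`, and the hyperparameters. To stay dependency-free, it swaps the neural network for a table of softmax logits and uses a hand-derived REINFORCE-style gradient where backpropagation would normally compute it.

```python
import math
import random

N_STATES = 4      # hypothetical corridor: cells 0..3, cell 3 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Toy dynamics (an assumption): reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def collect_trajectory(theta, rng, max_steps=50):
    """Run one episode, recording (state, action, reward) at every step."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[state])
        action = rng.choices(ACTIONS, weights=probs)[0]
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def update_policy(theta, trajectory, lr=0.1, gamma=0.9):
    """One gradient descent step on the loss -sum_t G_t * log pi(a_t | s_t)."""
    # Discounted return G_t for every step, computed backwards.
    g, returns = 0.0, []
    for _, _, reward in reversed(trajectory):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()

    # Accumulate the loss gradient for each logit before changing any of them.
    grad = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for (state, action, _), g_t in zip(trajectory, returns):
        probs = softmax(theta[state])
        for a in ACTIONS:
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            grad[state][a] += -g_t * grad_log_pi

    # Plain gradient descent update; each episode acts as one stochastic sample.
    for s in range(N_STATES):
        for a in ACTIONS:
            theta[s][a] -= lr * grad[s][a]


def train(episodes=2000, seed=0):
    """Alternate between gathering data and optimizing the policy."""
    rng = random.Random(seed)
    # Random initialization of the policy parameters.
    theta = [[rng.uniform(-0.1, 0.1) for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        update_policy(theta, collect_trajectory(theta, rng))
    return theta
```

Calling `train()` returns the learned logits, and `softmax(theta[s])` gives the action probabilities in cell `s`. With a neural network policy, `update_policy` would instead build the same loss from the recorded data and let an automatic differentiation library compute the gradients.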