2. REINFORCE

This chapter introduces the first algorithm of the book, REINFORCE.

The REINFORCE algorithm, introduced by Ronald J. Williams in his 1992 paper “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning” [148], learns a parametrized policy that maps states to action probabilities. Agents use this policy directly to act in an environment.

The key idea is that during learning, actions that resulted in good outcomes should become more probable: these actions are positively reinforced. Conversely, actions that resulted in bad outcomes should become less probable. If learning is successful, then over the course of many iterations the action probabilities produced by the policy shift to a distribution that ...
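To make the reinforcement idea concrete, here is a minimal sketch on a toy problem. It is not the book's implementation. It assumes a stateless two-action environment (a two-armed bandit) and a softmax policy over one learnable preference per action. The names `softmax`, `train_bandit`, and `reward_probs`, the +1/-1 reward scheme, and the learning rate are all assumptions chosen for illustration. After each action, the preference of the chosen action is nudged in proportion to the reward times the gradient of the log-probability of that action, so rewarded actions become more probable and penalized actions less probable.

```python
import math
import random


def softmax(preferences):
    """Convert raw action preferences into action probabilities."""
    highest = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - highest) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(reward_probs=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE-style update on a hypothetical two-armed bandit.

    reward_probs[a] is the assumed chance that action a yields +1;
    otherwise the reward is -1.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(reward_probs)  # policy parameters, one per action

    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else -1.0

        # For a softmax policy, d/d theta_k of log pi(action) equals
        # 1[k == action] - probs[k]. Scaling it by the reward raises the
        # probability of rewarded actions and lowers that of penalized ones.
        for k in range(len(theta)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_prob

    return softmax(theta)


final_probs = train_bandit()
print(final_probs)
```

Under these assumptions, the printed distribution should place most of its probability on the second action, which is the one more likely to be rewarded. A full agent would condition the policy on states and use returns accumulated over whole episodes rather than single-step rewards.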
