Chapter 4. Policy Training and Human Guidance
In this chapter, we build on the foundations established in Chapters 1 and 3 to examine policy training: the process through which an agent learns or refines a policy mapping states to actions so as to maximize expected reward. In the context of RLHF, policy training operationalizes the reward model, updating the policy model's behavior according to human-derived feedback. We discuss key system design considerations and algorithmic improvements that address common challenges in reinforcement learning, and we review the most widely used policy training algorithms, including deep RL approaches such as Deep Q-Networks (DQN), which extend Q-learning to high-dimensional environments. ...
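To make the DQN connection concrete, the sketch below shows a single DQN update step, assuming PyTorch. The network sizes, hyperparameters, and randomly generated toy transitions are illustrative placeholders standing in for a real environment and replay buffer, not the chapter's implementation. The online network is regressed toward the temporal-difference target r + γ · max_a' Q_target(s', a'), computed from a periodically synced target network.

```python
# A minimal sketch of one DQN update step (assumes PyTorch is installed).
# Dimensions, learning rate, and the toy transition batch are hypothetical.
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # illustrative sizes and discount

# Online Q-network: maps a state to one Q-value per action.
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
# Target network: a lagged copy of the online network, synced periodically.
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A toy batch of transitions (s, a, r, s', done); in practice these
# would be sampled from a replay buffer.
batch = 32
s = torch.randn(batch, obs_dim)
a = torch.randint(0, n_actions, (batch,))
r = torch.randn(batch)
s_next = torch.randn(batch, obs_dim)
done = torch.zeros(batch)

# TD target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states.
with torch.no_grad():
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

# Predicted Q(s, a) for the actions actually taken in the batch.
pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

# Minimize the squared TD error and update only the online network.
loss = nn.functional.mse_loss(pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The separate target network is the design choice that distinguishes DQN from naive Q-learning with function approximation: holding the regression target fixed between syncs keeps the bootstrapped targets from shifting on every gradient step, which stabilizes training.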