Up to now, the book has focused on model-based and model-free methods. The algorithms built on these methods all begin by estimating the state or state-action values for the current policy. In a second step, they use these estimates to find a better policy by choosing the best action in each state. The two steps repeat in a loop until the values stop improving. In this chapter, you look at a different approach for learning optimal policies, by directly ...
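The two-step loop described above (estimate values for the current policy, then act greedily on those values) can be sketched on a toy problem. This is a minimal illustration, not code from the book: the three-state chain, the `step` model, and all function names are hypothetical choices made for this example.

```python
# Hypothetical 3-state chain MDP (an assumption for illustration):
# states 0 and 1 are non-terminal, state 2 is terminal.
# Actions: 0 = move left, 1 = move right. Reaching state 2 pays reward 1.
GAMMA = 0.9
N_ACTIONS = 2
TERMINAL = 2
N_STATES = 3


def step(state, action):
    """Deterministic model of the toy chain: return (next_state, reward)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def q_value(values, state, action):
    """One-step lookahead: reward plus discounted value of the next state."""
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]


def evaluate(policy, tol=1e-8):
    """Step 1: estimate the state values for the current policy."""
    values = [0.0] * N_STATES  # the terminal state's value stays at 0
    while True:
        delta = 0.0
        for s in range(TERMINAL):
            new_v = q_value(values, s, policy[s])
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values


def improve(values):
    """Step 2: choose the best action in each state given the estimates."""
    return [
        max(range(N_ACTIONS), key=lambda a: q_value(values, s, a))
        for s in range(TERMINAL)
    ]


def policy_iteration():
    """Repeat both steps until the policy no longer changes."""
    policy = [0] * TERMINAL  # start by always moving left
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


if __name__ == "__main__":
    final_policy, final_values = policy_iteration()
    print(final_policy, final_values)
```

In this sketch the policy is never represented directly; it is always derived from the value estimates, which is the pattern the paragraph above attributes to the earlier chapters.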
8. Policy Gradient Algorithms
Get Deep Reinforcement Learning with Python: RLHF for Chatbots and Large Language Models now with the O’Reilly learning platform.