4. Model-Free Approaches

The previous chapter looked at dynamic programming, which is used when you know the model dynamics p(s′, r | s, a) and can exploit that knowledge to "plan" the optimal actions. This is known as the planning problem. This chapter shifts focus to learning problems—that is, setups where the model dynamics (also called transition dynamics) are not known. You will learn to estimate state values and state-action values by sampling—that is, by collecting experience while following some policy in the real world or running ...
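To make the idea of estimating values purely from sampled experience concrete, here is a minimal sketch (not from the book) of first-visit Monte Carlo prediction on a hypothetical five-state random walk. The agent never queries the transition dynamics directly; it only generates episodes under a fixed random policy and averages the observed returns per state.

```python
import random

# Hypothetical 5-state random walk: states 0..4, start at state 2.
# Episodes terminate at state 0 (return 0) or state 4 (return 1).
# The learner only samples trajectories; it never reads these dynamics.
def sample_episode(start=2):
    states, state = [], start
    while state not in (0, 4):
        states.append(state)
        state += random.choice((-1, 1))  # uniform random policy
    return states, 1.0 if state == 4 else 0.0

def mc_state_values(num_episodes=50_000, seed=0):
    random.seed(seed)
    totals = {s: 0.0 for s in (1, 2, 3)}
    counts = {s: 0 for s in (1, 2, 3)}
    for _ in range(num_episodes):
        states, ret = sample_episode()
        for s in set(states):  # first-visit Monte Carlo update
            totals[s] += ret
            counts[s] += 1
    # Estimated value = average return following the first visit to s
    return {s: totals[s] / counts[s] for s in totals}

values = mc_state_values()
```

For this walk the true values under the random policy are 0.25, 0.5, and 0.75 for states 1, 2, and 3, and the sampled averages converge toward them as the episode count grows; the same average-of-returns idea underlies the model-free methods developed in this chapter.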