There are numerous approaches to solving RL problems, all of which involve finding a rule (a policy) for the agent's optimal behavior:
- Dynamic programming (DP) methods make the often unrealistic assumption of complete knowledge of the environment's dynamics, but they serve as the conceptual foundation for most other approaches.
- Monte Carlo (MC) methods learn about the environment, and the costs and benefits of different decisions, by sampling complete state-action-reward sequences (episodes).
- Temporal difference (TD) learning significantly improves sample efficiency by learning from shorter sequences, down to individual steps. To this end, it relies on bootstrapping: updating its estimates based in part on its own earlier estimates (see the sketch after this list).
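To make bootstrapping concrete, here is a minimal sketch of tabular TD(0) value estimation (not from the source). The `env` object with `reset()` and `step()` returning `(next_state, reward, done)`, and the `policy` callable, are assumed interfaces chosen for illustration:

```python
from collections import defaultdict

def td0_value_estimation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): after every single step, nudge V(s) toward the
    bootstrapped target r + gamma * V(s'), instead of waiting for the
    full episode return as Monte Carlo methods would."""
    V = defaultdict(float)  # value estimates, default 0.0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()          # assumed interface: returns initial state
        done = False
        while not done:
            action = policy(state)   # assumed interface: state -> action
            next_state, reward, done = env.step(action)
            # Bootstrapping: the target reuses the current estimate V[next_state].
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

An MC variant of this routine would instead record the whole episode and update each V(s) toward the full observed return, which is exactly the extra sample cost that TD's per-step bootstrapped update avoids.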
When the RL problem outlined ...