Dynamic Programming (DP) refers to a collection of algorithms that can compute an optimal policy given a perfect model of the environment in the form of a Markov Decision Process (MDP). The key idea of DP, and of reinforcement learning in general, is the use of value functions over states and actions to organize the search for good policies.
DP methods solve MDPs by iterating two procedures, called policy evaluation and policy improvement:
- The policy evaluation algorithm solves the Bellman expectation equation iteratively. Since convergence is guaranteed only as the number of iterations k → ∞, in practice we settle for a good approximation by imposing a stopping condition, e.g., halting once the largest change in any state value falls below a small threshold θ.
- The policy improvement algorithm then makes the policy greedy with respect to the value function produced by policy evaluation, selecting in each state the action with the highest expected return (see the sketch after this list).
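
The sketch below illustrates both steps for a generic tabular MDP. It is a minimal illustration, not a definitive implementation: the transition-model layout `P[s][a] = [(prob, next_state, reward), ...]`, the discount factor `gamma`, and the stopping threshold `theta` are assumptions introduced here for the example, not taken from the text above.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman expectation backup until the largest
    value change across states falls below the stopping threshold theta
    (a practical substitute for running k -> infinity iterations)."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    v_new += action_prob * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stopping condition
            return V

def policy_improvement(P, V, gamma=0.9):
    """Make the policy greedy with respect to the current value function:
    in each state, put all probability on the action with the highest
    one-step lookahead value."""
    n_states = len(P)
    n_actions = len(P[0])
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            for prob, s_next, reward in P[s][a]:
                q[a] += prob * (reward + gamma * V[s_next])
        policy[s, np.argmax(q)] = 1.0  # deterministic greedy policy
    return policy
```

Alternating these two functions, re-evaluating the greedy policy and improving it again until the policy no longer changes, is exactly the policy iteration scheme that the rest of this section builds on.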