In this section, we are going to analyze a strategy for finding an optimal policy based on complete knowledge of the environment (in terms of transition probabilities and expected returns). The first step is to define a method that can be employed to build a greedy policy. Let's suppose we're working with a finite MDP and a generic policy, π; we can define the intrinsic value of a state, s_t, as the expected discounted return obtained by an agent that starts from s_t and follows the stochastic policy, π:
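The definition just described corresponds to the standard state-value function; a sketch of the formula, consistent with the surrounding text (symbols γ for the discount factor and r for the reward are assumed), is:

```latex
V^{\pi}(s_t) = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t\right]
```

Here the expectation is taken over trajectories generated by following π from s_t, with γ ∈ [0, 1) discounting rewards that are obtained further in the future.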
In this case, we are assuming that, as the agent follows π, a state s_a is more useful than s_b if the expected discounted return starting from s_a is greater than the one obtained starting from s_b.
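With complete knowledge of the transition probabilities and expected rewards, the value of every state under a fixed policy can be computed by iterating the expectation above until it converges. The following is a minimal sketch of iterative policy evaluation on a hypothetical two-state, two-action MDP; the arrays `P`, `R`, and `policy`, and all the numbers in them, are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Hypothetical toy MDP (all values are illustrative):
# P[s, a, s'] = transition probability, R[s, a] = expected reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# pi(a|s): a uniform stochastic policy over the two actions
policy = np.array([[0.5, 0.5],
                   [0.5, 0.5]])
gamma = 0.9  # discount factor

def evaluate_policy(P, R, policy, gamma, tol=1e-8):
    """Iteratively compute V(s) = sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s,a,s') V(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a]: expected immediate reward plus discounted value of successors
        Q = R + gamma * (P @ V)
        V_new = (policy * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V = evaluate_policy(P, R, policy, gamma)
```

Because γ < 1, the update is a contraction, so the loop is guaranteed to converge to the unique value function of π; comparing the resulting entries of `V` is exactly the comparison between states described above.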