In this recipe, we solve the Blackjack game with off-policy MC.
In Step 4, the off-policy MC control algorithm does the following tasks:
- It initializes the Q-function with arbitrary small values.
- It runs n_episode episodes.
- For each episode, it follows the behavior policy to generate the states, actions, and rewards; it evaluates the target policy with first-visit MC prediction, using only the common steps (those where the action taken by the behavior policy matches the action the target policy would take); and it updates the Q-function with the importance-weighted return.
- Finally, after all episodes have run, the resulting Q-function is taken as the optimal Q-function, and the optimal policy is obtained by selecting the highest-valued action for each state.
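The tasks listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the recipe's Step 4 code. It uses a toy hit-or-stick game in place of the real Blackjack environment and plain lists in place of PyTorch tensors. It also uses the incremental weighted importance sampling update, which may differ from the weighting used in the recipe. The helper names (`behavior_policy`, `run_episode`, `update_from_episode`, `greedy`) and the body of `mc_control_off_policy` are hypothetical. Because the player's total only increases within an episode, no state repeats, so first-visit and every-visit updates coincide here.

```python
import random
from collections import defaultdict

N_ACTION = 2  # 0 = stick, 1 = hit (toy game, an assumption; not the real Blackjack env)


def behavior_policy(state):
    """Uniform random behavior policy: returns the probability of each action."""
    return [1.0 / N_ACTION] * N_ACTION


def run_episode(rng):
    """Generate one episode of (state, action, reward) by following the behavior policy."""
    total = rng.randint(12, 20)
    episode = []
    while True:
        state = total
        action = rng.choices(range(N_ACTION), weights=behavior_policy(state))[0]
        if action == 0:  # stick: toy reward that grows with the total held
            episode.append((state, action, (total - 16) / 5.0))
            return episode
        total += rng.randint(1, 10)  # hit: draw a card worth 1..10
        if total > 21:  # bust ends the episode with a loss
            episode.append((state, action, -1.0))
            return episode
        episode.append((state, action, 0.0))


def greedy(q_values):
    """Target policy: index of the highest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def update_from_episode(episode, Q, C, gamma=1.0):
    """Walk the episode backward, updating Q with the importance-weighted return."""
    G, W = 0.0, 1.0  # return so far and importance weight
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        C[state][action] += W  # cumulative sum of weights for this pair
        Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
        if action != greedy(Q[state]):
            break  # the target policy would act differently: earlier steps are not common
        W *= 1.0 / behavior_policy(state)[action]


def mc_control_off_policy(n_episode, gamma=1.0, seed=0):
    """Run n_episode episodes, then read the greedy policy off the learned Q-function."""
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0] * N_ACTION)
    C = defaultdict(lambda: [0.0] * N_ACTION)
    for _ in range(n_episode):
        update_from_episode(run_episode(rng), Q, C, gamma)
    policy = {state: greedy(q_values) for state, q_values in Q.items()}
    return Q, policy
```

Calling `mc_control_off_policy(2000)` returns the learned Q-function and a dictionary mapping each visited total to 0 (stick) or 1 (hit). Processing each episode backward makes it easy to stop at the first step where the behavior action and the greedy target action disagree, which is what restricts learning to the common steps.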
An off-policy method learns about the target policy by observing another agent and reusing the experience ...