In the preceding section, we discussed that if we follow a deterministic policy, we might never visit some state/action pairs, which would undermine our efforts to estimate the action-value function. We solved this problem with the exploring-starts assumption, but this assumption is unusual and it would be best to avoid it. The core of the problem is that we follow the policy blindly, which prevents us from exploring all possible state/action pairs. Can we solve this by introducing a different kind of policy? It turns out we can (surprise!). In this section, we'll introduce MC control with a non-deterministic epsilon-greedy (ε-greedy) policy. The core idea is simple: most of the time, the ε-greedy policy behaves like the greedy policy and selects the action with the highest estimated action value, but with probability ε it selects a random action instead.
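To make the idea concrete, here is a minimal sketch of ε-greedy action selection. The function name, the dictionary-based representation of the action-value estimates Q, and the parameter names are assumptions for illustration, not the book's own code:

```python
import numpy as np

def select_eps_greedy_action(Q, state, n_actions, epsilon=0.1):
    """Select an action with an epsilon-greedy policy.

    With probability epsilon we explore (a uniformly random action);
    otherwise we exploit (the action with the highest estimated value).
    Q is assumed to be a dict mapping (state, action) pairs to value estimates.
    """
    if np.random.random() < epsilon:
        # Explore: pick a random action, so every state/action pair
        # keeps a nonzero probability of being visited
        return np.random.randint(n_actions)
    # Exploit: pick the greedy action with respect to the current estimates
    action_values = [Q.get((state, a), 0.0) for a in range(n_actions)]
    return int(np.argmax(action_values))
```

Because every action keeps at least probability ε / n_actions of being chosen in every state, the policy continues to visit all state/action pairs without requiring exploring starts.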