October 2019
Intermediate to advanced
340 pages
8h 39m
English
In fact, the optimal policy was obtained after around 50 episodes. We can plot the length of each episode over time to verify this. The total reward obtained in each episode over time is also an option.
>>> length_episode = [0] * n_episode>>> total_reward_episode = [0] * n_episode
>>> def q_learning(env, gamma, n_episode, alpha): ... n_action = env.action_space.n ... Q = defaultdict(lambda: torch.zeros(n_action)) ... for episode in range(n_episode): ... state = env.reset() ... is_done = False ... while ...
Read now
Unlock full access