In this chapter, we discussed how temporal difference (TD) learning, the third major thread of RL, came together to produce TD(0) and Q-learning. We began by exploring the temporal credit assignment problem and how it differs from the broader credit assignment problem. From there, we saw how TD learning works and how TD(0), or one-step TD, extends naturally to Q-learning.
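The step from TD(0) to Q-learning can be sketched in a few lines. This is a minimal illustration, not code from the chapter: the function names, the dictionary-based tables, and the hyperparameter values are all assumptions made for the example.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99  # illustrative learning rate and discount factor

def td0_update(V, s, r, s_next):
    # TD(0): nudge V(s) toward the one-step target r + gamma * V(s')
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

def q_update(Q, s, a, r, s_next, actions):
    # Q-learning: the same TD error, applied to action values and
    # bootstrapping off the greedy action in the next state
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

V = defaultdict(float)
td0_update(V, 0, 1.0, 1)          # V[0] moves from 0.0 to 0.1

Q = defaultdict(float)
q_update(Q, 0, 2, 1.0, 1, range(4))  # Q[(0, 2)] moves from 0.0 to 0.1
```

The only structural difference between the two updates is the bootstrap target: TD(0) uses the value of the next state, while Q-learning takes a max over next actions, which is what makes it off-policy.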

After that, we returned to the FrozenLake environment to see how the new algorithm compared with our earlier efforts. Model-free, off-policy Q-learning then allowed us to tackle the more difficult Taxi environment. Along the way, we learned how to tune hyperparameters and, finally, examined the difference between off-policy and on-policy learning. In the next chapter, we ...
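One of the hyperparameters tuned in experiments like these is the exploration rate used for action selection. The sketch below shows a common epsilon-greedy pattern with a decaying epsilon; the function names, decay schedule, and values are assumptions for illustration, not the book's exact code.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    # Explore with probability epsilon, otherwise exploit the greedy action
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def decay_epsilon(epsilon, decay=0.995, floor=0.01):
    # A common tuning knob: shrink epsilon each episode toward a floor,
    # shifting the agent from exploration to exploitation over time
    return max(floor, epsilon * decay)

Q = {(0, 1): 5.0}                      # state 0, action 1 looks best
greedy = epsilon_greedy(Q, 0, [0, 1, 2], epsilon=0.0)  # always exploits
eps = decay_epsilon(1.0)               # 1.0 -> 0.995
```

How fast epsilon decays, and where its floor sits, are exactly the kind of knobs that hyperparameter tuning on an environment like Taxi forces you to get right.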
