Q-learning is a model-free reinforcement learning algorithm. It is useful when the agent knows all the possible states and the actions that lead to those states within the search space. Q-learning can trade off immediate reward against long-term reward, which lets it optimize for the goal of maximizing the reward accumulated over a sequence of actions.
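The balance between immediate and long-term reward can be made concrete with the standard tabular Q-learning update, in which the value of a state-action pair is moved toward the immediate reward plus the discounted best value of the next state. The following is a minimal sketch, not a definitive implementation: the function name `q_update`, the learning rate `alpha`, the discount factor `gamma`, and the small 3-state table are all assumptions chosen for illustration and are unrelated to the maze described in this section.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    q is a list of rows, one row per state, one entry per action.
    alpha (learning rate) and gamma (discount factor) are illustrative defaults.
    """
    # Best value reachable from the next state under the current estimates.
    best_next = max(q[next_state])
    # The target combines the immediate reward with the discounted long-term estimate.
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha of the way toward the target.
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Hypothetical table: 3 states, 2 actions, initialised to zero.
q = [[0.0, 0.0] for _ in range(3)]
q[2] = [0.0, 10.0]  # assume state 2 is already known to be valuable

# Taking action 0 in state 1 yields a small reward and leads to state 2.
new_value = q_update(q, state=1, action=0, reward=1.0, next_state=2)
```

With `gamma` close to 1 the update weights the long-term estimate heavily, and with `gamma` close to 0 it is driven almost entirely by the immediate reward.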
Let us explain this with a simple example. Consider a maze with six locations (L ∈ {0, 1, 2, 3, 4, 5}). When the agent reaches location 5, it finds the treasure (the end state, or the agent's goal). The maze has the following structure, in which the bi-directional arrows indicate possible state transitions and the numbers indicate the rewards:
The state transitions ...