Now we have to face the most demanding phase: training our system. In the Q-learning section, we said that the Gym library focuses on the episodic setting of reinforcement learning, in which the agent's experience is divided into a series of episodes. In each episode, the initial state of the agent is sampled at random from a distribution, and the interaction proceeds until the environment reaches a terminal state. This procedure is repeated for every episode, with the aim of maximizing the expected total reward per episode and reaching a high level of performance in as few episodes as possible.
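To make the episodic setting concrete, the following is a minimal sketch rather than the code used later in this chapter. It assumes a hypothetical `ToyCorridorEnv` class that only imitates the classic Gym `reset()`/`step()` interface (with `step()` returning state, reward, done flag, and info), so that the example runs without installing Gym. The helper `run_episodes` and the reward values are likewise illustrative assumptions.

```python
import random


class ToyCorridorEnv:
    """Hypothetical stand-in for a Gym environment (not part of Gym).

    The agent walks along cells 0..size-1; the rightmost cell is terminal.
    """

    def __init__(self, size=5, max_steps=20, seed=None):
        self.size = size
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.state = 0
        self.steps = 0

    def reset(self):
        # Sample the initial state uniformly from the non-terminal cells.
        self.state = self.rng.randrange(self.size - 1)
        self.steps = 0
        return self.state

    def step(self, action):
        # Action 1 moves right; any other action moves left.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.size - 1)
        self.steps += 1
        reached_goal = self.state == self.size - 1
        reward = 1.0 if reached_goal else -0.1  # assumed reward scheme
        done = reached_goal or self.steps >= self.max_steps
        return self.state, reward, done, {}


def run_episodes(env, policy, n_episodes):
    """Run n_episodes and return the total reward collected in each one."""
    returns = []
    for _ in range(n_episodes):
        state = env.reset()      # a new episode starts from a sampled state
        done = False
        total_reward = 0.0
        while not done:          # interact until a terminal state is reached
            action = policy(state)
            state, reward, done, _ = env.step(action)
            total_reward += reward
        returns.append(total_reward)
    return returns


if __name__ == "__main__":
    env = ToyCorridorEnv(seed=0)
    # A purely random policy, used here only as a baseline.
    returns = run_episodes(env, lambda s: random.randint(0, 1), n_episodes=10)
    print("average return per episode:", sum(returns) / len(returns))
```

The average of the per-episode returns is an empirical estimate of the quantity that training tries to maximize. A learning agent would replace the random policy with one that improves from episode to episode.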
In the learning phase, we must estimate an evaluation function. This function must be able to evaluate, through the sum of the rewards, the convenience ...