Now we come to the most demanding phase: training our system. In the previous section, we said that the Gym library focuses on the episodic setting of reinforcement learning, in which the agent's experience is divided into a series of episodes. At the start of each episode, the agent's initial state is randomly sampled from a distribution, and the interaction proceeds until the environment reaches a terminal state. This procedure is repeated for each episode, with the aim of maximizing the expected total reward per episode and reaching a high level of performance in as few episodes as possible.
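The episodic loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyCorridor` is a hypothetical stand-in environment written for this example (it is not part of Gym), and it merely imitates the classic Gym convention in which `reset()` returns an initial observation and `step(action)` returns an `(observation, reward, done, info)` tuple. The agent here acts at random, so no learning takes place yet; the point is just the structure of episodes and the total reward collected in each.

```python
import random


class ToyCorridor:
    """Hypothetical stand-in for a Gym-style environment (not part of Gym).

    The agent walks along a corridor of `length` cells; the last cell is
    the terminal state.
    """

    def __init__(self, length=5, seed=None):
        self.length = length
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        # The initial state is sampled at random from the non-terminal cells.
        self.state = self.rng.randrange(self.length - 1)
        return self.state

    def step(self, action):
        # Action 1 moves right; any other action moves left (clipped at 0).
        move = 1 if action == 1 else -1
        self.state = max(0, self.state + move)
        done = self.state == self.length - 1
        # Small penalty per step, positive reward on reaching the goal.
        reward = 1.0 if done else -0.1
        return self.state, reward, done, {}


def run_episodes(env, n_episodes=10, max_steps=100, seed=0):
    """Run several episodes with a random policy; return total reward per episode."""
    policy_rng = random.Random(seed)
    totals = []
    for _ in range(n_episodes):
        env.reset()                      # a new episode starts from a sampled state
        total = 0.0
        for _ in range(max_steps):       # cap the length in case the goal is never reached
            action = policy_rng.choice([0, 1])
            _, reward, done, _ = env.step(action)
            total += reward
            if done:                     # terminal state reached: the episode ends
                break
        totals.append(total)
    return totals


if __name__ == "__main__":
    totals = run_episodes(ToyCorridor(seed=42))
    print("average total reward per episode:", sum(totals) / len(totals))
```

Training would replace the random choice of action with a policy that improves from episode to episode, so that the average total reward rises as more episodes are played.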
In the learning phase, we must estimate an evaluation function. This function must be able to evaluate, through the sum of the rewards, the convenience ...