January 2020
Intermediate to advanced
432 pages
10h 18m
English
Finally, we can see how all of this comes together to train the agent to learn a policy. Open Chapter_6_DQN.py again and follow the next exercise to see how the loss is calculated:
def compute_td_loss(batch_size):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state = autograd.Variable(torch.FloatTensor(np.float32(state)))
    next_state = autograd.Variable(torch.FloatTensor(np.float32(next_state)), volatile=True)
    action = autograd.Variable(torch.LongTensor(action))
    reward = autograd.Variable(torch.FloatTensor(reward))
    done = autograd.Variable(torch.FloatTensor(done))

    q_values = model(state)
    next_q_values ...
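The listing above is cut off before the loss itself is computed. As a rough guide to what a temporal-difference (TD) loss of this kind usually does, the sketch below works through the standard one-step Q-learning target with plain Python lists so the arithmetic is easy to follow. It is not the book's remaining code: the function name td_loss, the discount factor gamma, the mean-squared-error form, and the sample numbers are all assumptions made for illustration.

```python
def td_loss(q_values, next_q_values, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error over a batch, using plain Python lists.

    Hypothetical helper for illustration; gamma is an assumed discount factor.
    """
    errors = []
    for q, next_q, a, r, d in zip(q_values, next_q_values, actions, rewards, dones):
        # Bootstrapped target: reward plus the discounted best next-state value.
        # The bootstrap term is zeroed out when the episode ended (d == 1.0).
        target = r + gamma * max(next_q) * (1.0 - d)
        # Squared difference between the Q-value of the action taken and the target.
        errors.append((q[a] - target) ** 2)
    return sum(errors) / len(errors)


# Made-up batch of two transitions, each with two possible actions.
q_values = [[1.0, 2.0], [0.5, 0.0]]       # Q(s, .) for each sampled state
next_q_values = [[3.0, 1.0], [4.0, 2.0]]  # Q(s', .) for each next state
actions = [1, 0]                          # index of the action taken
rewards = [1.0, 2.0]
dones = [0.0, 1.0]                        # second transition ends its episode

loss = td_loss(q_values, next_q_values, actions, rewards, dones, gamma=0.5)
print(loss)  # 1.25
```

In the first transition the target is 1.0 + 0.5 * 3.0 = 2.5, so the squared error against Q = 2.0 is 0.25. In the second, the episode has ended, so the target is just the reward of 2.0 and the squared error against Q = 0.5 is 2.25. The mean of the two is 1.25.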