The idea here is to learn a policy that directly maximizes an evaluation metric instead of minimizing the loss from the maximum likelihood objective. To do this, a reinforcement learning approach is taken, and the model is trained with a self-critical policy gradient algorithm. At each training iteration, two separate output sequences are generated:
- The sampled output, obtained by sampling from the model's output probability distribution at each decoding time step
- The baseline output, obtained by greedily taking the most probable token from the output distribution at each decoding time step (a sketch of the resulting update follows below)
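
In self-critical training, both sequences are scored with the evaluation metric, and the difference between the sampled reward and the baseline (greedy) reward weights the log-probability of the sampled sequence: sampled outputs that beat the baseline are made more likely, and those that fall short are made less likely. Below is a minimal PyTorch-style sketch of this loss; the function name `self_critical_loss`, the tensor shapes, and the dummy rewards are assumptions for illustration, and the actual reward would come from the task metric (e.g. ROUGE) rather than the random values used here.

```python
import torch

def self_critical_loss(log_probs_sampled: torch.Tensor,
                       reward_sampled: torch.Tensor,
                       reward_greedy: torch.Tensor) -> torch.Tensor:
    """Self-critical policy-gradient loss (sketch).

    log_probs_sampled: (batch, seq_len) log-probabilities of the sampled tokens
    reward_sampled:    (batch,) metric score of each sampled sequence
    reward_greedy:     (batch,) metric score of each greedy baseline sequence
    """
    # Advantage: how much better the sampled sequence scored than the greedy baseline.
    advantage = (reward_sampled - reward_greedy).detach()      # (batch,)
    # Sequence log-probability is the sum of per-token log-probabilities.
    seq_log_prob = log_probs_sampled.sum(dim=1)                # (batch,)
    # Minimizing this loss raises the probability of sampled sequences that
    # outperform the baseline and lowers it for those that do not.
    return -(advantage * seq_log_prob).mean()

# Hypothetical usage with dummy values.
batch, seq_len = 4, 10
log_probs = torch.randn(batch, seq_len, requires_grad=True).log_softmax(dim=-1)
r_sample = torch.rand(batch)   # e.g. ROUGE of sampled summaries
r_greedy = torch.rand(batch)   # e.g. ROUGE of greedy summaries
loss = self_critical_loss(log_probs, r_sample, r_greedy)
loss.backward()
```

Because the greedy output serves as its own baseline, no separate value network is needed to reduce the variance of the gradient estimate.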