April 2018
The idea here is to learn a policy that directly maximizes an evaluation metric, instead of minimizing the loss derived from the maximum-likelihood objective. To do this, a reinforcement learning approach is used, with a self-critical policy gradient algorithm for training. Two separate output sequences are generated at each training iteration:
One sequence is obtained by sampling from the model's probability distribution over the next token at each decoding time step.
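The training signal can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not in the text above: the second sequence is taken to be the greedy decoding of the same model, used as a baseline, as is usual in self-critical training; `overlap_reward` is a hypothetical stand-in for the real evaluation metric; and the per-step `logits` are fixed in advance, whereas a real decoder would condition each step on the tokens generated so far.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def overlap_reward(candidate, reference):
    # Hypothetical metric: fraction of distinct reference tokens
    # that also appear in the candidate sequence.
    ref = set(reference)
    return len(ref & set(candidate)) / len(ref)

def self_critical_loss(logits, reference, rng):
    # logits has shape (time_steps, vocab_size).
    log_probs = log_softmax(logits)
    probs = np.exp(log_probs)

    # Sequence 1: sample one token per decoding step from the distribution.
    sampled = [int(rng.choice(len(p), p=p)) for p in probs]
    # Sequence 2 (assumed baseline): pick the most likely token at each step.
    greedy = [int(i) for i in log_probs.argmax(axis=-1)]

    # Log-probability of the sampled sequence under the model.
    sampled_logp = log_probs[np.arange(len(sampled)), sampled].sum()

    # The greedy sequence's reward acts as the baseline for the sampled one.
    advantage = overlap_reward(sampled, reference) - overlap_reward(greedy, reference)

    # Minimizing this raises the probability of samples that beat the baseline.
    return -advantage * sampled_logp, sampled, greedy

rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(5, 8))  # 5 decoding steps, vocabulary of 8
loss, sampled, greedy = self_critical_loss(toy_logits, [1, 3, 5, 2, 7], rng)
```

In this sketch, a sampled sequence that scores higher than the greedy one yields a positive advantage, so minimizing the loss increases its log-probability; a sample that scores lower is pushed down. In an actual model, the gradient would flow through `sampled_logp` while both rewards are treated as constants.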