The authors note that bidirectional models with denoising autoencoding pretraining (such as BERT) achieve better performance than unidirectional autoregressive models (such as Transformer-XL). But as we mentioned in the Pretraining subsection of the Bidirectional encoder representations from transformers section, the [MASK] token introduces a discrepancy between the pretraining and fine-tuning steps. To overcome these limitations, the authors of Transformer-XL propose XLNet (see XLNet: Generalized Autoregressive Pretraining for Language Understanding at https://arxiv.org/abs/1906.08237): a generalized autoregressive pretraining mechanism that enables learning bidirectional contexts by maximizing the expected likelihood of a sequence over all permutations of its factorization order.
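To make the permutation idea concrete, here is a minimal NumPy sketch (not the paper's implementation, and the names `perm`, `order`, and `mask` are ours): we sample one random factorization order over the token positions and derive the attention mask it implies, so that each position can only attend to the positions that precede it in that order, while the sequence itself stays in its original word order:

```python
import numpy as np

# Illustrative sketch of permutation language modeling (assumed
# simplification of XLNet's objective, not the actual model code).
rng = np.random.default_rng(0)

tokens = ["the", "cat", "sat", "down"]
n = len(tokens)

# Sample one factorization order: the sequence in which positions
# are predicted (the token order in the input is unchanged).
perm = rng.permutation(n)

# order[pos] = rank of position pos within the factorization order
order = {pos: rank for rank, pos in enumerate(perm)}

# mask[i, j] == 1 means position i may attend to position j,
# which holds exactly when j comes before i in the sampled order.
mask = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        if order[j] < order[i]:
            mask[i, j] = 1
```

Averaged over many sampled permutations, every position eventually conditions on every other position, which is how an autoregressive objective ends up capturing bidirectional context without ever introducing a [MASK] token.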

Get Advanced Deep Learning with Python now with O’Reilly online learning.