XLNet

The authors note that bidirectional models with denoising autoencoding pretraining (such as BERT) achieve better performance than unidirectional autoregressive models (such as Transformer-XL). But as we mentioned in the Pretraining subsection of the Bidirectional encoder representations from transformers section, the [MASK] token introduces a discrepancy between the pretraining and fine-tuning steps. To overcome these limitations, the authors of Transformer-XL propose XLNet (see XLNet: Generalized Autoregressive Pretraining for Language Understanding at https://arxiv.org/abs/1906.08237): a generalized autoregressive pretraining mechanism that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
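To make the permutation language modeling objective concrete, the following is a minimal sketch rather than the actual XLNet implementation: for each sampled factorization order, every token is predicted using only the tokens that precede it in that order, and the expected likelihood over all orders is approximated by averaging the loss over a few sampled permutations. The names permutation_mask, plm_loss, and the toy_logits stand-in for a real Transformer that respects the visibility mask are all illustrative assumptions, and the sketch omits XLNet's two-stream attention and its trick of predicting only the last tokens of each permutation:

import torch
import torch.nn.functional as F

def permutation_mask(perm):
    """mask[i, j] is True if token i may attend to token j, that is,
    if j comes before i in the sampled factorization order perm."""
    seq_len = perm.shape[0]
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[perm] = torch.arange(seq_len)  # rank[t] = step at which token t is predicted
    return rank.unsqueeze(1) > rank.unsqueeze(0)

def plm_loss(logits_fn, tokens, n_orders=4):
    """Approximate the expected autoregressive log-likelihood over
    factorization orders by averaging over n_orders sampled permutations."""
    seq_len = tokens.shape[0]
    losses = []
    for _ in range(n_orders):
        perm = torch.randperm(seq_len)     # sample a factorization order
        mask = permutation_mask(perm)      # bidirectional context, autoregressive objective
        logits = logits_fn(tokens, mask)   # (seq_len, vocab_size) predictions per position
        losses.append(F.cross_entropy(logits, tokens))
    return torch.stack(losses).mean()

# Toy usage: a bag-of-visible-tokens predictor stands in for a Transformer
vocab_size, seq_len = 100, 8
tokens = torch.randint(0, vocab_size, (seq_len,))
W = torch.randn(vocab_size, vocab_size, requires_grad=True)

def toy_logits(tokens, mask):
    one_hot = F.one_hot(tokens, vocab_size).float()
    visible = mask.float() @ one_hot / mask.sum(dim=1, keepdim=True).clamp(min=1)
    return visible @ W                     # (seq_len, vocab_size)

loss = plm_loss(toy_logits, tokens)
loss.backward()                            # gradients flow to W

Because each sampled order yields a valid autoregressive factorization, the model sees every position both to the left and to the right of a target token across different permutations, which is how the objective captures bidirectional context without a [MASK] token.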