XLNet
The authors note that bidirectional models pretrained with denoising autoencoding (such as BERT) outperform unidirectional autoregressive models (such as Transformer-XL). But as we mentioned in the Pretraining subsection of the Bidirectional encoder representations from transformers section, the [MASK] token introduces a discrepancy between the pretraining and fine-tuning steps. To overcome these limitations, the authors of Transformer-XL propose XLNet (see XLNet: Generalized Autoregressive Pretraining for Language Understanding at https://arxiv.org/abs/1906.08237): a generalized autoregressive pretraining mechanism that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
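To build intuition for permutation-based pretraining, the following is a minimal sketch (not XLNet's actual implementation, which uses two-stream attention) of how sampling a factorization order lets each target position condition on tokens from both its left and its right, without ever corrupting the input with a [MASK] token. The helper name `permutation_lm_contexts` is hypothetical:

```python
import random

def permutation_lm_contexts(tokens, seed=0):
    """Sample one factorization order z and, for each position,
    record which positions it may attend to under that order.
    Hypothetical illustration of permutation language modeling."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # a random factorization order z
    contexts = {}
    for step, pos in enumerate(order):
        # Position pos is predicted from tokens that come earlier in z.
        # Those tokens may lie to its left OR right in the original
        # sequence, which is how bidirectional context arises.
        contexts[pos] = sorted(order[:step])
    return order, contexts

order, ctx = permutation_lm_contexts(["the", "cat", "sat"])
```

In expectation over all sampled orders, every position gets to see every other position as context, while each individual prediction remains strictly autoregressive along the sampled order.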