The pretraining step is illustrated on the left-hand side of the diagram in the Bidirectional encoder representations from transformers section. The authors of the paper trained the BERT model using two unsupervised training tasks: masked language modeling (MLM) and next sentence prediction (NSP).
We'll start with MLM, where the model is presented with an input sequence and its goal is to predict a missing word in that sequence. In this case, BERT acts as a denoising autoencoder in the sense that it tries to reconstruct its intentionally corrupted input. MLM is similar in nature to the CBOW objective of the word2vec model (see Chapter 6, Language Modeling). To solve this task, the BERT encoder output is extended with a fully connected layer with softmax activation, which outputs a probability distribution over the token vocabulary at each masked position.
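To make the MLM objective concrete, here is a minimal sketch of masked word prediction with a pretrained BERT model. It assumes the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint, which are not necessarily the tools used elsewhere in this chapter; the idea is simply to show the model reconstructing a deliberately corrupted input.

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load a pretrained BERT checkpoint with the MLM head (assumed checkpoint name)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Corrupt the input by replacing one word with the special [MASK] token
text = 'The capital of France is [MASK].'
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    # logits has shape [batch_size, sequence_length, vocab_size]
    logits = model(**inputs).logits

# Locate the masked position and take the most probable vocabulary token
mask_index = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # a well-trained model should print 'paris'

Note that the fully connected prediction layer is only used during pretraining; for downstream tasks it is discarded and replaced with a task-specific head.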