Pretraining

The pretraining step is illustrated on the left-hand side of the diagram in the Bidirectional encoder representations from transformers section. The authors of the paper trained the BERT model using two unsupervised training tasks: masked language modeling (MLM) and next sentence prediction (NSP).

We'll start with MLM, where the model is presented with an input sequence and its goal is to predict the masked (missing) words in that sequence. In this case, BERT acts as a denoising autoencoder in the sense that it tries to reconstruct its intentionally corrupted input. MLM is similar in nature to the CBOW objective of the word2vec model (see Chapter 6, Language Modeling). To solve this task, the BERT encoder output is extended with a fully connected layer with softmax activation, which produces a probability distribution over the tokens of the vocabulary for each masked position.
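To make the mechanics concrete, here is a minimal PyTorch sketch of the MLM setup, not the authors' implementation: a fully connected output layer applied to the encoder's hidden states, plus the input-corruption step. The names encoder, hidden_size, vocab_size, and mask_token_id are placeholders; the 15% masking rate and the 80/10/10 replacement rule follow the original BERT paper.

```python
import torch
import torch.nn as nn

class MLMHead(nn.Module):
    """Maps each encoder output vector to a distribution over the vocabulary."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, encoder_output: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_size) -> (batch, seq_len, vocab_size)
        return self.fc(encoder_output)

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt the input for the denoising objective (BERT's 80/10/10 rule)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    # Select ~15% of the positions to predict; the rest are ignored in the loss (-100)
    masked = torch.rand(input_ids.shape) < mlm_prob
    labels[~masked] = -100
    # 80% of the selected positions are replaced with the [MASK] token
    replace = masked & (torch.rand(input_ids.shape) < 0.8)
    input_ids[replace] = mask_token_id
    # 10% are replaced with a random token; the remaining 10% are left unchanged
    random_tok = masked & ~replace & (torch.rand(input_ids.shape) < 0.5)
    input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    return input_ids, labels

# Training step sketch (encoder is a placeholder for the BERT encoder):
# corrupted_ids, labels = mask_tokens(input_ids, mask_token_id, vocab_size)
# logits = MLMHead(hidden_size, vocab_size)(encoder(corrupted_ids))
# loss = nn.CrossEntropyLoss(ignore_index=-100)(
#     logits.view(-1, vocab_size), labels.view(-1))
```

Note that the cross-entropy loss is computed only over the corrupted positions (the label value -100 marks positions to ignore). This is what makes the objective a denoising one: the model has to reconstruct the corrupted tokens from their bidirectional context.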
