Before going into each training step, let's discuss the input and output data representations, which both steps share. Somewhat similar to fastText (see Chapter 6, Language Modeling), BERT uses a data-driven tokenization algorithm called WordPiece (https://arxiv.org/abs/1609.08144). Instead of a vocabulary of full words, WordPiece builds a vocabulary of subword tokens in an iterative process: it starts from individual characters and repeatedly merges the pair of units that most improves the likelihood of the training data, until the vocabulary reaches a predetermined size (30,000 tokens in the case of BERT). This approach has two main advantages:
BERT can handle a ...
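To make the subword idea concrete, the following is a minimal sketch of how a word is split at inference time once the WordPiece vocabulary has been trained, using greedy longest-match-first lookup. The toy vocabulary and the helper name wordpiece_tokenize are assumptions for illustration only; the '##' prefix marking continuation pieces follows BERT's convention.

# A minimal sketch of greedy longest-match-first WordPiece splitting.
# The toy vocabulary below is an assumption for illustration; BERT's
# trained vocabulary contains roughly 30,000 entries.

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split a single word into subword tokens drawn from vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # BERT marks continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # the word cannot be decomposed
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"play", "##ing", "un", "##known"}
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("unknown", toy_vocab))  # ['un', '##known']

Because rare or unseen words decompose into known subword pieces, the tokenizer rarely has to fall back to the [UNK] token, which is one reason a fixed vocabulary of 30,000 entries can cover an open-ended set of words.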