10 ALBERT, adapters, and multitask adaptation strategies
This chapter covers
- Applying embedding factorization and parameter sharing across layers
- Fine-tuning a model from the BERT family on multiple tasks
- Splitting a transfer learning experiment into multiple steps
- Applying adapters to a model from the BERT family
In the previous chapter, we began exploring adaptation strategies for the deep NLP transfer learning architectures introduced so far. In other words, given a pretrained architecture such as ELMo, BERT, or GPT, how can transfer learning be carried out more efficiently? We covered two critical ideas behind the method ULMFiT: discriminative fine-tuning and gradual unfreezing.
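As a quick refresher, the two ULMFiT ideas can be sketched in a few lines of PyTorch. This is a minimal illustration, not ULMFiT's actual implementation: the toy "encoder" is just a stack of linear layers standing in for pretrained transformer blocks, and the per-layer decay factor of 2.6 follows the value suggested in the ULMFiT paper.

```python
import torch
from torch import nn

# Toy stand-in for a pretrained encoder: 4 "layers" (hypothetical, for illustration).
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])

# Discriminative fine-tuning: lower layers capture more general features,
# so they receive smaller learning rates than the upper, task-specific layers.
base_lr = 1e-3
decay = 2.6  # per-layer decay factor from the ULMFiT paper
param_groups = [
    {"params": layer.parameters(),
     "lr": base_lr / (decay ** (len(layers) - 1 - i))}
    for i, layer in enumerate(layers)
]
optimizer = torch.optim.SGD(param_groups)

# Gradual unfreezing: freeze everything, then unfreeze one more layer
# per epoch, starting from the top of the stack.
for layer in layers:
    for p in layer.parameters():
        p.requires_grad = False

for epoch in range(len(layers)):
    for p in layers[len(layers) - 1 - epoch].parameters():
        p.requires_grad = True  # unfreeze the next-deepest layer
    # ... run one epoch of fine-tuning here ...
```

After the loop, every layer is trainable, and the optimizer's parameter groups still carry the layer-wise learning rates, so earlier layers continue to update more conservatively.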