9 ULMFiT and knowledge distillation adaptation strategies

This chapter covers

  • Implementing the strategies of discriminative fine-tuning and gradual unfreezing
  • Executing knowledge distillation between teacher and student BERT models

In this chapter and the next, we cover adaptation strategies for the deep NLP transfer learning architectures presented so far. In other words, given a pretrained architecture such as ELMo, BERT, or GPT, how can we carry out transfer learning more efficiently? Efficiency can be measured in several ways. We focus on parameter efficiency, where the goal is to yield a model with the fewest parameters possible while suffering minimal reduction in performance. ...
