appendix D Adding bells and whistles to the training loop
In this appendix, we enhance the training function for the pretraining and fine-tuning processes covered in chapters 5 to 7. In particular, we cover learning rate warmup, cosine decay, and gradient clipping. We then incorporate these techniques into the training function and use it to pretrain an LLM.
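As a quick preview, the first two techniques can be expressed as a single schedule that maps the training step to a learning rate. The following is a minimal sketch, and the names peak_lr, min_lr, warmup_steps, and total_steps are illustrative rather than the variable names used later in this appendix:

import math

def lr_at_step(step, peak_lr=0.001, min_lr=0.0001,
               warmup_steps=20, total_steps=1000):
    if step < warmup_steps:
        # Linear warmup: ramp the learning rate from near zero up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay: anneal from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

During training, the value returned by such a schedule is assigned to the optimizer's parameter groups at each step before calling optimizer.step().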
To make the code self-contained, we reinitialize the model we trained in chapter 5:
import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,     #1 Vocabulary size
    "context_length": 256,   #2 Shortened context length (orig: 1024)
    "emb_dim": 768,          #3 Embedding dimension
    "n_heads": 12,           #4 Number of attention heads
    "n_layers": 12,          #5 Number of layers
    "drop_rate": 0.1,        #6 Dropout rate
    "qkv_bias": False        #7 Query-key-value bias
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)
...
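As a preview of the third technique, gradient clipping, the following self-contained sketch shows where clipping fits into a single training step. The toy model, dummy data, and max_norm value of 1.0 are illustrative assumptions; later in this appendix, the same clip_grad_norm_ call is applied to the GPT model initialized above:

import torch

toy_model = torch.nn.Linear(4, 1)                # stand-in for the LLM (illustrative)
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3)

inputs = torch.randn(8, 4)                       # dummy batch (illustrative)
targets = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(toy_model(inputs), targets)
loss.backward()                                  # compute gradients
torch.nn.utils.clip_grad_norm_(                  # rescale gradients so their global
    toy_model.parameters(), max_norm=1.0         # L2 norm does not exceed 1.0
)
optimizer.step()                                 # update parameters
optimizer.zero_grad()                            # reset gradients for the next batch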