Build a Large Language Model (From Scratch)

by Sebastian Raschka

Appendix D: Adding bells and whistles to the training loop

In this appendix, we enhance the training function used for the pretraining and fine-tuning processes covered in chapters 5 to 7. In particular, we cover learning rate warmup, cosine decay, and gradient clipping. We then incorporate these techniques into the training function and pretrain an LLM.
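As a rough preview (a minimal sketch, not the book's exact implementation), the two learning-rate techniques combine as follows: linear warmup ramps the learning rate from a small initial value up to a peak over the first few steps, after which cosine decay lowers it smoothly toward a minimum. The function name get_lr and its signature below are illustrative assumptions:

import math

def get_lr(step, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    # Linear warmup: ramp the learning rate from peak_lr / warmup_steps
    # up to peak_lr over the first warmup_steps steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay: follow half a cosine wave from peak_lr down to
    # min_lr over the remaining training steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))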

To make the code self-contained, we reinitialize the model we trained in chapter 5:

import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 256,   # Shortened context length (originally 1024)
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)
...