So far, we’ve covered the general structure of large language models (LLMs) and learned that they are pretrained on vast amounts of text. Specifically, our focus was on decoder-only LLMs based on the transformer architecture, which underlies the models used in ChatGPT and other popular GPT-like LLMs.
During the pretraining stage, LLMs process text one word at a time. Training LLMs with millions to billions of parameters ...
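As a rough illustration of what "one word at a time" means in practice, the following minimal sketch (not code from the book, and with invented token IDs) shows how a token sequence can be turned into next-word prediction pairs, where each growing context is matched with the token that follows it:

```python
# Minimal sketch: forming next-word prediction pairs from a token sequence.
# The token IDs below are made up for illustration only.
token_ids = [40, 367, 2885, 1464, 1807]

# Pair every prefix of the sequence with the token that comes next.
for i in range(1, len(token_ids)):
    context = token_ids[:i]   # tokens seen so far
    target = token_ids[i]     # the next token to predict
    print(f"{context} ---> {target}")
```

During pretraining, the model is repeatedly asked to predict the target token from the context, which is how it learns from unlabeled text without any manually annotated targets.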