Build a Large Language Model (From Scratch)

Name: Build a Large Language Model (From Scratch)
Author: Sebastian Raschka
ISBN: 9781633437166

by Sebastian Raschka

September 2024

Beginner to intermediate

368 pages

9h 38m

English

Manning Publications

Build a Large Language Model (From Scratch)
copyright
contents
preface
acknowledgments
about this book
about the author
about the cover illustration
1 Understanding large language models
1.1 What is an LLM?
1.2 Applications of LLMs
1.3 Stages of building and using LLMs
1.4 Introducing the transformer architecture
1.5 Utilizing large datasets
1.6 A closer look at the GPT architecture
1.7 Building a large language model
2 Working with text data
2.1 Understanding word embeddings
2.2 Tokenizing text
2.3 Converting tokens into token IDs
2.4 Adding special context tokens
2.5 Byte pair encoding
2.6 Data sampling with a sliding window
2.7 Creating token embeddings
2.8 Encoding word positions

3 Coding attention mechanisms
3.1 The problem with modeling long sequences
3.2 Capturing data dependencies with attention mechanisms
3.3 Attending to different parts of the input with self-attention
3.3.1 A simple self-attention mechanism without trainable weights
3.3.2 Computing attention weights for all input tokens
3.4 Implementing self-attention with trainable weights
3.4.1 Computing the attention weights step by step
3.4.2 Implementing a compact self-attention Python class
3.5 Hiding future words with causal attention
3.5.1 Applying a causal attention mask
3.5.2 Masking additional attention weights with dropout
3.5.3 Implementing a compact causal attention class
3.6 Extending single-head attention to multi-head attention
3.6.1 Stacking multiple single-head attention layers
3.6.2 Implementing multi-head attention with weight splits
4 Implementing a GPT model from scratch to generate text
4.1 Coding an LLM architecture
4.2 Normalizing activations with layer normalization
4.3 Implementing a feed forward network with GELU activations
4.4 Adding shortcut connections
4.5 Connecting attention and linear layers in a transformer block
4.6 Coding the GPT model
4.7 Generating text
5 Pretraining on unlabeled data
5.1 Evaluating generative text models
5.1.1 Using GPT to generate text
5.1.2 Calculating the text generation loss
5.1.3 Calculating the training and validation set losses
5.2 Training an LLM
5.3 Decoding strategies to control randomness
5.3.1 Temperature scaling
5.3.2 Top-k sampling
5.3.3 Modifying the text generation function
5.4 Loading and saving model weights in PyTorch
5.5 Loading pretrained weights from OpenAI
6 Fine-tuning for classification
6.1 Different categories of fine-tuning
6.2 Preparing the dataset
6.3 Creating data loaders
6.4 Initializing a model with pretrained weights
6.5 Adding a classification head
6.6 Calculating the classification loss and accuracy
6.7 Fine-tuning the model on supervised data
6.8 Using the LLM as a spam classifier
7 Fine-tuning to follow instructions
7.1 Introduction to instruction fine-tuning
7.2 Preparing a dataset for supervised instruction fine-tuning
7.3 Organizing data into training batches
7.4 Creating data loaders for an instruction dataset
7.5 Loading a pretrained LLM
7.6 Fine-tuning the LLM on instruction data
7.7 Extracting and saving responses
7.8 Evaluating the fine-tuned LLM
7.9 Conclusions
7.9.1 What’s next?
7.9.2 Staying up to date in a fast-moving field
7.9.3 Final words
appendix A Introduction to PyTorch
A.1 What is PyTorch?
A.1.1 The three core components of PyTorch
A.1.2 Defining deep learning
A.1.3 Installing PyTorch
A.2 Understanding tensors
A.2.1 Scalars, vectors, matrices, and tensors
A.2.2 Tensor data types
A.2.3 Common PyTorch tensor operations
A.3 Seeing models as computation graphs
A.4 Automatic differentiation made easy
A.5 Implementing multilayer neural networks
A.6 Setting up efficient data loaders
A.7 A typical training loop
A.8 Saving and loading models
A.9 Optimizing training performance with GPUs
A.9.1 PyTorch computations on GPU devices
A.9.2 Single-GPU training
A.9.3 Training with multiple GPUs
appendix B References and further reading
appendix C Exercise solutions
appendix D Adding bells and whistles to the training loop
D.1 Learning rate warmup
D.2 Cosine decay
D.3 Gradient clipping
D.4 The modified training function
appendix E Parameter-efficient fine-tuning with LoRA
E.1 Introduction to LoRA
E.2 Preparing the dataset
E.3 Initializing the model
E.4 Parameter-efficient fine-tuning with LoRA

Content preview from Build a Large Language Model (From Scratch)

3 Coding attention mechanisms

This chapter covers

The reasons for using attention mechanisms in neural networks
A basic self-attention framework, progressing to an enhanced self-attention mechanism
A causal attention module that allows LLMs to generate one token at a time
Masking randomly selected attention weights with dropout to reduce overfitting
Stacking multiple causal attention modules into a multi-head attention module
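
To make the causal attention item above concrete, here is a minimal sketch in plain Python. It is not the book's implementation: it assumes a simplified self-attention without trainable weights or score scaling, and the function names (`softmax`, `causal_self_attention`) and the toy embedding values are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(inputs):
    """Simplified causal self-attention without trainable weights.

    inputs: list of token embedding vectors (lists of floats).
    Returns (context_vectors, attention_weights).
    """
    context_vectors, attention_weights = [], []
    for i, query in enumerate(inputs):
        # Causal mask: token i may only attend to positions 0..i.
        visible = inputs[: i + 1]
        # Attention scores: dot product of the query with each visible token.
        scores = [sum(q * k for q, k in zip(query, key)) for key in visible]
        weights = softmax(scores)
        # Context vector: weighted sum of the visible token embeddings.
        context = [
            sum(w * tok[d] for w, tok in zip(weights, visible))
            for d in range(len(query))
        ]
        # Future positions get zero weight so each row has full length.
        attention_weights.append(weights + [0.0] * (len(inputs) - i - 1))
        context_vectors.append(context)
    return context_vectors, attention_weights

# Hypothetical toy embeddings for three tokens.
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, weights = causal_self_attention(toy_inputs)
```

Each row of `weights` sums to 1 and is zero to the right of the diagonal, so a token never draws on tokens that come after it. That property is what lets a model generate text one token at a time.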

At this point, you know how to prepare the input text for training LLMs by splitting text into individual word and subword tokens, which can be encoded into vector representations, embeddings, for the LLM.

Now, we will look at an integral part of the LLM architecture itself, attention mechanisms, as illustrated ...


