2 Working with text data

This chapter covers

  • Preparing text for large language model training
  • Splitting text into word and subword tokens
  • Byte pair encoding as a more advanced way of tokenizing text
  • Sampling training examples with a sliding window approach
  • Converting tokens into vectors that feed into a large language model
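To make the first few items above concrete, here is a minimal sketch of how raw text might become input-target pairs for next-word prediction. It is an illustration only, not the chapter's own implementation: the helper names `tokenize`, `build_vocab`, and `sliding_windows` are hypothetical, the tokenizer is a simple regular-expression split rather than byte pair encoding, and the sample sentence is made up.

```python
import re

def tokenize(text):
    # Hypothetical word-level tokenizer: keeps words and punctuation marks
    # as separate tokens and drops whitespace. This is NOT byte pair encoding.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    # Map each unique token to an integer ID (sorted for reproducibility).
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

def sliding_windows(ids, context_size, stride=1):
    # Slide a fixed-size window over the token IDs. The targets are the
    # inputs shifted one position to the right, so each position's label
    # is the next token in the text.
    pairs = []
    for start in range(0, len(ids) - context_size, stride):
        inputs = ids[start:start + context_size]
        targets = ids[start + 1:start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs

text = "The cat sat on the mat."  # made-up sample text
tokens = tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]
pairs = sliding_windows(ids, context_size=4)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A larger `stride` reduces the overlap between consecutive windows. Converting the resulting integer IDs into vectors is a separate step that this sketch leaves out.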

So far, we’ve covered the general structure of large language models (LLMs) and learned that they are pretrained on vast amounts of text. Specifically, our focus was on decoder-only LLMs based on the transformer architecture, which underlies the models used in ChatGPT and other popular GPT-like LLMs.

During the pretraining stage, LLMs process text one word at a time. Training LLMs with millions to billions of parameters ...
