2 Working with text data

This chapter covers

Preparing text for large language model training
Splitting text into word and subword tokens
Byte pair encoding as a more advanced way of tokenizing text
Sampling training examples with a sliding window approach
Converting tokens into vectors that feed into a large language model

So far, we’ve covered the general structure of large language models (LLMs) and learned that they are pretrained on vast amounts of text. Specifically, our focus was on decoder-only LLMs based on the transformer architecture, which underlies the models used in ChatGPT and other popular GPT-like LLMs.

During the pretraining stage, LLMs process text one word at a time. Training LLMs with millions to billions of parameters ...

Get Build a Large Language Model (From Scratch) now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Build a Large Language Model (From Scratch) by Sebastian Raschka

2 Working with text data

This chapter covers

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly