Chapter 3. LLM Ingredients: Tokenization, Learning Objectives & Architectures

In the previous chapter, we dug into the datasets used to train the language models of today. Hopefully, this foray underscored how influential pre-training data is to the resulting model. In this chapter, we will go through the remaining ingredients: vocabulary and tokenization, learning objectives, and model architecture.

Vocabulary and Tokenization

What do you do first when you start learning a new language? You start acquiring its vocabulary, expanding it as you gain proficiency. Let’s define vocabulary here as

All the words in a language that are understood by a specific person

The average native English speaker is said to have a vocabulary of between 20,000 and 35,000 words. Similarly, every language model has its own vocabulary, with sizes typically ranging from 5,000 to 500,000 tokens.

As an example, ...
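
To make this concrete, here is a minimal sketch of how one might peek at a real model’s vocabulary, assuming the Hugging Face transformers library and the GPT-2 tokenizer (neither of which is prescribed by the text):

```python
# A minimal sketch of inspecting a tokenizer's vocabulary, assuming the
# Hugging Face transformers library and GPT-2 (the text does not
# prescribe either). Install with: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 uses a byte-level BPE vocabulary of 50,257 tokens,
# comfortably inside the 5,000-500,000 range cited above.
print(tokenizer.vocab_size)  # 50257

# Words absent from the vocabulary are split into smaller subword
# tokens drawn from it, e.g. something like ['Token', 'ization'].
print(tokenizer.tokenize("Tokenization"))
```

Note that, unlike a human vocabulary, a language model’s vocabulary consists of tokens rather than words: rare or unseen words are decomposed into subword units, so no input ever falls entirely outside the vocabulary.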
