appendix C Exercise solutions

The complete code for the exercise solutions can be found in the supplementary GitHub repository at https://github.com/rasbt/LLMs-from-scratch.

Chapter 2

You can obtain the individual token IDs by passing one string at a time to the encoder:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2 BPE tokenizer, as in chapter 2
print(tokenizer.encode("Ak"))
print(tokenizer.encode("w"))
# ...

This prints

[33901]
[86]
# ...

You can then use the following code to assemble the original string:

print(tokenizer.decode([33901, 86, 343, 86, 220, 959]))

This prints

Akwirw ier

The code for the data loader with max_length=2 and stride=2:

dataloader = create_dataloader(
    raw_text, batch_size=4, max_length=2, stride=2
)

It produces ...
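To see how max_length and stride interact, the following is a minimal pure-Python sketch of sliding-window chunking. It assumes the data loader pairs each input chunk with a target chunk shifted by one token, as described in chapter 2. The function name sliding_windows and the toy token IDs are hypothetical and are not part of the book's code.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that every target chunk still has max_length tokens.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(10))  # placeholder IDs, not real tokenizer output
pairs = sliding_windows(toy_ids, max_length=2, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Because stride equals max_length here, consecutive input chunks do not overlap; a stride smaller than max_length would make them share tokens.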
