September 2024
Beginner to intermediate
368 pages
9h 49m
English
The complete code examples for the exercises’ answers can be found in the supplementary GitHub repository at https://github.com/rasbt/LLMs-from-scratch.
You can obtain the individual token IDs by prompting the encoder with one string at a time:
print(tokenizer.encode("Ak"))
print(tokenizer.encode("w"))
# ...
This prints:
[33901]
[86]
# ...
You can then use the following code to assemble the original string:
print(tokenizer.decode([33901, 86, 343, 86, 220, 959]))
This returns:
'Akwirw ier'
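To make the round trip concrete without the real tokenizer, here is a minimal sketch that uses a hand-written lookup table in place of the vocabulary. The entries for "Ak" and "w" come from the output above; the split of the remaining IDs into "ir", " ", and "ier" is an assumption inferred from the decoded string, and `toy_vocab` and `toy_decode` are hypothetical names, not part of the tokenizer's API.

```python
# Toy stand-in for the tokenizer's vocabulary (ID -> string).
# 33901 and 86 are taken from the printed output above; the
# entries for 343, 220, and 959 are ASSUMED from the decoded string.
toy_vocab = {
    33901: "Ak",
    86: "w",
    343: "ir",
    220: " ",
    959: "ier",
}


def toy_decode(token_ids):
    """Concatenate the string piece behind each token ID."""
    return "".join(toy_vocab[token_id] for token_id in token_ids)


decoded = toy_decode([33901, 86, 343, 86, 220, 959])
print(decoded)
```

The sketch only illustrates that decoding amounts to concatenating the pieces behind each ID, which is why an unknown word can be reassembled from subword tokens.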
The code for the data loader with max_length=2 and stride=2:
dataloader = create_dataloader(
    raw_text, batch_size=4, max_length=2, stride=2
)
It produces ...
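To see how max_length and stride interact, the following pure-Python sketch may help. It assumes the data loader slides a window of max_length tokens over the token IDs, moves it by stride positions each step, and uses the window shifted by one token as the target. The helper `sliding_window_pairs` and the placeholder IDs are hypothetical; this is not the implementation of create_dataloader.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one."""
    pairs = []
    # Stop early enough that every target chunk has max_length tokens.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Placeholder token IDs, chosen only for illustration.
token_ids = list(range(10, 20))

pairs = sliding_window_pairs(token_ids, max_length=2, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With stride equal to max_length, consecutive input chunks do not overlap; a smaller stride would make them overlap, and a larger one would skip tokens.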