How it works...
To build any language model, we need to clean the input text and break it into tokens. Here, tokens are individual words, and the process of breaking text into words is called tokenization. By default, the Keras tokenizer splits the corpus into a list of tokens (using " " as the separator between words), removes all punctuation, converts the words to lowercase, and builds an internal vocabulary from the input text. The vocabulary generated by the tokenizer is an indexed list in which words are ranked by their overall frequency in the dataset, so the most frequent word receives the lowest index. In this recipe, we saw that in the nursery rhyme, "and" is the most frequent word, while "up" is the fifth most frequent word. There are 37 unique words in total.
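The behavior described above can be approximated in a few lines of plain Python. This is only a sketch of the idea, not the Keras implementation: the helper names `tokenize` and `build_word_index`, the `FILTERS` string, and the sample sentences are all assumptions made for illustration, and the sample text is not the nursery rhyme used in the recipe.

```python
from collections import Counter

# Assumed set of characters to strip; modeled on typical punctuation filters.
# The apostrophe is deliberately left out, so contractions stay intact.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def tokenize(text):
    """Lowercase the text, replace filtered characters with spaces, split on spaces."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return [tok for tok in text.lower().translate(table).split(" ") if tok]


def build_word_index(texts):
    """Map each word to a 1-based rank, with the most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


# Hypothetical sample corpus, for illustration only.
sample = ["The cat sat on the mat.", "The dog sat, and the cat ran!"]
word_index = build_word_index(sample)
print(word_index)
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'and': 7, 'ran': 8}
```

In this sample, "the" appears four times and therefore gets index 1, and the vocabulary contains eight unique words. The same reasoning explains why "and" is ranked first in the recipe's nursery rhyme.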
In step ...