Representing text data
While our aim is to predict the next word in a given sentence, or (ideally) predict a series of words that make sense and conform to some measure of English syntax/grammar, we will actually be encoding our data at the character level. This means that we need to take our text data (in this example, the collected works of William Shakespeare) and generate a sequence of tokens. These tokens might be whole sentences, individual words, or even characters themselves, depending on what type of model we are training.
Once we've tokenized out text data, we need to turn these tokens into some kind of numeric representation that's amenable to computation. As we've discussed, in our case, these representations are tensors. These ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access