Tokenization

Tokenization is the process of splitting a sentence into smaller units, called tokens, such as characters or words. Libraries such as spaCy offer more sophisticated tokenizers. Here, let's use two simple Python built-ins, the string method split and the list constructor, to convert the text into tokens.

To demonstrate how tokenization works on characters and words, let's consider a short review of the movie Thor: Ragnarok. We will work with the following text:

The action scenes were top notch in this movie. Thor has never been this epic in the MCU. He does some pretty epic sh*t in this movie and he is definitely not under-powered anymore. Thor in unleashed in this, I love that.
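
As a minimal sketch of the two approaches just described, the snippet below applies list and split to the review text. The variable names (review, char_tokens, word_tokens) are chosen here for illustration and are not taken from the original text.

```python
# Sample review text from the passage above, stored as a single string.
review = (
    "The action scenes were top notch in this movie. Thor has never been this "
    "epic in the MCU. He does some pretty epic sh*t in this movie and he is "
    "definitely not under-powered anymore. Thor in unleashed in this, I love that."
)

# Character-level tokens: list() iterates over the string one character at a time,
# so spaces and punctuation become tokens too.
char_tokens = list(review)

# Word-level tokens: str.split() with no arguments splits on runs of whitespace.
word_tokens = review.split()

print(char_tokens[:10])
# ['T', 'h', 'e', ' ', 'a', 'c', 't', 'i', 'o', 'n']

print(word_tokens[:10])
# ['The', 'action', 'scenes', 'were', 'top', 'notch', 'in', 'this', 'movie.', 'Thor']
```

Note that split only separates on whitespace, so punctuation stays attached to the neighboring word (for example, 'movie.'). Handling cases like this is one reason to reach for a dedicated tokenizer such as the one in spaCy.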
