Tokenization

Tokenization is the process of splitting a piece of text, such as a sentence, into smaller units called tokens, typically characters or words. Libraries such as spaCy offer more sophisticated approaches to tokenization. Here we will use Python's built-in list function and the string method split to convert the text into tokens.

To demonstrate how tokenization works at the character level and at the word level, let's consider a short review of the movie Thor: Ragnarok. We will work with the following text:

The action scenes were top notch in this movie. Thor has never been this epic in the MCU. He does some pretty epic sh*t in this movie and he is definitely not under-powered anymore. Thor is unleashed in this, I love that.
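
As a minimal sketch of the two approaches, the snippet below applies list and split to the first two sentences of the review. The variable names review, char_tokens, and word_tokens are illustrative choices, not names taken from the book.

```python
# First two sentences of the review quoted above (shortened for brevity).
review = ("The action scenes were top notch in this movie. "
          "Thor has never been this epic in the MCU.")

# Character-level tokens: list() yields one token per character,
# including spaces and punctuation.
char_tokens = list(review)

# Word-level tokens: split() with no arguments splits on runs of whitespace.
# Punctuation stays attached to the neighbouring word (for example 'movie.').
word_tokens = review.split()

print(char_tokens[:10])
print(word_tokens[:6])
```

Note that split only separates on whitespace, so a token such as 'movie.' keeps its trailing period. Handling punctuation, contractions, and similar cases is one reason to reach for a dedicated tokenizer such as the one in spaCy.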
