Tokenizing

Tokenizing is the act of transforming an input string, such as a sentence, paragraph, or even an object such as an email, into individual tokens. A very simple tokenizer might take a sentence or paragraph and split it by spaces, generating tokens that are individual words. However, tokens need not be words: not every word in the input string needs to be returned by the tokenizer, a token generated by the tokenizer need not appear in the original text, and a single token can represent more than one word. We therefore use the term token rather than word to describe the output of a tokenizer.
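The very simple space-splitting tokenizer described above can be sketched in a few lines of JavaScript. This is only a minimal illustration of that naive approach, and the function name `tokenize` is our own choice, not an established API:

```javascript
// A naive tokenizer: split the input on runs of whitespace and
// drop any empty strings produced by leading/trailing spaces.
function tokenize(text) {
  return text
    .trim()
    .split(/\s+/)
    .filter(token => token.length > 0);
}

console.log(tokenize("  the quick  brown fox "));
// → [ 'the', 'quick', 'brown', 'fox' ]
```

Note that even this sketch already makes a design decision: punctuation stays attached to words ("fox." would be a single token), which is one of the reasons real tokenizers are more sophisticated than a simple split.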

The manner in which you tokenize text before processing it with an ML ...
