Machine Learning with Swift by Alexander Sosnovshchenko

Tokenization

Tokens in linguistics are different from the authorization tokens we're used to. They are linguistic units: words are tokens, numbers and punctuation marks are tokens, and sentences are tokens. In other words, they are discrete pieces of information or meaning. Tokenization is the process of splitting text into lexical tokens. Sentence tokenizers split a text into sentences, and word tokenizers split it further into separate words, punctuation marks, and so on. This task may seem simple (there's a regexp for that!), but the impression is deceptive. Here are a few problems to consider (a short tokenizer sketch follows the list):

  • How to tokenize words with a hyphen or an apostrophe, for example, New York-based or you're?
  • How to tokenize web addresses and emails, for example,  ...
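To see why a naive whitespace split runs into exactly these problems, here is a minimal word-tokenization sketch in Swift. It assumes Apple's NaturalLanguage framework (NLTokenizer, available since iOS 12/macOS 10.14); the sample string is only an illustration, not an example from the book:

import NaturalLanguage

// Sample text with a contraction, a hyphenated compound, and an email address.
let text = "You're visiting a New York-based company. Write to hello@example.com!"

// Word-level tokenizer; NLTokenizer(unit: .sentence) would split into sentences instead.
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text

var tokens: [String] = []
tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
    tokens.append(String(text[range]))
    return true    // keep enumerating
}
print(tokens)

// Compare with a naive split on spaces, which glues punctuation to words.
print(text.split(separator: " ").map { String($0) })

The two outputs differ precisely on the problem cases above: the linguistic tokenizer makes its own decisions about the contraction, the hyphenated compound, the email address, and the punctuation, while the whitespace split does not. The exact tokenization may vary with the framework version.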
