Tokenization

Tokens in linguistics are different from the authorization tokens we're used to. They are linguistic units: words are tokens, numbers and punctuation marks are tokens, and sentences are tokens too. In other words, they are discrete pieces of information or meaning. Tokenization is the process of splitting text into such lexical tokens. Sentence tokenizers split a text into sentences, and word tokenizers split it further into separate words, punctuation marks, and so on. This task may seem simple (there is a regexp for that!), but the impression is deceptive. Here are a few problems to consider, followed by a short tokenization sketch:

  • How to tokenize words with a hyphen or an apostrophe, for example, New York-based or you're?
  • How to tokenize web addresses and emails, for example,  ...

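As a minimal sketch of the two tokenization levels described above, the following Swift snippet uses Apple's NaturalLanguage framework (available since iOS 12/macOS 10.14); this is an illustration of sentence and word tokenization in general, not necessarily the approach taken later in the book, and the sample text is a hypothetical example chosen to exercise the hyphen and apostrophe cases from the list:

```swift
import NaturalLanguage

// Sample text containing a hyphenated compound and a contraction,
// two of the tricky cases listed above.
let text = "She works for a New York-based startup. You're invited to visit!"

// Sentence-level tokenization: split the text into sentences.
let sentenceTokenizer = NLTokenizer(unit: .sentence)
sentenceTokenizer.string = text
sentenceTokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
    print("Sentence:", text[range])
    return true
}

// Word-level tokenization: split further into words and similar units.
// Note how the tokenizer decides where "New York-based" and "You're" break apart.
let wordTokenizer = NLTokenizer(unit: .word)
wordTokenizer.string = text
wordTokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
    print("Word:", text[range])
    return true
}
```

Running the word-level pass makes the ambiguity concrete: whether the output contains one token or several for a hyphenated name or a contraction depends on the tokenizer's rules, which is exactly why a single regular expression rarely suffices.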