Tokens in linguistics are different from the authorization tokens we're used to. They are linguistic units: words are tokens, numbers and punctuation marks are tokens, and even whole sentences can be treated as tokens. In other words, they are discrete pieces of information or meaning. Tokenization is the process of splitting text into lexical tokens. Sentence tokenizers split a text into sentences, and word tokenizers split it further into separate words, punctuation marks, and so on. This task may seem simple (there is a regexp for that!), but that impression is deceptive. Here are a few problems to consider (a short sketch after the list shows how a common tokenizer handles some of them):
- How to tokenize words with a hyphen or an apostrophe, for example, New York-based or you're?
- How to tokenize web addresses and emails, for example, ...
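To make this concrete, here is a minimal sketch using NLTK's sentence and word tokenizers; the choice of NLTK, the sample text, and the downloaded resource names are assumptions for illustration, not part of the original discussion. It runs both tokenizers over a short text that contains a hyphenated name and a contraction.

```python
# A minimal sketch, assuming NLTK is installed (pip install nltk).
# The sample text and the downloaded resources are illustrative.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# The sentence tokenizer needs a pretrained model; the resource name
# differs across NLTK versions, so we request both quietly.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

text = "I met a New York-based analyst. You're welcome to join us!"

# Sentence tokenization: split the text into sentences.
sentences = sent_tokenize(text)
print(sentences)

# Word tokenization: split each sentence into words and punctuation marks.
# Note how the hyphenated name and the contraction are handled.
for sentence in sentences:
    print(word_tokenize(sentence))
```

Running a script like this on your own data is a quick way to see which of the edge cases above a given tokenizer handles well and where custom rules may be needed.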