February 2018
Intermediate to advanced
378 pages
10h 14m
English
Tokens in linguistics are different from the authorization tokens we're used to. They are linguistic units: words are tokens, numbers and punctuation marks are tokens, and even whole sentences can be treated as tokens. In other words, they are discrete pieces of information or meaning. Tokenization is the process of splitting text into lexical tokens. Sentence tokenizers split texts into sentences, and word tokenizers split them further into separate words, punctuation marks, and so on. This task may seem simple (there's a regexp for that!), but that impression is deceptive. Here are a few problems to consider:
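For instance, a naive regex-based word tokenizer quickly runs into trouble with contractions. The sketch below (a minimal illustration, not code from the book; the function name and regex are my own) splits text into runs of word characters and single punctuation marks:

```python
import re

def naive_word_tokenize(text):
    # A run of word characters, or any single non-space, non-word character.
    return re.findall(r"\w+|[^\w\s]", text)

# Works for simple input:
print(naive_word_tokenize("Hello, world!"))
# But mangles contractions: "Don't" becomes three separate tokens.
print(naive_word_tokenize("Don't panic!"))
```

The second call yields `['Don', "'", 't', 'panic', '!']`, whereas a linguistically sensible tokenization would keep `Don't` together (or split it into `Do` and `n't`). Abbreviations, URLs, and hyphenated words cause similar headaches.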