Tokenization is the act of splitting text into words. Splitting on whitespace seems very easy, but it isn't, because text contains punctuation and contractions. Let's start with an example:
In:
my_text = "The coolest job in the next 10 years will be " + \
          "statisticians. People think I'm joking, but " + \
          "who would've guessed that computer engineers " + \
          "would've been the coolest job of the 1990s?"
simple_tokens = my_text.split(' ')
print(simple_tokens)

Out:
['The', 'coolest', 'job', 'in', 'the', 'next', '10', 'years', 'will', 'be', 'statisticians.', 'People', 'think', "I'm", 'joking,', 'but', 'who', "would've", 'guessed', 'that', 'computer', 'engineers', "would've", 'been', 'the', 'coolest', 'job', 'of', 'the', '1990s?']
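Notice the problems: 'statisticians.' kept its trailing period, 'joking,' kept its comma, and '1990s?' kept the question mark, so the same word would end up counted as several distinct tokens. A real tokenizer handles punctuation and contractions for us. As a minimal sketch (assuming the NLTK package is installed and its punkt tokenizer models have been downloaded; NLTK is our choice here, not the only option), we can use nltk.word_tokenize:

In:
import nltk

nltk.download('punkt')  # one-time download of the tokenizer models
nltk_tokens = nltk.word_tokenize(my_text)
print(nltk_tokens)

With this tokenizer, punctuation should come out as its own token ('statisticians', '.') and contractions should be split into their parts ('I', "'m"; 'would', "'ve"), so identical words map to identical tokens.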