Tokenization
Tokenization is the process of breaking documents down into sentences and sentences down into words. This matters because processing an entire document as one long string is computationally expensive; splitting it into smaller units keeps memory and processing requirements manageable.
Furthermore, it is rarely necessary to read every sentence at once to understand a document. Each sentence usually carries its own discrete meaning, and statistical methods can combine those sentence-level meanings to determine the overall meaning and content of the document, as sketched below.
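As a minimal sketch of sentence-level tokenization, the following uses NLTK's sent_tokenize (this assumes NLTK is installed and its punkt sentence model has been downloaded; the library choice is an illustration, not the only option):

```python
# Sentence tokenization sketch using NLTK (assumes `pip install nltk`).
import nltk

nltk.download("punkt", quiet=True)  # one-time download of the sentence model

document = (
    "Tokenization splits documents into sentences. "
    "Each sentence can then be processed on its own. "
    "This keeps memory and computation manageable."
)

# Split the document into a list of sentence strings.
sentences = nltk.sent_tokenize(document)
for i, sentence in enumerate(sentences, start=1):
    print(i, sentence)
```

Each sentence can then be scored or analyzed independently and the results aggregated across the document.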
Similarly, we often need to break sentences down into words in order to process them more effectively, ...
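A word-level tokenization sketch follows, again using NLTK's word_tokenize as one possible tool (the same punkt model assumption applies):

```python
# Word tokenization sketch using NLTK (assumes `pip install nltk`).
import nltk

nltk.download("punkt", quiet=True)  # word_tokenize also relies on this model

sentence = "Tokenization breaks a sentence into individual words."

# Split the sentence into a list of word (and punctuation) tokens.
words = nltk.word_tokenize(sentence)
print(words)
# e.g. ['Tokenization', 'breaks', 'a', 'sentence', 'into', 'individual', 'words', '.']
```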