Chapter 2. Finding Parts of Text
Finding parts of text is concerned with breaking text down into individual units called tokens, and optionally performing additional processing on those tokens, such as stemming, lemmatization, stopword removal, synonym expansion, and conversion to lowercase.
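To make these steps concrete, here is a minimal sketch in plain Java of tokenization combined with lowercasing and stopword removal. The class name, the regular expression, and the tiny stopword set are all illustrative choices, not a prescribed implementation; real stopword lists are much larger.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SimpleTokenizer {
    // A deliberately small stopword set, for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "to");

    public static List<String> tokenize(String text) {
        return Arrays.stream(text.split("\\W+"))       // split on runs of non-word characters
                .filter(t -> !t.isEmpty())             // drop empty fragments
                .map(String::toLowerCase)              // normalize case
                .filter(t -> !STOP_WORDS.contains(t))  // remove stopwords
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The quick brown fox jumps over the lazy dog."));
        // [quick, brown, fox, jumps, over, lazy, dog]
    }
}
```

Stemming, lemmatization, and synonym expansion require linguistic resources and are better handled by the NLP libraries discussed later; the point here is only that tokenization and normalization compose as a simple pipeline.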
We will demonstrate several tokenization techniques found in the standard Java distribution. These are included because sometimes they are all you need for the job, making it unnecessary to import an NLP library. However, these techniques are limited. This is followed by a discussion of specific tokenizers or tokenization approaches supported by NLP APIs. These examples will provide a reference for how ...
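The standard-library techniques in question include `String.split`, `java.util.StringTokenizer`, and `java.text.BreakIterator`. The sketch below compares them on one sentence; the class and method names are illustrative, and the example sentence is arbitrary.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class CoreJavaTokenization {

    // 1. String.split with a regular expression: simple and often sufficient.
    static List<String> bySplit(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    // 2. java.util.StringTokenizer: a legacy class that splits on whitespace by default.
    static List<String> byStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // 3. java.text.BreakIterator: locale-aware word boundaries.
    static List<String> byBreakIterator(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance();
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = text.substring(start, end).trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went to Washington.";
        System.out.println(bySplit(text));             // punctuation stays attached
        System.out.println(byStringTokenizer(text));   // same whitespace-based behavior
        System.out.println(byBreakIterator(text));     // punctuation becomes separate tokens
    }
}
```

The differences hint at the limitations mentioned above: whitespace-based splitting leaves punctuation attached to words (`"Washington."`), while `BreakIterator` separates it but has no notion of sentence abbreviations, named entities, or other constructs that dedicated NLP tokenizers handle.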