Summary
In this chapter, we illustrated various approaches to tokenizing and normalizing text. We started with simple tokenization techniques based on core Java classes, such as the String class's split method and the StringTokenizer class. These approaches can be useful when we decide to forgo the use of NLP API classes.
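As a brief reminder of the core Java approaches, the following is a minimal sketch rather than a listing from the chapter. The class name SimpleTokenizers, its method names, and the sample sentence are assumptions made for illustration. Both methods split on whitespace only, so punctuation stays attached to the neighboring word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, named here for illustration only.
public class SimpleTokenizers {

    // Tokenize with String.split, treating runs of whitespace as one delimiter.
    static List<String> splitTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // split would otherwise return a single empty string
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Tokenize with StringTokenizer, using its default whitespace delimiters.
    static List<String> tokenizerTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample sentence chosen for this sketch
        String text = "Let's pause, and then reflect.";
        System.out.println(splitTokens(text));
        System.out.println(tokenizerTokens(text));
    }
}
```

Neither method separates "pause," into a word and a comma, which is one reason a dedicated NLP tokenizer may be preferable for anything beyond simple input.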
We demonstrated how tokenization can be performed using the OpenNLP, Stanford, and LingPipe APIs. We found that these APIs vary in how tokenization is performed and in the options that can be applied. A brief comparison of their outputs was provided.
Normalization was also discussed; it can involve converting characters to lowercase, expanding abbreviations, removing stopwords, stemming, and lemmatization. ...
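Two of the normalization steps named above, lowercasing and stopword removal, can be sketched with core Java alone. This is an illustrative sketch under stated assumptions, not the chapter's implementation: the class name SimpleNormalizer and the tiny stopword list are hypothetical. Stemming and lemmatization are omitted because they generally rely on an NLP library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, named here for illustration only.
public class SimpleNormalizer {

    // A deliberately tiny stopword list assumed for this sketch;
    // real lists are much longer.
    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "and", "of", "to", "in"));

    // Lowercase each token, then drop it if it is a stopword.
    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            // Locale.ROOT avoids locale-specific case mappings
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(lower)) {
                result.add(lower);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Cat", "and", "the", "Hat");
        System.out.println(normalize(tokens));
    }
}
```

Lowercasing before the stopword lookup lets the list hold only lowercase entries while still matching "The" and "the" alike.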