
Natural Language Processing with Java by Richard M Reese

NLP tokenizer APIs

In this section, we demonstrate several tokenization techniques using the OpenNLP, Stanford, and LingPipe APIs. Other APIs are available, but we restrict the demonstration to these three. The examples give you an idea of the techniques each one offers.

We will use a string called paragraph to illustrate these techniques. The string includes a line break, because line breaks can occur in unexpected places in real text. It is defined here:

private String paragraph = "Let's pause, \nand then " + "reflect.";
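As a baseline before turning to the library tokenizers, the following minimal sketch splits the same text on whitespace using only the JDK. The class name WhitespaceSplitDemo and the helper splitOnWhitespace are illustrative names chosen for this example; they are not part of OpenNLP, Stanford, or LingPipe.

```java
class WhitespaceSplitDemo {
    // Same sample text as in the section; note the embedded line break.
    static final String PARAGRAPH = "Let's pause, \nand then " + "reflect.";

    // Hypothetical helper: split on runs of whitespace.
    // The regex \s+ treats spaces and the newline alike.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Print one token per line.
        for (String token : splitOnWhitespace(PARAGRAPH)) {
            System.out.println(token);
        }
    }
}
```

With this input the split yields five tokens: Let's, pause, (with its comma), and, then, reflect. (with its period). The line break is absorbed as ordinary whitespace, but punctuation stays attached to the neighboring word, which is one reason a dedicated tokenizer is usually preferable.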

Using the OpenNLP Tokenizer interface

OpenNLP provides a Tokenizer interface that is implemented by three classes: SimpleTokenizer, TokenizerME, and WhitespaceTokenizer. This interface supports two methods: ...
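To give a feel for how a rule-based tokenizer differs from a plain whitespace split, here is a rough plain-JDK sketch of character-class tokenization, the general idea behind a tokenizer such as SimpleTokenizer. It is an approximation written for illustration and is not OpenNLP's implementation. The names CharClassTokenizerSketch, Kind, kindOf, and tokenize are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CharClassTokenizerSketch {
    // Coarse character classes assumed for this sketch.
    enum Kind { WHITESPACE, LETTER, DIGIT, OTHER }

    static Kind kindOf(char c) {
        if (Character.isWhitespace(c)) return Kind.WHITESPACE;
        if (Character.isLetter(c)) return Kind.LETTER;
        if (Character.isDigit(c)) return Kind.DIGIT;
        return Kind.OTHER;
    }

    // Start a new token whenever the character class changes;
    // whitespace separates tokens and is never emitted.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;                 // start of the current token, -1 if none
        Kind current = Kind.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            Kind kind = kindOf(text.charAt(i));
            if (kind != current) {
                if (start >= 0) tokens.add(text.substring(start, i));
                start = (kind == Kind.WHITESPACE) ? -1 : i;
                current = kind;
            }
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        String paragraph = "Let's pause, \nand then " + "reflect.";
        for (String token : tokenize(paragraph)) {
            System.out.println(token);
        }
    }
}
```

On the sample paragraph this sketch produces nine tokens: Let, ', s, pause, a comma, and, then, reflect, and a period. Punctuation is separated from words, and the apostrophe splits the contraction. A statistical tokenizer may handle contractions differently.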
