Tokenizing Text

We end this chapter with an extended (and more complex) example in three parts. Example 2-8 is a listing of Tokenizer.java, which defines the Tokenizer interface, an API for tokenizing text. Tokenizing simply means breaking text into chunks, or tokens; tokenizers are also known as lexers or scanners, and are commonly used when writing parsers. The Tokenizer interface is intended to provide an alternative to java.util.StringTokenizer, which is too simple for many uses, and java.io.StreamTokenizer, which is complex and poorly documented.
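To make "too simple for many uses" concrete, here is a small sketch of what java.util.StringTokenizer does with its default settings. The class name StringTokenizerDemo, the helper method split, and the sample strings are made up for this illustration and are not part of the book's examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative only: class name, method name, and inputs are assumptions.
public class StringTokenizerDemo {
    // Collect every token that StringTokenizer produces for the given text.
    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        // With this constructor the delimiters are whitespace characters.
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Whitespace-separated input splits as hoped.
        System.out.println(split("area = 3.14 * r"));  // [area, =, 3.14, *, r]
        // Without spaces, the whole expression comes back as a single token.
        System.out.println(split("area=3.14*r"));      // [area=3.14*r]
    }
}
```

StringTokenizer splits only on delimiter characters and returns plain strings, so it cannot tell a caller whether a token is a word, a number, or punctuation. That kind of limitation is what motivates a richer tokenizing API.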

As an interface, Tokenizer doesn’t do anything itself. But Example 2-8 is followed by an implementation in Examples 2-9 and 2-10. Following a pattern that you’ll also see frequently in Java platform APIs, the implementation is broken into two classes: AbstractTokenizer, an abstract class that implements Tokenizer and defines its methods in terms of a small number of abstract methods, and CharSequenceTokenizer, a concrete subclass for tokenizing String and StringBuffer (or any CharSequence) objects. To demonstrate the flexibility of this implementation scheme, we’ll see other Tokenizer implementations based on AbstractTokenizer throughout this book. ReaderTokenizer (for tokenizing character streams) is defined in Example 3-7, ChannelTokenizer (for tokenizing text read from high-performance “channels” of the New I/O API) is defined in Example 6-8, and MappedFileTokenizer (for tokenizing memory-mapped files) is defined in Example ...
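The interface, abstract class, and concrete subclass pattern can be sketched in miniature as follows. This is not the code of Examples 2-8 through 2-10: the names SimpleTokenizer, AbstractSimpleTokenizer, CharSequenceSimpleTokenizer, and TokenizerSketch are hypothetical, and the interface is reduced to two methods that split on whitespace only. The point is just the shape: the abstract class supplies all of the tokenizing logic and relies on two small abstract methods for access to the underlying text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical, simplified interface; NOT the Tokenizer API of Example 2-8.
interface SimpleTokenizer {
    boolean hasNext();
    String next();
}

// Implements the interface in terms of two abstract primitives.
abstract class AbstractSimpleTokenizer implements SimpleTokenizer {
    private int pos = 0;  // index of the next unread character

    // Subclasses only have to say how long the text is and how to read one char.
    protected abstract int length();
    protected abstract char charAt(int index);

    private void skipWhitespace() {
        while (pos < length() && Character.isWhitespace(charAt(pos))) pos++;
    }

    public boolean hasNext() {
        skipWhitespace();
        return pos < length();
    }

    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        StringBuilder token = new StringBuilder();
        while (pos < length() && !Character.isWhitespace(charAt(pos))) {
            token.append(charAt(pos++));
        }
        return token.toString();
    }
}

// Concrete subclass: works for String, StringBuffer, or any other CharSequence.
class CharSequenceSimpleTokenizer extends AbstractSimpleTokenizer {
    private final CharSequence text;

    CharSequenceSimpleTokenizer(CharSequence text) { this.text = text; }

    protected int length() { return text.length(); }
    protected char charAt(int index) { return text.charAt(index); }
}

public class TokenizerSketch {
    // Drain a tokenizer over the given text into a list.
    static List<String> tokens(CharSequence text) {
        SimpleTokenizer t = new CharSequenceSimpleTokenizer(text);
        List<String> out = new ArrayList<>();
        while (t.hasNext()) out.add(t.next());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("breaking text into chunks"));
        System.out.println(tokens(new StringBuffer("  any\tCharSequence works ")));
    }
}
```

In this sketch, supporting a new text source means writing only length() and charAt(); the tokenizing logic in the abstract class is reused unchanged. A stream-based source would need different primitives, so treat this only as an illustration of the division of labor, not of the real AbstractTokenizer design.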
