Training a tokenizer to find parts of text

Training a tokenizer is useful when we encounter text that is not handled well by standard tokenizers. Instead of writing a custom tokenizer, we can train a tokenizer model and then use that model to perform the tokenization.

To demonstrate how such a model is created, we will read training data from a file and then train a model using this data. The data is stored as a series of words separated by whitespace, with <SPLIT> fields inserted to provide further information about where token boundaries occur. These markers help identify breaks involving numbers, such as 23.6, and punctuation characters, such as commas. The training data we will use is stored in the training-data.train file, ...
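
As a rough orientation, the training step can look like the following minimal sketch, assuming OpenNLP's TokenizerME, TokenSampleStream, and TokenizerFactory APIs (1.8 or later). The class name TrainTokenizer and the sample sentence passed to the trained tokenizer are illustrative only, not taken from the book's code.

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.InputStreamFactory;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainTokenizer {
        public static void main(String[] args) throws Exception {
            // Read the training file one line at a time; each line holds
            // whitespace-separated words with <SPLIT> markers at the extra
            // token boundaries described above
            InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("training-data.train"));
            try (ObjectStream<String> lines =
                     new PlainTextByLineStream(in, StandardCharsets.UTF_8);
                 ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {

                // Train a tokenizer model for English from the samples
                TokenizerModel model = TokenizerME.train(
                    samples,
                    new TokenizerFactory("en", null, true, null),
                    TrainingParameters.defaultParams());

                // Use the trained model to tokenize new text (sample sentence is illustrative)
                TokenizerME tokenizer = new TokenizerME(model);
                for (String token : tokenizer.tokenize("The price rose to 23.6, then fell.")) {
                    System.out.println(token);
                }
            }
        }
    }

Here, TokenSampleStream parses each line of the file, treating whitespace and <SPLIT> markers as token boundaries, and TokenizerME.train builds a model from those samples. The resulting TokenizerModel can be serialized to disk and reloaded later, so training only needs to happen once.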
