Natural Language Processing with Java - Second Edition

by Richard M. Reese, AshishSingh Bhatia
July 2018
Content level: Beginner to intermediate
318 pages
7h 49m
English
Packt Publishing
Content preview from Natural Language Processing with Java - Second Edition

Training a tokenizer to find parts of text

Training a tokenizer is useful when we encounter text that is not handled well by standard tokenizers. Instead of writing a custom tokenizer, we can create a tokenizer model that can be used to perform the tokenization.

To demonstrate how such a model can be created, we will read training data from a file and then train a model using this data. The data is stored as a series of words separated by whitespace and <SPLIT> fields. The <SPLIT> fields provide further information about how tokens should be identified. They help identify breaks between numbers, such as 23.6, and punctuation characters, such as commas. The training data we will use is stored in the training-data.train file, ...
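To make the role of the <SPLIT> fields concrete, here is a minimal sketch in plain Java that recovers the tokens a single training line describes. It is not the model training itself. The sample line is invented for illustration, since the contents of training-data.train are not shown in this preview, and the class and method names (SplitMarkerDemo, tokensOf) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitMarkerDemo {
    // The marker named in the text; whitespace is the other token separator.
    static final String SPLIT = "<SPLIT>";

    // Returns the tokens that one training line describes:
    // break on whitespace first, then on each <SPLIT> marker.
    static List<String> tokensOf(String line) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : line.trim().split("\\s+")) {
            for (String piece : chunk.split(Pattern.quote(SPLIT))) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical training line (an assumption, not taken from the book's file).
        String line = "The value is 23.6<SPLIT>, not 24<SPLIT>.";
        for (String token : tokensOf(line)) {
            System.out.println(token);
        }
    }
}
```

In this assumed line, the marker separates 23.6 from the comma that follows it, while the period inside 23.6 stays part of the number. That is the kind of distinction a trained model is expected to learn from such data.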

Publisher Resources

ISBN: 9781788993494