Chapter 2. Foundations of taming text
In this chapter
- Understanding text-processing building blocks such as tokenizing, chunking, parsing, and part-of-speech tagging
- Extracting text from common file formats using the Apache Tika open source project
Before we can get started with the hard-core text-taming processes, we need a little warm-up. We’ll lay the groundwork with a short high school English refresher covering topics such as tokenization, stemming, parts of speech, and phrases and clauses. Each of these steps can play an important role in the quality of the results you’ll see when building applications that use text. For instance, the seemingly simple act of splitting up words, especially in languages ...