Chapter 2. Foundations of taming text

In this chapter

  • Understanding text processing building blocks like tokenizing, chunking, parsing, and part-of-speech tagging
  • Extracting text from common file formats using the Apache Tika open source project

Naturally, before we can get started with the hard-core text-taming processes, we need a little warm-up. We’ll start by laying the groundwork with a short high school English refresher, in which we’ll delve into topics such as tokenization, stemming, parts of speech, and phrases and clauses. Each of these steps can play an important role in the quality of results you’ll see when building applications that use text. For instance, the seemingly simple act of splitting up words, especially in languages ...
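
To get a feel for why splitting up words is less simple than it sounds, here is a minimal sketch in Python. The language, the function names (whitespace_tokens, regex_tokens), and the sample sentence are all our own assumptions for illustration; this is not the tokenizer of any particular library, just one plausible way to contrast two naive strategies.

```python
import re


def whitespace_tokens(text):
    """Split on runs of whitespace; punctuation stays glued to words."""
    return text.split()


def regex_tokens(text):
    """Hypothetical rule: a token is either a word (optionally with internal
    apostrophes or hyphens) or a single punctuation character."""
    return re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", text)


sentence = "Don't panic: text-taming isn't hard."

print(whitespace_tokens(sentence))
print(regex_tokens(sentence))
```

The whitespace version leaves "panic:" and "hard." as single tokens, while the regular expression version separates the punctuation but keeps "Don't" and "text-taming" intact. Neither choice is right in every application, which is part of why tokenization deserves attention.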
