Chapter 5. Content extraction

This chapter covers

  • Full-text extraction
  • Working with the Parser interface
  • Reading data from a stream
  • Exporting in XHTML format

Armed with Tika, you can be confident of knowing each document’s pedigree, so sorting and organizing documents will be a snap. But what do you plan on doing with those documents once they’re organized?

Interactively, you’d likely pull the documents into your favorite editing application and start reading and updating their text. Programmatically, you’re likely to do much the same thing. Once you know each document’s type and which applications are associated with it (as we showed you in chapter 4), you can make sure you’re using the right ...

