Tika in Action by Jukka Zitting, Chris Mattmann

Chapter 5. Content extraction

 

This chapter covers

  • Full-text extraction
  • Working with the Parser interface
  • Reading data from a stream
  • Exporting in XHTML format

 

Armed with Tika, you can be confident of knowing each document’s pedigree, so sorting and organizing documents will be a snap. But what do you plan on doing with those documents once they’re organized?

Interactively, you’d likely pull the documents into your favorite editing application and start reading and updating their internal text. Programmatically, you’re more than likely to do the same thing. Once you know what’s what in terms of document types and which applications are associated with them (as we showed you in chapter 4), you can make sure you’re using the right ...
