Chapter 11. Extending Tika

 

This chapter covers

 

There are thousands of document formats in the world, and new ones are constantly being introduced, so it’s impossible for a library like Tika to support all of them out of the box. Thus, even though each Tika version adds support for new formats, there will be times when Tika won’t be able to extract content from, or even detect the type of, a document you’re trying to use. This chapter is about what you can do in such a situation.

Imagine that you’re working with a new XML-based file format for medical prescriptions. Each file describes a single prescription and consists of a set of both fixed and free-form fields ...
