Lucene in Action, Second Edition

Chapter 7. Extracting text with Tika

This chapter covers

Understanding Tika’s logical design
Using Tika’s built-in tool and APIs for text extraction
Parsing XML
Handling known Tika limitations

One of the more mundane yet vital steps when building a search application is extracting text from the documents you need to index. You might be lucky enough to have an application whose content is already in textual format or whose documents always share the same format, such as XML files or regular rows in a database. If you’re unlucky, you must instead accept the surprisingly wide variety of document formats that are popular today, such as Outlook, Word, Excel, PowerPoint, Visio, Flash, PDF, Open Office, Rich Text Format (RTF), and even archive file formats ...

Lucene in Action, Second Edition by Erik Hatcher, Michael McCandless, Otis Gospodnetic
