Scaling Apache Solr by Hrishikesh Vijay Karambelkar

Working with rich documents

In the last section, we saw that Apache Solr has built-in handlers for the CSV, JSON, and XML formats. In an organization's content management system, however, a data item may reside in documents of other formats, such as PDF, DOC, PPT, and XLS. The biggest challenge with these types is that they are all semi-structured. Apache Solr can handle many of these formats directly and extract information from them, thanks to Apache Tika. Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers, such as Apache PDFBox and Apache POI, into Solr itself.

Note

The framework to extract content ...
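
To make the idea concrete, here is a minimal Python sketch of sending a rich document to Solr for extraction. It assumes that Solr's extracting request handler is registered at `/update/extract` in `solrconfig.xml`, which may differ in your setup. The host, the core name `collection1`, the document ID, and the helper names `build_extract_url` and `index_rich_document` are illustrative assumptions, not taken from the text above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, core, doc_id, commit=True):
    """Build the URL used to post one rich document for extraction.

    Assumes the extracting handler is mapped to /update/extract.
    literal.id supplies the unique key for the indexed document.
    """
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{base_url.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


def index_rich_document(path, url, content_type="application/octet-stream"):
    """POST the raw file bytes to Solr, which passes them to Tika for parsing."""
    with open(path, "rb") as fh:
        payload = fh.read()
    request = Request(
        url,
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status  # HTTP status code returned by Solr


if __name__ == "__main__":
    # Hypothetical host, core, and document ID; adjust to your installation.
    url = build_extract_url("http://localhost:8983/solr", "collection1", "doc1")
    print(url)
    # index_rich_document("report.pdf", url, "application/pdf")
```

The call to `index_rich_document` is left commented out because it needs a running Solr instance and a real file. Building the URL separately keeps the request parameters easy to inspect before anything is sent.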
