Indexing binary content on the server (Intermediate)

If Solr could only index structured documents, it would leave the vast majority of possible content untouched. Fortunately, with the help of another Apache open source project, Apache Tika, Solr can also index binary content. Whether it is a PDF document, an MS Word or OpenOffice document, an image, or even a song, it can be indexed into Solr.

Of course, it makes no sense to simply load raw binary content into Solr. Instead, Tika parses binary formats, extracts the available metadata and, in some cases, the textual content, and makes it available to Solr. In the case of pseudo-binary documents, such as the latest MS Word or OpenOffice formats, a considerable amount of information is available. For images ...
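
As a rough illustration of how a binary file might be handed to Solr for Tika to parse, here is a minimal Python sketch. It assumes that the extracting request handler (Solr Cell) is enabled in solrconfig.xml and reachable at the /update/extract path under a core. The base URL, the core name, the document ID, and the helper names build_extract_request and index_file are all hypothetical and chosen only for this example. The exact URL layout depends on the Solr version and installation.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(solr_base, core, doc_id, payload, content_type, commit=True):
    """Build (but do not send) a POST to Solr's extracting request handler.

    Assumption: the handler is mapped to <solr_base>/<core>/update/extract.
    """
    # literal.<field> sets a field value directly; here it supplies the unique ID.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    url = f"{solr_base.rstrip('/')}/{core}/update/extract?{urlencode(params)}"
    # The raw bytes are the request body; Tika does the parsing on the server.
    return Request(url, data=payload, headers={"Content-Type": content_type}, method="POST")


def index_file(path, solr_base, core, doc_id, content_type):
    """Read a local file and send it to Solr; returns the HTTP status code."""
    with open(path, "rb") as fh:
        request = build_extract_request(solr_base, core, doc_id, fh.read(), content_type)
    with urlopen(request) as response:
        return response.status
```

With a running Solr instance, a call such as index_file("manual.pdf", "http://localhost:8983/solr", "docs", "manual-1", "application/pdf") would send the PDF to the server, where Tika extracts the metadata and text. The file name and values here are placeholders.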
