Chapter 3. Advanced Indexing

So far, we’ve taken a black-box approach to Ferret. This chapter explains what is really going on during indexing and, in the process, explains how to tune your index for maximum performance. We conclude by explaining how locking works. It is crucial that you understand this, particularly if you want to run Ferret in a multithreaded or multiprocess environment.

How the Indexing Process Works

We are now going to show how a source document (such as an HTML document from the Web, a row from a database, or an image from your personal image collection) becomes a Ferret document stored in the index. Ferret is agnostic about the source document’s type. Whether you are indexing an MP3 file, a text document, or a product from your store, Ferret treats it as a collection of string fields. So, the first step is to turn source documents into Documents. This is easy with plain-text documents. With other text document types, such as PDF or HTML, you’ll need to write a parser/reader that extracts the searchable text from the documents. For an image file, you might have a parser that extracts EXIF tags. Database rows usually map easily to Documents. See Chapter 6 for a framework for doing exactly this.
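To make the "collection of string fields" idea concrete, here is a minimal sketch in plain Ruby that does not load Ferret at all. The helper names `row_to_document` and `html_to_document` are hypothetical, and the regex-based HTML stripping is a deliberately crude stand-in for a real parser. The sketch assumes only that the end result should be a Hash of field names to strings, which is the shape you would then hand to an index.

```ruby
# Hypothetical helper: turn a database row (modeled here as a Hash of
# column name => value) into a flat collection of string fields.
def row_to_document(row)
  row.each_with_object({}) do |(name, value), doc|
    next if value.nil?               # skip empty columns
    doc[name.to_sym] = value.to_s    # every field value becomes a String
  end
end

# Hypothetical helper: crude text extraction from an HTML page.
# A real reader would use a proper HTML parser instead of regexes.
def html_to_document(url, html)
  title = html[/<title>(.*?)<\/title>/mi, 1].to_s.strip
  text  = html.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
  { :url => url, :title => title, :content => text }
end

row_doc = row_to_document("id" => 42, "name" => "Widget",
                          "price" => 9.99, "notes" => nil)

html_doc = html_to_document(
  "http://example.com/",
  "<html><head><title>Ferret</title></head>" \
  "<body><p>Fast search</p></body></html>"
)

p row_doc
p html_doc
```

Both helpers end with the same kind of object regardless of the source type: field names mapped to strings. That uniformity is what lets the rest of the indexing process ignore where a document came from.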

Once you have a Document, you add it to an IndexWriter. This is where the magic begins. The Document’s fields are passed through an analyzer (if they are set to be tokenized) that breaks up the fields into searchable tokens however it sees fit (see ...
