So far, we’ve taken a black-box approach to Ferret. This chapter explains what is really going on during indexing and, in the process, explains how to tune your index for maximum performance. We conclude by explaining how locking works. It is crucial that you understand this, particularly if you want to run Ferret in a multithreaded or multiprocess environment.
We are now going to show how a source document—such as an HTML
document from the Web, a row from a database, or an image from your
personal image collection—becomes a Ferret document stored in the index.
Ferret is agnostic about the source document’s type. It doesn’t matter
whether you are indexing an MP3 file, a text document, or your store’s
products; Ferret treats each one as a collection of string fields. So, the first
step is to turn source documents into Documents. This is pretty easy with plain-text
documents. With other text document types, such as PDF or HTML, you’ll
need to write a parser/reader that extracts the searchable text from the
documents. For an image file, you might have a parser that extracts EXIF
tags. Database rows usually map pretty easily to Documents. See Chapter 6 for a
framework for doing exactly that.
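To make the "collection of string fields" idea concrete, here is a minimal sketch of that first step in plain Ruby. The helper names `text_to_document` and `row_to_document`, the field names, and the sample data are all hypothetical and chosen only for illustration; the database row is represented as an ordinary Hash rather than a real database record.

```ruby
# Hypothetical helpers: each turns one kind of source into a Hash of
# string fields, the shape described above for a document.

# A plain-text source: the whole body becomes one searchable field.
def text_to_document(path, body)
  { :path => path.to_s, :content => body.to_s }
end

# A database row (represented here as a Hash of column => value):
# every value is converted to a string, and nil columns are dropped.
def row_to_document(row)
  doc = {}
  row.each do |column, value|
    next if value.nil?
    doc[column.to_sym] = value.to_s
  end
  doc
end

# Sample data, made up for this sketch.
text_doc = text_to_document("notes/ferret.txt", "Ferret is a search library.")
row_doc  = row_to_document(
  "id" => 42, "name" => "Widget", "price" => 9.99, "discontinued_on" => nil
)
```

Whatever the source type, the result has the same shape: field names mapped to strings. A Hash like this should be usable wherever the text talks about a Document, since Ferret accepts plain hashes as documents, but treat that as something to confirm against the version you are running.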
Once you have a Document, you add it to an IndexWriter. This is where the magic begins. The Document’s fields are passed through an analyzer (if they are set to be tokenized) that breaks up the fields into searchable tokens however it sees fit (see ...