Tuning Indexing Performance

Ferret indexes quickly out of the box, so you may wonder whether you need to make it any faster. In most cases, you won’t. But if you are indexing gigabytes rather than megabytes and the indexing process is taking hours rather than seconds, you need to know how to push Ferret to its limits.

In-Memory Indexing

People who have used Lucene or earlier versions of Ferret might try to improve indexing speed by indexing to a RAMDirectory and then flushing the RAMDirectory to disk. That trick is now pointless; Ferret automatically indexes as many documents as it can in memory before flushing them to the Directory. You can ensure that all the indexing is done in memory by setting the parameters :max_buffered_docs and :max_buffer_memory to sufficiently large quantities.
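As a minimal sketch of that last point, the options could be built like this. The helper name in_memory_writer_options, the workload figures, and the index path are assumptions made for illustration, not part of Ferret; the sketch also assumes :max_buffer_memory is given in bytes. The Ferret calls are left as comments so the snippet runs without the gem installed.

```ruby
# Hypothetical helper (not part of Ferret): builds options intended to keep
# all buffering in memory for an expected workload.
MEGABYTE = 1024 * 1024

def in_memory_writer_options(expected_docs:, expected_megabytes:)
  {
    # Set the document limit above the number of documents we plan to add.
    :max_buffered_docs => expected_docs + 1,
    # Set the memory limit (assumed to be in bytes) above what we expect to need.
    :max_buffer_memory => expected_megabytes * MEGABYTE
  }
end

# Illustrative figures only; choose values that fit your data and your RAM.
options = in_memory_writer_options(expected_docs: 50_000, expected_megabytes: 256)

# With the ferret gem installed, the options might be passed along these lines
# (the path and the documents collection are placeholders):
#   require 'ferret'
#   writer = Ferret::Index::IndexWriter.new(options.merge(:path => '/path/to/index'))
#   documents.each { |doc| writer << doc }
#   writer.close
```

Both limits have to be raised together, because reaching either one triggers a flush.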

Indexing Parameters

The indexing process is regulated by the parameters shown with their defaults in Table 3-1.

Table 3-1. Index parameters

Parameter            Default   Short description
:max_buffer_memory   16 MB     The maximum memory used by the IndexWriter before buffered documents are flushed to the index
:chunk_size          1 MB      The size of the memory chunks allocated to the memory pool during indexing
:merge_factor        10        The minimum number of similar-sized segments needed to trigger a merge
:max_buffered_docs   10,000    The maximum number of documents that will be buffered by the IndexWriter before they are flushed to the index
...
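The defaults listed above can be written down as a Ruby hash and overridden selectively. This is only a sketch: the constant DEFAULT_INDEXING_PARAMS and the helper indexing_params are hypothetical names, the hash covers only the four rows visible in the table, and the memory sizes are assumed to be expressed in bytes.

```ruby
# Defaults from Table 3-1 (only the rows shown there), with sizes assumed
# to be in bytes.
DEFAULT_INDEXING_PARAMS = {
  :max_buffer_memory => 16 * 1024 * 1024, # 16 MB
  :chunk_size        => 1024 * 1024,      # 1 MB
  :merge_factor      => 10,
  :max_buffered_docs => 10_000
}.freeze

# Hypothetical helper: returns a new hash with the given overrides applied,
# leaving the defaults untouched.
def indexing_params(overrides = {})
  DEFAULT_INDEXING_PARAMS.merge(overrides)
end

# Example: require more similar-sized segments before a merge is triggered.
params = indexing_params(:merge_factor => 20)
```

The resulting hash could then be merged with a :path or :dir option and handed to the index constructor, as with any other Ferret options hash.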
