By David Balmain
Book Price: $24.99 USD
£15.50 GBP
make
and a C compiler such as gcc to build the
extension. Other than that, Ferret comes free. You simply need to run the
gem install script:

dave$ sudo gem install ferret

irb session, as shown
in .
irb is a great way to play around with Ferret
and try out new things. Next, we’ll show you how to index all the text
files under a particular directory.
:path parameter clearly specifies where
you want to store the index. Setting the :create
parameter to true tells Ferret to
create a new index in the specified directory. Any index already residing
in the specified directory will be overwritten, so be careful when setting
:create to true. We saw earlier that we can add simple
Index class. This class
is just a convenient, easy-to-use interface to the rest of the Ferret API.
It does most of the hard work for you, such as parsing queries and keeping
track of the IndexReader and IndexWriter classes behind the scenes, or
knowing when to commit the index so that your search is always on the
latest version of the index. There is a lot more that you can do with the
Index class, and in most cases it will
be all you need. But if you really want to take full advantage of all
Ferret has to offer, you’ll need to find out what is going on behind the
scenes. In , we’ll learn more about how
indexing actually works and how to configure the index for your
application.

- Ferret::Store::Directory
- Ferret::Index::FieldInfo
- Ferret::Index::FieldInfos
- Ferret::Field
- Ferret::Document
- Ferret::Index::IndexWriter
- Ferret::Index::IndexReader

Directory class: RAMDirectory and FSDirectory. After that, we’ll look at the
building blocks of the Ferret index—the Field and Document classes—and we’ll discuss document and
field boosting. We’ll then look at setting up the index. This involves using
the FieldInfo and FieldInfos classes and is
probably the most important topic in this chapter. Finally, we’ll discuss
how the actual indexing process works, at which time you’ll learn when to
use the IndexReader
and IndexWriter
classes. Readers should pay special attention to the ” section in , as it seems to be the biggest problem area
for new users of Ferret.

Ferret::Store::Directory. Directory is an abstract class that
specifies how an index should be stored. Ferret comes with two implementations of the
Directory class: Ferret::Store::FSDirectory
for storing the index on the filesystem, and Ferret::Store::RAMDirectory
for storing an index in memory. You’ve used both of these
classes already: RAMDirectory in and FSDirectory in Examples and . The Index class hides all the details, but when you
pass a :path parameter to Index.new, an FSDirectory is used; otherwise, a RAMDirectory is
used.

FSDirectory. For the most part, RAMDirectory is used internally during indexing.
There are times, however, when RAMDirectory will come in handy. It is a little
faster than FSDirectory, so if you have
a small index that will fit in your available memory and you need a really
fast search, you might choose to use a RAMDirectory. You can even load an existing
filesystem index into a RAMDirectory:

RAMDirectory is most beneficial
during testing. Each unit test can create a new index, which is automatically
cleaned up at the end of the test.

Ferret::Document
class. This class extends Ruby’s Hash
class, adding only a boost
attribute. In fact, as you saw in , documents
can also be Hashes, where the key is
the name of the field and the value is the data stored in the
field.

Document class. A document can
represent a PDF or a text document, or it can represent something like a movie or a product. Make
note of the formatting we use to distinguish documents from the
Document class.

Documents have a boost attribute, but we didn’t say what boost was for. The
boost attribute gives a document a
higher weighting in the results of a search. By using the boost attribute, you can make more important
documents appear higher in the search results. The default boost value is 1.0, so if you have a document
that you consider to be important, you might set its boost to 100.0. Another document that you
consider less important might have
its boost set to 0.0001.

Ferret::Index::FieldInfo object. A
FieldInfo is an immutable class with
the following properties:

- name
- boost
- stored?
- compressed?
- indexed?
- tokenized?
- omit_norms?
- store_term_vectors?
- store_positions?
- store_offsets?

FieldInfo#name property is
a symbol used to match the FieldInfo
object with a field in a document. FieldInfo#boost is
the default boost that is given to each instance of the field when it is
added to the index. This is where, for example, you would boost the
:title field if you wanted it to have
more weight in the search results than the :content field. The default value for #boost is 1.0.

FieldInfo object. For example:
Ferret::Index::IndexReader and Ferret::Index::IndexWriter
classes through the explanations and examples. You’ll learn more about
these two classes in .

IndexWriter
class. You’ve already seen how to add documents using the Index class; if you look under the covers,
you’ll find that the Index object is
just using an IndexWriter to write documents to the
index:

commit method at
the end. None of the changes you make to an index through an IndexWriter are guaranteed to be applied until
you call either commit, optimize, or close. The commit method simply applies all changes to
the index, leaving the IndexWriter open. close does the same thing, then closes the IndexWriter.
optimize is like commit in that it commits all changes, leaving
the IndexWriter open for more index
modifications. However, it also optimizes the index for searching. This
process can be quite resource-intensive and should be called sparingly,
usually after running a batch indexing process or once a day as a cron
process.

IndexReader:

IndexWriter to
delete documents as you would expect. But you can also use IndexReader to delete documents. The two delete methods work in slightly different
ways. We’ll start with IndexReader#delete

Analyzers will strip all numbers and
you’ll end up with an empty field:

RangeQuery sorts fields lexicographically, so
while 200 comes before 500, 70 comes after 500. To fix this, pad the numbers to a fixed width by prepending zeros. So
instead of adding 5, 70, and 200, you would add 0005, 0070, and 0200,
and instead of adding 3.45 and 101.95, you would add 0003.45 and
0101.95. This is pretty easy using Ruby’s printf-like notation:
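
As a minimal sketch of that padding, the snippet below uses Ruby's built-in `format` (the same notation as the `%` operator on strings). The helper names `pad_int` and `pad_price` and the chosen widths are assumptions made for illustration; they are not part of Ferret.

```ruby
# Zero-pad numbers to a fixed width so that string (lexicographic) order
# matches numeric order. Helper names and widths are illustrative only.
def pad_int(n, width = 4)
  format("%0#{width}d", n)
end

# Seven characters in total: four integer digits, the point, two decimals.
def pad_price(x)
  format("%07.2f", x)
end

ints = [5, 70, 200]
p ints.map(&:to_s).sort                    # => ["200", "5", "70"]
p ints.map { |n| pad_int(n) }.sort         # => ["0005", "0070", "0200"]
p [3.45, 101.95].map { |x| pad_price(x) }  # => ["0003.45", "0101.95"]
```

Pick a width large enough for the biggest value you expect to index. Note that this simple scheme only orders non-negative numbers correctly.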
Analyzer so that you no longer need
to think about it when adding .

Times to a document; the time is converted to a String using its
to_s method when it is added to the
index:

add, get,
update, and delete on a Ferret index. That pretty much
covers everything you need to know about indexing with Ferret.

Documents. This is pretty easy with plain-text
documents. With other text document types, such as PDF or HTML, you’ll
need to write a parser/reader that extracts the searchable text from the
documents. For an image file, you might have a parser that extracts EXIF
tags. Database rows usually map pretty easily to Documents. See for a framework for doing exactly
this.

Document, you add
it to an IndexWriter. This is
where the magic begins. The Document’s
fields are passed through an analyzer (if they are set to be tokenized)
that breaks up the fields into searchable tokens however it sees fit (see
for more information). Once the Document has passed through the analyzer, it is
buffered until the IndexWriter is ready
to write a new segment to the index. Exactly how many documents are
buffered depends on the IndexWriter’s
:max_buffered_docs and :max_buffer_memory parameters, which are
discussed in the ” section later in this
chapter.

IndexWriter hits its
limit for the maximum number of buffered documents, or when it is
committed, it writes the buffered documents to a segment in the Directory. In , the IndexWriter buffers 10 documents before writing
them as a segment. As segments are written to the index, the IndexWriter maintains a segment stack. Each time
a segment is pushed onto the stack, the IndexWriter checks to see if there are :merge_factor
segments on top of the stack that are all the same size. If there are,
they are popped off the stack, merged, the merged segment is pushed back
onto the stack, and the process is repeated. In , the merge factor is set to 3, so if
there are three 10-document segments on top of the stack, they are merged
into one segment.
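
The merge rule just described can be simulated in a few lines of plain Ruby. This is a toy model under stated assumptions, not Ferret's implementation: the class name `SegmentStackModel` is hypothetical, a segment is represented only by its document count, and "same size" is taken to mean an equal document count.

```ruby
# Toy model of the segment stack. NOT Ferret's code: each segment is just
# a document count, and the class name is invented for illustration.
class SegmentStackModel
  attr_reader :segments

  def initialize(max_buffered_docs:, merge_factor:)
    @max_buffered_docs = max_buffered_docs
    @merge_factor = merge_factor
    @buffered = 0
    @segments = [] # document count per segment; the last element is the top
  end

  # Buffer one document and flush once the buffer limit is reached.
  def add_document
    @buffered += 1
    flush if @buffered >= @max_buffered_docs
  end

  # Write buffered documents as a new segment, then merge where possible.
  def flush
    return if @buffered.zero?
    @segments.push(@buffered)
    @buffered = 0
    merge_segments
  end

  private

  # While the top merge_factor segments are all the same size, pop them,
  # merge them, and push the merged segment back onto the stack.
  def merge_segments
    loop do
      top = @segments.last(@merge_factor)
      break unless top.size == @merge_factor && top.uniq.size == 1
      @segments.pop(@merge_factor)
      @segments.push(top.sum)
    end
  end
end

model = SegmentStackModel.new(max_buffered_docs: 10, merge_factor: 3)
90.times { model.add_document }
p model.segments # => [90]

20.times { model.add_document }
p model.segments # => [90, 10, 10]
```

With a buffer of 10 documents and a merge factor of 3, the first 90 documents collapse into a single segment, while the next 20 sit on top as two 10-document segments until a third arrives.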
Likewise, three 30-document segments are merged into one 90-document segment, and so
on.

RAMDirectory and
then flushing the RAMDirectory to disk. That trick is now
pointless; Ferret automatically indexes as many documents as it can in
memory before flushing them to the Directory. You can ensure that all the
indexing is done in memory by setting :max_buffered_docs and :max_buffer_memory to
sufficiently large values.

| Parameter | Default | Short description |
|---|---|---|
| :max_buffer_memory | 16 MB | The maximum memory used by the IndexWriter before buffered documents are flushed to the index |
| :chunk_size | 1 MB | The size of the memory chunks allocated to the memory pool during indexing |
| :merge_factor | 10 | The minimum number of similar sized segments needed to trigger a merge |
| :max_buffered_docs | 10,000 | The maximum number of documents that will be buffered by the IndexWriter before they are flushed to the index |
| :max_merge_docs | Infinite | The maximum number of documents that will be merged into a segment |
| :max_field_length | 10,000 | The maximum number of terms from any field that will be added to the index |
| :use_compound_file | true | Specifies whether or not to write the index in compound file format |
| :index_skip_interval | 128 | The skip interval between terms in the term dictionary |
| :doc_skip_interval | 16 | The skip interval between document IDs in the term dictionary |
SegmentReader.
Each of those SegmentReaders has its
own term dictionary, term enumerators, document enumerators, etc., so they
can chew up quite a few resources, not to mention the fact that each
reader needs to be searched separately. Therefore, the fewer the segments
the better when searching the index. This is even more important when
running short-lived command-line programs. It takes much more time to read
in an unoptimized index than to read in an optimized index.

IndexWriter has an optimize method that minimizes the number of segments in an index, making
the index optimal for searching. The best time to optimize the index is at
the end of a batch indexing session. If, however, you are incrementally
indexing your data—as you might do when indexing a model in a Rails
application—you need to be more careful deciding when to optimize the
index. The optimizing process itself can be quite resource-intensive, and
it prevents any other documents from being added to the index. Thus, it is
certainly not a good idea to optimize the index after each document is
added to the index.

IndexWriter open for
adding documents to the index. You will need to close the IndexWriter before committing the deletions and
then reopen the IndexWriter. Why, you
ask? To answer this, you have to understand how index locking
works.

IndexWriter and IndexReader. The IndexWriter acquires the write lock as soon as it is
opened and keeps it until it is closed. (Hence, the importance of closing
IndexWriters when you have finished
with them.) As a result, you can only ever have one IndexWriter open on an index at any time, and
you can’t perform any write operations with an IndexReader while there is an IndexWriter open on the index. IndexWriter will also obtain the commit lock
when you optimize, commit, or close the index. Furthermore, IndexWriter will obtain the commit lock
unpredictably during indexing. IndexWriter will commit the index whenever
segments are merged, so it is difficult to predict exactly when it will
obtain the commit lock, except to say that it may happen whenever a
document is added to the index.

IndexReader, on the other hand,
acquires the write lock only when a write operation is called. There are
three of these: delete, undelete, and set_norm. The commit lock is acquired only when
you call the IndexReader’s commit

Index class, all
you have to know is the search_each()
method and a little bit of Ferret’s query language and you are set.
However, if you take the time to learn the rest of the search API, you’ll
discover a wealth of opportunities you didn’t even know existed.

- IndexSearcher
- Query
- QueryParser
- Filter
- Sort

IndexSearcher, as the
name would suggest, is used to search indexes. You can
also use it to highlight and explain query results and read documents
from the index (as you would with IndexReader). To create an IndexSearcher, you need to supply it with an
IndexReader:

Directory or a filesystem path to the
index:

IndexSearcher so it will retrieve your result
set. Queries are the fundamental building block of the search
API.

QueryParser:

QueryParser is the magic behind the

QueryParser to build all your queries, you’ll
gain a better understanding of how searching works in Ferret by building
each of the queries by hand. We’ll also include the Ferret Query Language
(FQL) syntax for each different type of query as we go. As you read,
you’ll find some queries that you can’t build even using the QueryParser, so it will be useful to learn about
them as well.

Query has a boost field. Because you will
usually be combining queries with a BooleanQuery, it
can be useful to give some of those queries a higher weighting than the
other clauses in the BooleanQuery. All Query objects also implement hash and eql?, so they can be used in a HashTable to cache query results.

TermQuery is the most basic of all queries and is actually the building
block for most of the other queries (even where you wouldn’t expect it,
like in

QueryParser. You’ve
already seen many examples of the Ferret Query Language (FQL) in the
previous section (”),
and you’ll have noticed that most of the queries you can build in code can
be described much more easily in FQL. In this section, we’ll talk about
setting up the QueryParser, and then
we’ll go into more detail about FQL.

QueryParser has a number of
parameters, as shown in .

| Parameter | Default | Short description |
|---|---|---|
| :default_field | :* | The default field to be searched; it can also be an array. |
| :analyzer | StandardAnalyzer | Analyzer used by the query parser to parse query terms. |
| :wild_card_downcase | true | Specifies whether wildcard queries should be downcased, since they are not analyzed by the parser. |
| :fields | [] | Lets the query parser know what fields are available for searching, particularly when :* is specified as the search field. |
| :validate_fields | false | Set to true if you want an exception to be raised if there is an attempt to search a nonexistent field. |
| :or_default | true | Use OR as the default Boolean operator. |
| :default_slop | 0 | Default slop to use in PhraseQueries. |
| :handle_parser_errors | true | QueryParser will quietly handle all parsing errors internally. If you’d like to handle them yourself, set this parameter to false. |
| :clean_string | true | QueryParser will quickly review the query string to make sure that quotes and brackets match up and special characters are escaped. |
| :max_clauses | 512 | The maximum number of clauses allowed in Boolean queries and the maximum number of terms allowed in multi, prefix, wildcard, or fuzzy queries. |
Filters in
our discussion of ConstantScoreQuery and FilteredQuery. Filters
are used to apply extra constraints to a result set. For example, we want
to restrict our search to documents that were created during the last
month. We have two options: add a RangeQuery clause to our query, or apply a
RangeFilter. The main advantage of
using a Filter over a Query is that no score is taken into
account, so a Filter can be a lot faster. To add to
that, Filters cache their results so that subsequent
uses of the Filter perform even better again. All
caching is done against an instance of an IndexReader, so a new cache needs to be
built each time a Filter is used against a different
IndexReader.

Filters also make it easy to apply constraints to
user input queries. Filters are best used when applying
commonly used constraints to a user’s query, such as restricting a search
of a blog to only today’s postings or only to postings marked for publication.

Filters that come
with Ferret:

- RangeFilter
- QueryFilter

RangeFilter takes the same parameters as RangeQuery as
described in the ”
section earlier in this chapter. Basically, you need to supply a
:field and an upper and/or lower
limit for that field. For example, if we want to restrict a search to products
that are priced at $50.00 or more and less than $100.00, we would build
the filter like this:

RangeFilter works only on fields that are correctly
lexically sorted, so you need to remember to pad all number fields to a
fixed width if you want to filter that field with a
RangeFilter.

QueryFilter makes use of a query to filter
search results. The initial application of a
QueryFilter will be just as slow as if you added the
filter query as a :must clause to the
actual query. However, after caching, subsequent use of the
QueryFilter will be much faster.

Array#sort method. However, this would take too long for large result sets, not
to mention use up a lot of unnecessary memory. Searcher provides a
:sort parameter for easy sorting. The
easiest way to specify a sort is to pass a sort string. A sort string is a comma-separated list of field
names with an optional DESC modifier to reverse the
sort for that field. The type of the field is automatically detected and
the field sorted accordingly. So Float fields will be
sorted by Float value, and Integer
fields will be sorted by Integer value.
SCORE and DOC_ID can be used in place of field names to sort by relevance and
internal document ID, respectively. Here are some examples:

Sort and SortField.

Ferret::Search::Searcher
and Ferret::Index::Index classes have a
highlight method. In this section, we’ll look at Index#highlight because it allows us to pass
string queries instead of having to build Query objects
(see ). Otherwise, both methods are essentially
the same. To use the highlight method,
you must supply a query and the document ID of the document you wish to
highlight. A number of other parameters can be used to describe exactly
how you want to highlight the field.

| Parameter | Description |
|---|---|
| :field | Defaults to @options[:default_field]. The highlighter only works on one field at a time, so you need to specify which field it is you want to highlight. If you want to highlight multiple fields, you'll need to call this method multiple times. |
| :excerpt_length | Defaults to 150 bytes. This parameter specifies the length of excerpt to show. The algorithm for extracting excerpts attempts to fit as many matched terms into each excerpt as possible. If you’d simply like the complete field back with all matches highlighted, set this parameter to :all. |
| :num_excerpts | Specifies the number of excerpts you wish to retrieve. This defaults to 2, unless :excerpt_length is set to :all, in which case :num_excerpts is set to 1. |
| :pre_tag | To highlight matches, you need to specify short strings to place before and after each match. :pre_tag defaults to <b>, which is fine when printing HTML, but if you are printing results to the console, we recommend using something like |