Ferret by David Balmain

Chapter 4. Search

Everything you’ve learned so far about creating indexes is pretty useless if you don’t know how to use those indexes to find what you are looking for. After all, that’s what Ferret is for. This chapter covers everything you need to know about searching in Ferret. We’ll start with the basic search classes, followed by the various types of query. Then we’ll talk about the query parser and Ferret’s own query language, FQL. Finally, we’ll cover some more advanced topics such as sorting, filtering, and highlighting.

Overview of Searching Classes

Ferret’s search API is about as simple as its indexing API. In fact, if you are using the Index class, all you have to know is the search_each() method and a little bit of Ferret’s query language and you are set. However, if you take the time to learn the rest of the search API, you’ll discover a wealth of opportunities you didn’t even know existed.

The search API consists of the following classes:

  • IndexSearcher

  • Query

  • QueryParser

  • Filter

  • Sort

IndexSearcher

IndexSearcher, as the name would suggest, is used to search indexes. You can also use it to highlight and explain query results and read documents from the index (as you would with IndexReader). To create an IndexSearcher, you need to supply it with an IndexReader:

reader = IndexReader.new("path/to/index") 
searcher = Searcher.new(reader)

As usual, you can shortcut this by supplying it with a Directory or a filesystem path to the index:

searcher = Searcher.new("path/to/index")

Query

Ferret contains more than 15 different types of query, each of which you’ll learn about later in this chapter. Basically, queries are built and combined to specify what exactly it is you are looking for. You can then pass them to the IndexSearcher so it will retrieve your result set. Queries are the fundamental building block of the search API.

QueryParser

With more than 15 different types of query (each with its own distinct API), it can get quite tedious to build them by hand. Succinct as Ruby code is, it is much easier to build queries using a simple query language, not to mention the fact that you wouldn’t want users to have to type Ruby code into your search box. For example, let’s say we wanted to search for all articles in a blog that have the words “ruby” and “ferret” in either the title field or the content field. You could use the QueryParser:

query = query_parser.parse("title|content:(ruby AND ferret)")

Or you could build the query yourself. The QueryParser is the magic behind the Index class that makes it so easy to use.

Filter

You can already specify exactly which documents you want to find using the various Query classes, so you might be wondering what you need a Filter class for. Filters have a few purposes. First, Filters actually cache their results, so if you have a particular query that is run over and over again, you might want to convert it to a Filter to improve performance. Second, Filters can be used to apply common constraints to Queries. For example, to restrict a user’s search to only published articles, you would use a Filter. Or you might let users filter their own searches with a drop-down menu of common filters, perhaps a filter that restricts search results to the last month or the last seven days. This is particularly useful because most users won’t know the range query syntax. You’ll learn more about when to use a filter and how to create your own filter later in this chapter.

Sort

By default, the IndexSearcher returns your query results in order of relevance. If you want to sort the results in any other way, you are going to have to use the Sort class. As discussed in Chapter 2, you can currently sort by Integer, Float, and String. We’ll cover this in more detail in the “Sorting Search Results” section later in this chapter.

Building Queries

Even if you are using the QueryParser to build all your queries, you’ll gain a better understanding of how searching works in Ferret by building each of the queries by hand. We’ll also include the Ferret Query Language (FQL) syntax for each different type of query as we go. As you read, you’ll find some queries that you can’t build even using the QueryParser, so it will be useful to learn about them as well.

Before we get started, we should mention that each Query has a boost field. Because you will usually be combining queries with a BooleanQuery, it can be useful to give some of those queries a higher weighting than the other clauses in the BooleanQuery. All Query objects also implement hash and eql?, so they can be used as keys in a Hash to cache query results.

TermQuery

TermQuery is the most basic of all queries and is actually the building block for most of the other queries (even where you wouldn’t expect it, like in WildcardQuery and FuzzyQuery). It is very simple to use. All you need to do is specify the field you want to search in and the term you want to search for:

# FQL: "title:shawshank"
query = TermQuery.new(:title, "shawshank")

BooleanQuery

BooleanQueries are used to combine other queries. Combined with TermQuery, they cover most of the queries users use every day on the major search engines. We already saw an example of a BooleanQuery earlier, but we didn’t explain how it works. A BooleanQuery is implemented as a list of BooleanClauses. Each clause has a type: :should, :must, or :must_not. :should clauses add value to the relevance score when they are found, but the query won’t reject a document just because the clause isn’t present. This is the type of clause you would find in an “or” query. A :must clause, on the other hand, must be present in a document for that document to be returned as a hit. Finally, a :must_not clause causes the BooleanQuery to reject all documents that contain that clause. For example, say we want to find all documents that contain the word “rails”, but we don’t want the documents about trains, and we’d especially like the “rails” documents that also contain the term “ruby”. We’d implement this query like this:

# FQL: "content:(+rails -train ruby)"
query = BooleanQuery.new()
query.add_query(TermQuery.new(:content, "rails"), :must)
query.add_query(TermQuery.new(:content, "train"), :must_not)
query.add_query(TermQuery.new(:content, "ruby"),  :should)

One rule to remember when creating BooleanQueries is that every BooleanQuery must include at least one :must or :should clause. A BooleanQuery with only :must_not clauses will not raise any exceptions, but it also won’t return any results. If you want to find all documents without a certain attribute, you should add a MatchAllQuery to your BooleanQuery. Let’s say you want to find all documents without the word “spam”:

# FQL: "* -content:spam"
query = BooleanQuery.new()
query.add_query(MatchAllQuery.new, :should)
query.add_query(TermQuery.new(:content, "spam"), :must_not)
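
The clause semantics are easy to model outside Ferret. The matches? method below is a hypothetical pure-Ruby sketch (not Ferret’s implementation, which also scores documents) that follows the rules just described, including the rule that a query with only :must_not clauses matches nothing:

```ruby
# Toy evaluation of BooleanQuery clause semantics over a document's term set.
# clauses is a list of [term, type] pairs, type being :must, :must_not, or :should.
def matches?(doc_terms, clauses)
  must     = clauses.select { |_, type| type == :must }.map(&:first)
  must_not = clauses.select { |_, type| type == :must_not }.map(&:first)
  should   = clauses.select { |_, type| type == :should }.map(&:first)

  return false if must_not.any? { |t| doc_terms.include?(t) } # reject on :must_not
  return false unless must.all? { |t| doc_terms.include?(t) } # every :must required
  # With no :must clauses, at least one :should clause has to match.
  must.any? || should.any? { |t| doc_terms.include?(t) }
end

clauses = [["rails", :must], ["train", :must_not], ["ruby", :should]]
matches?(%w[rails ruby],  clauses)       # => true
matches?(%w[rails train], clauses)       # => false (:must_not rejects)
matches?(%w[rails],       clauses)       # => true  (:should only affects scoring)
matches?(%w[ham], [["spam", :must_not]]) # => false (:must_not alone matches nothing)
```

Note how the last call shows why you need the MatchAllQuery trick: without a positive clause there is nothing to accept a document.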

PhraseQuery

Once you add PhraseQuery to your bag of tricks, you can build pretty much any query that most users would apply in their daily search engine usage. You build queries by adding one term at a time with a position increment. For example, let’s say that we want to search for the phrase “quick brown fox”. We’d build the query like this:

# FQL: 'content:"quick brown fox"'
query = PhraseQuery.new(:content)
query.add_term("quick", 1)
query.add_term("brown", 1)
query.add_term("fox", 1)

Ferret’s PhraseQueries offer a little more than the usual phrase query. You can actually skip positions in the phrase. For example, let’s say we don’t care what color the fox is; we just want a “quick <> fox”. We can implement this query like this:

# FQL: 'content:"quick <> fox"'
query = PhraseQuery.new(:content)
query.add_term("quick", 1)
query.add_term("fox", 2)

What if we want a “red”, “brown”, or “pink” fox that is either “fast” or “quick”? We can actually add multiple terms to a position at a time:

# FQL: 'content:"quick|fast red|brown|pink fox"'
query = PhraseQuery.new(:content)
query.add_term(["quick", "fast"], 1)
query.add_term(["red", "brown", "pink"], 1)
query.add_term("fox", 1)

So far, we’ve been strict with the order and positions of the terms. But Ferret also allows sloppy phrases. That means the phrase doesn’t need to be exact; it just needs to be close enough. Let’s say you want to find all documents mentioning “red-faced politicians”. You’d also want all documents containing the phrase “red-faced Canadian politician” or even “the politician was red-faced”. This is where sloppy queries come in handy:

# FQL: 'content:"red-faced politician"~4'
query = PhraseQuery.new(:content, 4) # set the slop to 4
query.add_term("red-faced", 1)
query.add_term("politician", 1)

# you can also change the slop like this
query.slop = 1

The key to understanding sloppy phrase queries is knowing how the slop is calculated. You can think of a phrase’s “slop” as its “edit distance”. It is the minimum number of steps that you need to move the terms from the original search phrase to get the phrase occurring in the document (see Figure 4-1).

Sloppy PhraseQuery
Figure 4-1. Sloppy PhraseQuery

The first phrase is an exact match, so the slop is 0. In the next phrase you need to move “politician” right once, so the slop is 1. The third phrase shows that the terms don’t need to be in order. Just move “politician” left three times and you have a match. Hence, the slop is 3.
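To make the arithmetic concrete, here is a hypothetical pure-Ruby sketch (not Ferret’s actual algorithm) that anchors the first phrase term at one of its positions in the document and counts how far every other term has to move from where an exact match would place it:

```ruby
# Rough slop calculation: anchor the first phrase term at one of its document
# positions, then sum how far each remaining term must move from where the
# exact phrase would place it. Returns the smallest such total, or nil if a
# term is missing from the document.
def phrase_slop(phrase, doc)
  positions = phrase.map { |t| (0...doc.size).select { |i| doc[i] == t } }
  return nil if positions.any?(&:empty?)
  positions[0].product(*positions[1..]).map { |combo|
    anchor = combo[0]
    combo.each_with_index.sum { |pos, i| (pos - (anchor + i)).abs }
  }.min
end

phrase = %w[red-faced politician]
phrase_slop(phrase, %w[red-faced politician])           # => 0
phrase_slop(phrase, %w[red-faced canadian politician])  # => 1
phrase_slop(phrase, %w[the politician was red-faced])   # => 3
```

The three results reproduce the three rows of Figure 4-1: exact match, one move right, and three moves left.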

RangeQuery

Now we are getting into some of the more specialized queries available in Ferret. RangeQuery does exactly what you would expect it to do: it searches for ranges of values. Most of the time, RangeQueries are used on date or number fields. Make sure you have these set up correctly as described in the “Date Fields” section in Chapter 2. For example, if you want to search for all blog entries between May 1, 2005 and March 15, 2006, you could build the query like this:

# FQL: 'date:[20050501 20060315]'
query = RangeQuery.new(:date, :lower => "20050501", :upper => "20060315")
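
Because range comparisons are done on the terms as strings, dates only behave correctly if they are indexed in a zero-padded, most-significant-first format such as YYYYMMDD. Building such bounds is plain Ruby, nothing Ferret-specific:

```ruby
# Format Time objects as zero-padded YYYYMMDD terms so that string
# comparison and chronological comparison agree.
lower = Time.utc(2005, 5, 1).strftime("%Y%m%d")   # => "20050501"
upper = Time.utc(2006, 3, 15).strftime("%Y%m%d")  # => "20060315"

# Lexicographic order now matches chronological order:
lower < upper   # => true
```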

We don’t need to include both ends of the range. We could search for all entries before Christmas 2005:

# FQL: 'date:<20051225]'
query = RangeQuery.new(:date, :upper => "20051225")

Or all entries after Christmas 2005:

# FQL: 'date:[20051225>'
query = RangeQuery.new(:date, :lower => "20051225")

So, what happens to the blog entries from Christmas day in these two examples? Both queries return blog entries from Christmas day because these bounds are inclusive. That is, they include all terms where :lower <= term <= :upper. We can easily make RangeQuery bounds exclusive. If we want to make the first example exclusive, we write it like this:

# FQL: 'date:{20050501 20060315}'
query = RangeQuery.new(:date,
                       :lower_exclusive => "20050501",
                       :upper_exclusive => "20060315")

This feature is useful for paging through documents by field value. Say we want to page through all the products in our database by price, starting with all products under $10, then all products between $10 and $20, etc., up to $100. We could do it like this:

10.times do |i|
  lower_price = "%06.2f" % (i * 10)
  upper_price = "%06.2f" % ((i + 1) * 10)

  query = RangeQuery.new(:price,
                         :lower => lower_price,
                         :upper_exclusive => upper_price)

  puts "products from $#{lower_price.to_f} to $#{upper_price.to_f}"
  index.search_each(query) do |doc_id, score|
    puts "    #{index[doc_id][:title]}"
  end
end

RangeQuery works just as well on string fields. Just keep in mind that the terms are always sorted as if they were binary strings, so you may get some unexpected results if your terms use multibyte character encodings.
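
You can see the effect directly in Ruby, whose string comparison is also bytewise: in UTF-8, a character like “é” encodes to bytes starting at 195, so it sorts after every ASCII letter:

```ruby
# UTF-8 multibyte characters begin with high byte values, so bytewise
# (binary) sorting puts them after all ASCII terms.
"\u00E9".bytes   # => [195, 169]

sorted = ["zebra", "\u00E9clair", "apple"].sort
# => ["apple", "zebra", "éclair"] -- "éclair" lands after "zebra",
#    not alongside the other words starting with "e"
```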

MultiTermQuery

This is kind of like an optimized Boolean OR query. The optimization comes from the fact that it searches only a single field, making lookup a lot faster because all clauses use the same section of the index. As usual, it is very simple to use. Let’s say you want to find all documents with the term “fast” or a synonym for it:

# FQL: 'content:"fast|quick|rapid|speedy|swift"'
query = MultiTermQuery.new(:content)
query.add_term("quick")
query.add_term("fast")
query.add_term("speedy")
query.add_term("swift")
query.add_term("rapid")

But there’s more. What if you would prefer documents with the term “quick” and you don’t really like the term “speedy”? You can program it like this:

# FQL: 'content:"speedy^0.5|fast|rapid|swift|quick^10.0"'
query = MultiTermQuery.new(:content)
query.add_term("quick", 10.0)
query.add_term("fast")
query.add_term("speedy", 0.5)
query.add_term("swift")
query.add_term("rapid")

You may be wondering what use this is, since we can perform this query (including the term weighting) with a BooleanQuery. The reason it is included is that it is used internally by a few of the more advanced queries that we’ll be looking at in a moment: PrefixQuery, WildcardQuery, and FuzzyQuery. In Apache Lucene, these queries are rewritten as BooleanQueries and they tend to be very resource-expensive queries. But a BooleanQuery for this task is overkill, and there are a few optimizations we can make because we know all terms are in the same field and all clauses are :should clauses. For this reason, MultiTermQuery was created, making WildcardQuery and FuzzyQuery much more viable in Ferret.

When some of these queries are rewritten to MultiTermQueries, there is a risk that they will add too many terms to the query. Say someone comes along and submits the WildcardQuery “?*” (i.e., search for all terms). If you have a million terms in your index, you could run into some memory overflow problems. To prevent this, MultiTermQuery has a :max_terms limit that is set to 512 by default. You can set this to whatever value you like. If you try to add too many terms, by default the lowest-scored terms will be dropped without any warning. You can increase the :max_terms like this:

query = MultiTermQuery.new(:content, :max_terms => 1024)

You also have the option of setting a minimum score. This is another way to limit the number of terms added to the query. It is used by FuzzyQuery, in which case the range of scores is 0..1.0. You shouldn’t use this parameter in either PrefixQuery or WildcardQuery. The only other time you would probably use this is when building a custom query of your own:

query = MultiTermQuery.new(:content,
                           :max_terms => 1024,
                           :min_score => 0.5)
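
The pruning policy is easy to model. The BoundedTermList class below is a hypothetical illustration (not Ferret’s internals): it keeps at most max_terms entries and silently drops the lowest-scored term once the limit is exceeded:

```ruby
# Keeps at most max_terms entries, discarding the lowest-scored term
# (without any warning) whenever the limit would be exceeded.
class BoundedTermList
  def initialize(max_terms)
    @max_terms = max_terms
    @terms = {}  # term => score
  end

  def add_term(term, score = 1.0)
    @terms[term] = score
    # Drop the lowest-scored term if we've gone over the limit.
    @terms.delete(@terms.min_by { |_, s| s }.first) if @terms.size > @max_terms
  end

  def terms
    @terms.keys
  end
end

list = BoundedTermList.new(2)
list.add_term("quick", 10.0)
list.add_term("speedy", 0.5)
list.add_term("fast", 1.0)   # "speedy" (score 0.5) gets dropped
list.terms                   # => ["quick", "fast"]
```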

PrefixQuery

The PrefixQuery is useful if you want to store a hierarchy of categories in the index as you might do for blog entries. You could store them using a Unix filename-like string:

index << { :category => "/sport/"               }
index << { :category => "/sport/judo/"          }
index << { :category => "/sport/swimming/"      }
index << { :category => "/coding/"              }
index << { :category => "/coding/c/"            }
index << { :category => "/coding/c/ferret"      }
index << { :category => "/coding/lisp/"         }
index << { :category => "/coding/ruby/"         }
index << { :category => "/coding/ruby/ferret/"  }
index << { :category => "/coding/ruby/hpricot/" }
index << { :category => "/coding/ruby/mongrel/" }

Note that the :category field in this case should be untokenized. Now you can find all entries relating to Ruby using a PrefixQuery:

# FQL: 'category:/coding/ruby/*'
query = PrefixQuery.new(:category, "/coding/ruby/")
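
Conceptually, a PrefixQuery selects every term in the field’s term dictionary that starts with the given string and ORs them together. Over the category terms above, that selection amounts to nothing more than:

```ruby
# All terms a PrefixQuery would match for the prefix "/coding/ruby/".
categories = %w[
  /sport/ /sport/judo/ /sport/swimming/
  /coding/ /coding/c/ /coding/c/ferret
  /coding/lisp/ /coding/ruby/ /coding/ruby/ferret/
  /coding/ruby/hpricot/ /coding/ruby/mongrel/
]

matches = categories.select { |c| c.start_with?("/coding/ruby/") }
# => ["/coding/ruby/", "/coding/ruby/ferret/",
#     "/coding/ruby/hpricot/", "/coding/ruby/mongrel/"]
```

This is also why the field must be untokenized: the prefix has to match against the whole stored term, not against individual tokens.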

PrefixQuery is the first of the queries covered here that uses MultiTermQuery internally. As we mentioned in the previous section, MultiTermQuery has a maximum number of terms that can be inserted. Let’s say you have 2,000 categories and someone submits a prefix query with /, the root category, as the prefix. Ferret will try to load all 2,000 categories into the MultiTermQuery, but MultiTermQuery will only allow the first 512; all others will be ignored. You can change this behavior when you create the PrefixQuery using the :max_terms property:

# FQL: 'category:/*'
query = PrefixQuery.new(:category, "/",
                        :max_terms => 1024)

WildcardQuery

WildcardQuery allows you to run searches with two simple wildcards: * matches any number of characters (zero or more), and ? matches exactly one character. If you look at PrefixQuery’s FQL, you’ll notice that it looks like a WildcardQuery. In fact, if you build a WildcardQuery with only a single * at the end of the term and no ?, it will be rewritten internally, during search, to a PrefixQuery. What’s more, if you create a WildcardQuery with no wildcards at all, it will be rewritten internally to a TermQuery. The WildcardQuery API is pretty similar to PrefixQuery’s:

# FQL: 'content:dav?d*'
query = WildcardQuery.new(:content, "dav?d*")
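
The wildcard semantics map directly onto regular expressions: * becomes .* and ? becomes a single dot. Here is a rough pure-Ruby translation for illustration (Ferret itself matches patterns against the term dictionary in C, not with Ruby regexps):

```ruby
# Translate a wildcard pattern into an anchored Regexp:
#   *  -> .*   (zero or more characters)
#   ?  -> .    (exactly one character)
def wildcard_to_regexp(pattern)
  Regexp.new("\\A" + Regexp.escape(pattern).gsub('\*', '.*').gsub('\?', '.') + "\\z")
end

re = wildcard_to_regexp("dav?d*")
re.match?("david")     # => true
re.match?("davidson")  # => true
re.match?("davd")      # => false ("?" must consume exactly one character)
```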

Just like PrefixQuery, WildcardQuery uses MultiTermQuery internally, so you can also set the :max_terms property:

# FQL: 'content:f*'
query = WildcardQuery.new(:content, "f*",
                          :max_terms => 1024)

You should be very careful with WildcardQueries. Any query that begins with a wildcard character (* or ?) will cause the searcher to enumerate and scan the entire field’s term index. This can be quite a performance hit for a very large index. You might want to reject any WildcardQueries that don’t have a non-wildcard prefix.

There is one gotcha we should mention here. Say you want to select all the documents that have a :price field. You might first try:

# FQL: 'price:*'
query = WildcardQuery.new(:price, "*")

This looks like it should work, right? The problem is, * matches even empty fields and actually gets optimized into a MatchAllQuery. On the bright side, the performance problems that plague WildcardQueries that start with a wildcard character don’t actually apply to plain old * searches. So, back to the problem at hand, we can find all documents with a :price field like this:

# FQL: 'price:?*'
query = WildcardQuery.new(:price, "?*")

However, don’t forget about the performance implications of doing this. Think about building a custom Filter to perform this operation instead.

FuzzyQuery

FuzzyQuery is to TermQuery what a sloppy PhraseQuery is to an exact PhraseQuery. FuzzyQueries match terms that are close to each other but not exact. For example, “color” is very close to “colour”. FuzzyQuery can be used to match both of these terms. Not only that, but they are great for matching misspellings like “collor” or “colro”. We can build the query like this:

# FQL: 'content:color~'
query = FuzzyQuery.new(:content, "color")

Again, just like PrefixQuery, FuzzyQuery uses MultiTermQuery internally so you can also set the :max_terms property:

# FQL: 'content:color~'
query = FuzzyQuery.new(:content, "color",
                       :max_terms => 1024)

FuzzyQuery is implemented using the Levenshtein distance algorithm (http://en.wikipedia.org/wiki/Levenshtein_distance). The Levenshtein distance is similar to slop: it is the number of edits needed to convert one term into another. So “color” and “colour” have a Levenshtein distance of 1 because a single letter has been added, while “colour” and “coller” have a Levenshtein distance of 2 because two letters have been replaced. Whether a term matches is determined by its score, which is calculated with the following formula, where target is the term we want to match and term is the term we are matching in the index.

Levenshtein distance score:

          1 − distance / min(target.size, term.size)

This means that an exact match will have a score of 1.0, whereas terms with no corresponding letters will have a score of 0.0. Since FuzzyQuery has a limit to the number of matching terms it can use, the lowest scoring matches get discarded if the FuzzyQuery becomes full.
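
For intuition, here is a minimal pure-Ruby version of the distance and the score formula (a sketch only; Ferret’s C implementation is far more optimized):

```ruby
# Classic single-row dynamic-programming Levenshtein distance.
def levenshtein(a, b)
  row = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    prev, row[0] = row[0], i
    b.each_char.with_index(1) do |cb, j|
      prev, row[j] = row[j], [row[j] + 1,                 # deletion
                              row[j - 1] + 1,             # insertion
                              prev + (ca == cb ? 0 : 1)   # substitution
                             ].min
    end
  end
  row[b.size]
end

# FuzzyQuery's similarity score: 1 - distance / min(target.size, term.size)
def fuzzy_score(target, term)
  1.0 - levenshtein(target, term).to_f / [target.size, term.size].min
end

levenshtein("color", "colour")   # => 1
fuzzy_score("color", "colour")   # => 0.8
levenshtein("colour", "coller")  # => 2
```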

Because of the way FuzzyQuery is implemented, it needs to scan every single term in its field’s index to find all valid similar terms in the dictionary. This can take a long time if you have a large index. One way to prevent performance problems is to set a minimum prefix length by setting the :min_prefix_length parameter when creating the FuzzyQuery. This parameter is set to 0 by default, which is why every term in the index needs to be scanned.

To minimize the expense of finding matching terms, we could set the minimum prefix length of the example query to 3. This would greatly reduce the number of terms that need to be enumerated, and “color” would still match “colour”, although “cloor” would no longer match:

# FQL: 'content:color~' => no way to set :min_prefix_length in FQL
query = FuzzyQuery.new(:content, "color",
                       :max_terms => 1024,
                       :min_prefix_length => 3)

You can also set a cut-off score for matching terms by setting the :min_similarity parameter. This will not affect how many terms are enumerated, but it will affect how many terms are added to the internal MultiTermQuery, which can also help improve performance:

# FQL: 'content:color~0.8' => no way to set :min_prefix_length in FQL
query = FuzzyQuery.new(:content, "color",
                       :max_terms => 1024,
                       :min_similarity => 0.8,
                       :min_prefix_length => 3)

In some cases, you may want to change the default values for :min_prefix_length and :min_similarity, particularly for use in the Ferret QueryParser. Simply set the class variables on FuzzyQuery:

FuzzyQuery.default_min_similarity = 0.8
FuzzyQuery.default_prefix_length = 3

MatchAllQuery

This query matches all documents in the index. The only time you’d really want to use this is in combination with a negative clause in a BooleanQuery or in combination with a filter, although ConstantScoreQuery makes more sense for the latter:

# FQL: '* -content:spam'
query = BooleanQuery.new()
query.add_query(MatchAllQuery.new, :should)
query.add_query(TermQuery.new(:content, "spam"), :must_not)

ConstantScoreQuery

This query is kind of like MatchAllQuery except that it is combined with a Filter. This is useful when you need to apply more than one filter to a query. It is also used internally by RangeQuery. “Constant Score” means that all hits returned by this query have the same score, which makes sense for queries like RangeQueries where either a document is in the range or it isn’t:

# FQL: 'date:[20050501 20060315]'
filter = RangeFilter.new(:date, :lower => "20050501", :upper => "20060315")
query = ConstantScoreQuery.new(filter)

FilteredQuery

So, what is the difference between this query and the previous one? Not a lot, really. The following two queries are equivalent:

# FQL: 'date:[20050501 20060315] && content:ruby'
query1 = FilteredQuery.new(TermQuery.new(:content, "ruby"),
                           RangeFilter.new(:date,
                                           :lower => "20050501",
                                           :upper => "20060315"))

# FQL: 'date:[20050501 20060315] && content:ruby'
filter = RangeFilter.new(:date, :lower => "20050501", :upper => "20060315")
query2 = BooleanQuery.new()
query2.add_query(TermQuery.new(:content, "ruby"), :must)
query2.add_query(ConstantScoreQuery.new(filter), :must)

It’s really just a matter of taste. There is a slight performance advantage to using a FilteredQuery, although the QueryParser will create a BooleanQuery.

Span Queries

Span queries are a little different from the queries we’ve covered so far in that they take into account the range of the terms matched. In the “PhraseQuery” section earlier in this chapter, we talked about using PhraseQuery to implement a simple Boolean AND query that ensured that the terms were close together. Span queries are designed to do that and more.

A couple of things to note here. First, span queries can contain only other span queries, although they can be combined with other queries using a BooleanQuery. Second, any one span query can contain only a single field. Even when you are using SpanOrQuery, you must ensure that all span queries added are on the same field; otherwise, an ArgumentError will be raised.

SpanTermQuery

The SpanTermQuery is the basic building block for span queries. It’s almost identical to the basic TermQuery. The difference is that it enumerates the positions of its matches. These positions are used by the rest of the span queries:

include Ferret::Search::Spans
# FQL: There is no Ferret Query Language for SpanQueries yet
query = SpanTermQuery.new(:content, "ferret")

Because of the position enumeration, SpanTermQuery will be slower than a plain TermQuery, so it should be used only in combination with other span queries.

SpanFirstQuery

This is where span queries start to get interesting. SpanFirstQuery matches terms within a limited distance from the start of the field, the distance being a parameter to the constructor. This type of query can be useful because often the terms occurring at the start of a document are the most important terms in a document. To find all documents with “ferret” within the first 100 terms of the :content field, we do this:

include Ferret::Search::Spans
# FQL: There is no Ferret Query Language for SpanQueries yet
span_term_query = SpanTermQuery.new(:content, "ferret")
query = SpanFirstQuery.new(span_term_query, 100)

SpanOrQuery

This query is pretty easy to understand. It is just like a BooleanQuery that takes only :should clauses. With this query we can find all documents with the term “rails” in the first 100 terms of the :content field or the term “ferret” anywhere:

# FQL: There is no Ferret Query Language for SpanQueries yet
span_term_query = SpanTermQuery.new(:content, "rails")
span_first_query = SpanFirstQuery.new(span_term_query, 100)
span_term_query = SpanTermQuery.new(:content, "ferret")

query = SpanOrQuery.new()
query.add(span_term_query)
query.add(span_first_query)

Let’s reiterate here that all span queries you add to SpanOrQuery must be on the same field.

SpanNotQuery

This query can be used to exclude span queries. This gives us the ability to exclude documents based on span queries, as you would do with a :must_not clause in a BooleanQuery. Let’s exclude all documents matched by the previous query that contain the terms “otter” or “train”:

# FQL: There is no Ferret Query Language for SpanQueries yet
span_term_query = SpanTermQuery.new(:content, "rails")
span_first_query = SpanFirstQuery.new(span_term_query, 100)

inclusive_query = SpanOrQuery.new()
inclusive_query.add(span_first_query)
inclusive_query.add(SpanTermQuery.new(:content, "ferret"))

exclusive_query = SpanOrQuery.new()
exclusive_query.add(SpanTermQuery.new(:content, "otter"))
exclusive_query.add(SpanTermQuery.new(:content, "train"))

query = SpanNotQuery.new(inclusive_query, exclusive_query)

SpanNearQuery

This is the one you’ve been waiting for, the king of all span queries. It allows you to specify a range for all the queries it contains. For example, if we set the range to 100, all span queries within this query must have matches within 100 terms of each other or the document won’t be a match. Let’s search for all documents with the terms “ferret”, “ruby”, and “rails” within a 50-term range:

# FQL: There is no Ferret Query Language for SpanQueries yet
query = SpanNearQuery.new(:slop => 50)
query.add(SpanTermQuery.new(:content, "ferret"))
query.add(SpanTermQuery.new(:content, "ruby"))
query.add(SpanTermQuery.new(:content, "rails"))

Actually, we could have done that with a sloppy PhraseQuery. But by combining other span queries, we can do a lot of things that a sloppy PhraseQuery can’t handle. One thing a sloppy PhraseQuery can’t do is force the terms to appear in the correct order. With SpanNearQuery, we can:

# FQL: There is no Ferret Query Language for SpanQueries yet
query = SpanNearQuery.new(:slop => 50, :in_order => true) #set in_order to true
query.add(SpanTermQuery.new(:content, "ferret"))
query.add(SpanTermQuery.new(:content, "ruby"))
query.add(SpanTermQuery.new(:content, "rails"))

This will match only documents that have the terms “ferret”, “ruby”, and “rails” within 50 terms of each other and in that particular order.
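
As a rough model of these semantics, the hypothetical helper below takes one candidate position per term and checks the two SpanNearQuery constraints: the matched positions must fit within the slop once perfect adjacency is accounted for, and with :in_order they must be ascending. Ferret’s real span matching is more subtle, but this captures the idea:

```ruby
# Toy check of SpanNearQuery semantics. position_lists holds the document
# positions of each term; a document matches if some combination of one
# position per term fits within the slop (and is ascending when in_order).
def span_near_match?(position_lists, slop:, in_order: false)
  position_lists[0].product(*position_lists[1..]).any? do |combo|
    next false if in_order && combo != combo.sort
    # Width beyond perfect adjacency must be within the slop.
    (combo.max - combo.min) - (combo.size - 1) <= slop
  end
end

# "ferret", "ruby", "rails" at positions 10, 30, 55: width 45, needs slop >= 43
span_near_match?([[10], [30], [55]], slop: 50)                  # => true
span_near_match?([[10], [30], [55]], slop: 40)                  # => false
# out-of-order terms fail once :in_order is set
span_near_match?([[30], [10], [55]], slop: 50, in_order: true)  # => false
```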

Boosting Queries

We mentioned at the start of this chapter that you can boost queries. This can be very handy when you want to make one term more important than your other search terms. For example, let’s say you want to search for all documents with the term “ferret” and the terms “ruby” or “rails”, but you’d much rather have documents with “rails” than just “ruby”. You’d implement the query like this:

# FQL: 'content:(+ferret rails^10.0 ruby^0.1)'
query = BooleanQuery.new()

term_query = TermQuery.new(:content, "ferret")
query.add_query(term_query, :must)

term_query = TermQuery.new(:content, "rails")
term_query.boost = 10.0
query.add_query(term_query, :should)

term_query = TermQuery.new(:content, "ruby")
term_query.boost = 0.1
query.add_query(term_query, :should)

Unlike boosts used in Documents and Fields, these boosts aren’t translated to and from bytes so they don’t lose any of their precision. As for deciding which values to use, it will still require a lot of experimentation. Use the Search::Searcher#explain and Index::Index#explain methods to see how different boost values affect scoring.

QueryParser

You’ve now been introduced to all the different types of queries available in Ferret, and you’ve learned how to build different queries by hand. Some of it probably seems like a lot of work and it’s certainly not something you’d ask a user to do. Luckily, we can leave most of the work to the Ferret QueryParser. You’ve already seen many examples of the Ferret Query Language (FQL) in the previous section (“Building Queries”), and you’ll have noticed that most of the queries you can build in code can be described much more easily in FQL. In this section, we’ll talk about setting up the QueryParser, and then we’ll go into more detail about FQL.

Setting Up the QueryParser

The QueryParser has a number of parameters, as shown in Table 4-1.

Table 4-1. QueryParser parameters

  • :default_field (default: :*). The default field to be searched; it can also be an array.

  • :analyzer (default: StandardAnalyzer). The analyzer used by the query parser to parse query terms.

  • :wild_card_downcase (default: true). Specifies whether wildcard queries should be downcased, since they are not analyzed by the parser.

  • :fields (default: []). Lets the query parser know which fields are available for searching, particularly when :* is specified as the search field.

  • :validate_fields (default: false). Set to true if you want an exception to be raised on an attempt to search a nonexistent field.

  • :or_default (default: true). Use OR as the default Boolean operator.

  • :default_slop (default: 0). The default slop to use in PhraseQueries.

  • :handle_parse_errors (default: true). QueryParser will quietly handle all parsing errors internally. If you’d like to handle them yourself, set this parameter to false.

  • :clean_string (default: true). QueryParser will quickly review the query string to make sure that quotes and brackets match up and special characters are escaped.

  • :max_clauses (default: 512). The maximum number of clauses allowed in Boolean queries and the maximum number of terms allowed in multi, prefix, wildcard, or fuzzy queries.

The first thing you need to think about when setting up the QueryParser is which analyzer to use. Preferably, you should use the same analyzer you used to tokenize your documents during indexing. This analyzer will be used to analyze all terms before they are added to queries, except in the case of wildcard queries, since they’ll contain * and ?, which many analyzers won’t accept. Because of this, you’ll probably need to lowercase the wildcard query if the analyzer you used was a lowercasing analyzer. The exception to this rule is the use of wildcard queries on fields that are untokenized, in which case you might want to leave them as case-sensitive. To specify whether or not wildcard queries are lowercased, you need to set the parameter :wild_card_downcase. It is set to true by default.

The next thing you need to worry about is document fields. First of all, which fields are available to be searched? When the user specifies the field he wants to search, he can use an * to search all fields. For this to work, you need to set up the QueryParser so that it knows which fields are available. Simply set the parameter :fields to an array of field names. You can get the list of available field names from an IndexReader:

query = QueryParser.new(:fields => reader.fields,
                        :tokenized_fields => reader.tokenized_fields)

The :fields parameter can be either a Symbol or an Array of Symbols. You can also set the QueryParser to validate all queries that use these fields. That is, each time a user selects a field to search, the query parser will check that that field is present in the @fields attribute, and if it isn’t, it will raise an exception. So, if your index has a :title and a :content field and the user tries to search in a :contetn field (note the misspelling), the QueryParser will raise an exception. To make the QueryParser validate fields, you need to set the :validate_fields parameter to true. It is set to false by default.
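A pure-Ruby sketch of that validation check (validate_field! and KNOWN_FIELDS are hypothetical stand-ins, shown only to illustrate what the parser does when :validate_fields is true):

```ruby
# Stand-in for QueryParser's field validation: a field outside the
# known set raises an exception instead of silently matching nothing.
KNOWN_FIELDS = [:title, :content]

def validate_field!(field)
  unless KNOWN_FIELDS.include?(field)
    raise ArgumentError, "unknown field: #{field}"
  end
  field
end

puts validate_field!(:title)   # => title
begin
  validate_field!(:contetn)    # the misspelling from the text
rescue ArgumentError => e
  puts e.message               # => unknown field: contetn
end
```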

Once you have specified which fields are available, you need to designate which of those fields you want to be searched by default. Simply set the parameter :default_field to a single field name or an array of field names. You can even set it to the symbol :*, which will specify that you want to search all fields by default. :* is in fact the default value.

Next, you must decide if you want Boolean queries to be OR or AND by default. This involves setting :or_default to true or false. By default, it is set to true, but if you want to make your search more like a regular search engine, you should set it to false.

The QueryParser handles parse errors for you by default. It does this by trying to parse the query according to the grammar. If that fails, it tries to parse it into a simple term or Boolean query, ignoring all query language tokens. If it still can’t do that, it will return an empty Boolean query. No exception will be thrown and the user will just see an empty set of results. If you’d like to handle the parse errors yourself, you can set the parameter :handle_parse_errors to false. You can then let the user know that the query she entered was invalid.

Also, to make QueryParser more robust, it has a clean_string method that basically makes sure brackets and quotes match up and that all special characters within phrase strings are properly escaped. For example, the following query:

(city:Braidwood AND shop:(Torpy's OR "Pig & Whistle

will be cleaned up as:

(city:Braidwood AND shop:(Torpy's OR "Pig \& Whistle"))

Perhaps you want to clean the query strings yourself or you would prefer to have an exception raised if the query can’t be parsed. To do this, set the :clean_string parameter to false.

Because MultiTermQueries have a :max_terms property, you can set the default value used for :max_terms by the query parser by setting its :max_clauses parameter. This will also affect the maximum number of clauses you can add to a BooleanQuery.

Ferret Query Language

The Ferret Query Language allows you to build most of the queries that you can build with Ruby code using just a simple query string. For simple queries, it matches what users have come to expect from years of using different search engines. But FQL allows you to build much more diverse queries than the usual search engine queries allow. FQL aims to be as concise as possible while still being readable and hopefully obvious to most users.

TermQuery

To express the simplest of all queries in FQL, simply type the term you wish to find:

            'ferret'

When parsed by the QueryParser, this string will be translated to a query that will search for the term “ferret” in all fields specified with the :default_field parameter.

To constrain your search to a field other than the field(s) specified by :default_field, prefix your search with the field name followed by a colon. For example, if you want to search the :title field for the term “Ruby”, you would do so like this:

            'title:Ruby'

Searching multiple fields is easy, too. Simply separate field names with a | character. So, to search :title and :content fields for “Ruby”, you would type:

            'title|content:Ruby'

You can match all fields with the * character:

            '*:Ruby'

That’s all there is to specifying the field to search. If you want to search for documents that contain the term Ruby in both the :title and :content fields, you will need to use a Boolean query.

BooleanQuery

Most readers have used Boolean queries in search applications before. The most common syntax makes use of the + and - characters, + indicating terms that must occur and - indicating terms that must not occur. So, to search for documents on “Ferret” that preferably have the term “Ruby” and must not have the term “pet”, you would type the following query:

            '+Ferret Ruby -pet'

+ and - can also be rewritten as “REQ” and “NOT”, respectively. Ferret also supports the “AND” and “OR” keywords. “AND” has precedence over “OR”, but this can be overridden with the use of parentheses: ( and ). So, to search for a chocolate or caramel sundae, you’d type:

            '(chocolate OR caramel) AND sundae'

This could also be written as:

            '+(chocolate caramel) +sundae'

It’s just a matter of personal preference.

Field constraints can be applied to individual terms or whole Boolean queries wrapped in brackets:

            '+flavour:(chocolate caramel) +name:sundae'

Inner field constraints override outer field constraints, so the following is equivalent to the previous query:

            'name:(flavour:(chocolate caramel) AND sundae)'

PhraseQuery

As you would expect, in FQL phrase queries are identified by " characters. So, to search for the phrase “quick brown fox”, your query would be just that:

            '"quick brown fox"'

But Ferret phrase queries offer a lot more. You can specify a list of options for a term in a phrase. Let’s say we don’t care if the fox is “red”, “orange”, or “brown”. You could search for the following phrase:

            '"quick red|orange|brown fox"'

We could even accept absolutely anything in a term’s position. For example, the following would match “quick hungry fox”:

            '"quick <> fox"'

In the “PhraseQuery” section earlier in this chapter, we also discussed sloppy phrase queries. The phrase slop can be indicated using the ~ character followed by an integer slop value. For example, the following query would match the phrase “quick brown and white fox”:

            '"quick fox"~3'

As with other types of query, phrase queries can have field constraints applied to them:

            'content|title:"quick fox"~3'

RangeQuery

Range queries can be specified in a couple of different ways. The [] and {} brackets represent inclusive and exclusive limits, respectively. This syntax is inherited from the Apache Lucene query syntax. Let’s say you want all documents created on or after the 25th of April, 2006, and before the 11th of November (but not on that day). You would specify the query like this:

            'created_on:[20060425 20061111}'

In FQL, you can also express upper and lower bounded range queries. The open bounds are identified by the > and < tokens. For example, if I want all documents created on or after the 25th of July, 2006, I would write the query like this:

            'created_on:[20060725>'

To find all documents created before that date, you could type this:

            'created_on:<20060725}'

Alternatively, you can use the >, <, >=, and <= tokens to specify singly bounded range queries. The previous two queries would be, respectively:

            'created_on:>= 20060725'
            'created_on:< 20060725'
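The bracket semantics map neatly onto Ruby's own inclusive and exclusive ranges, which is an easy way to remember them (this is only an analogy; FQL compares the indexed terms lexically):

```ruby
# [20060425 20061111] behaves like an inclusive Ruby range;
# [20060425 20061111} excludes the upper bound, like ... does.
inclusive       = "20060425".."20061111"     # [a b]
exclusive_upper = "20060425"..."20061111"    # [a b}

puts inclusive.cover?("20061111")            # => true
puts exclusive_upper.cover?("20061111")      # => false
puts exclusive_upper.cover?("20061110")      # => true
```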

WildcardQuery

Wildcard queries in Ferret make use of the * and ? characters. Just to reiterate what we covered in the “WildcardQuery” section earlier in this chapter, * will match any number of characters, whereas ? will match a single character only. So, the following query will match all documents with the terms “lend”, “legend”, or “lead” in the :content field:

            'content:l*e?d'

We can also use wildcard query syntax to create a few other types of queries. For example, to create a MatchAllQuery, you would type:

            '*'

Note that it makes no difference if we add a field constraint to this query. To find all documents with a price field, you might be tempted to type the following:

            'price:*'

But this will match all documents. Instead, you need to type the following:

            'price:?*'

You can also create prefix queries using wildcard syntax. Simply type the prefix and append * to the end. For example:

            'category:/programming/ruby/*'

This query will be optimized into a PrefixQuery by the QueryParser.

FuzzyQuery

In the “FuzzyQuery” section earlier in this chapter, we said that FuzzyQueries are to TermQueries as sloppy PhraseQueries are to standard PhraseQueries, so it should come as no surprise that FuzzyQueries use the same syntax as sloppy PhraseQueries. Instead of a “slop” integer, however, we have a “similarity” float, which must be between 0.0 and 1.0. Another difference is that FuzzyQueries have a default similarity of 0.5, so you don’t need to specify a similarity value at all. Let’s say, for example, that we wish to find all documents containing the commonly misspelled word “mischievous”:

            'mischievous~'

Or we could make the query more strict by increasing the similarity value, like this:

            'mischievous~0.8'

Remember that FuzzyQueries are expensive queries to use on a large index, so you may want to set the default prefix length as described at the end of the “FuzzyQuery” section.

Boosting a query in FQL

Boosting queries in FQL is a simple matter of appending ^ and a boost value to the query (see the “Boosting Queries” section earlier in this chapter). For example, let’s go back to our Boolean search for “Ferret” where the results included the term “Ruby” but not the term “pet”. In this case, “Ferret” is the most important term, so we should boost it:

            '+Ferret^10.0 Ruby^0.1 -pet'

Note that it makes no sense to boost negative clauses in a boolean query. We should also note that the boost comes after the slop in a sloppy PhraseQuery and the similarity in a FuzzyQuery:

            '"quick brown fox"~5^10.0 AND date:>=20060601'
            'mischievous~0.8^10.0 AND date:>=20060601'

Filtering Search Results

We’ve already mentioned Filters in our discussion of ConstantScoreQuery and FilteredQuery. Filters are used to apply extra constraints to a result set. For example, say we want to restrict our search to documents that were created during the last month. We have two options: add a RangeQuery clause to our query, or apply a RangeFilter. The main advantage of using a Filter over a Query is that no score is taken into account, so a Filter can be a lot faster. On top of that, Filters cache their results, so subsequent uses of the same Filter perform even better. All caching is done against an instance of an IndexReader, so a new cache needs to be built each time a Filter is used with a different IndexReader.

Filters also make it easy to apply constraints to user input queries. Filters are best used when applying commonly used constraints to a user’s query, such as restricting a search of a blog to only today’s postings or only to postings marked for publication.

There are only two standard Filters that come with Ferret:

  • RangeFilter

  • QueryFilter

Using the RangeFilter

RangeFilter takes the same parameters as RangeQuery, as described in the “RangeQuery” section earlier in this chapter. Basically, you need to supply a :field and an upper and/or lower limit for that field. For example, if you want to restrict a search to products that are priced at $50.00 or more and less than $100.00, you would build the filter like this:

price_filter = RangeFilter.new(:price, :>= => "050.00", :< => "100.00")

Note again the way we padded the price values. RangeFilter works only on fields that are correctly lexically sorted, so you need to remember to pad all number fields to a fixed width if you want to filter that field with a RangeFilter.
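A quick pure-Ruby demonstration of why the padding matters (pad_price is a hypothetical helper, not part of Ferret):

```ruby
# RangeFilter compares terms as strings, so numbers must be fixed-width
# for lexical order to agree with numeric order.
def pad_price(price)
  format("%06.2f", price)   # 50.0 => "050.00"
end

puts "50.00" < "100.00"                   # => false ("5" sorts after "1")
puts pad_price(50.0) < pad_price(100.0)   # => true
```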

Using the QueryFilter

QueryFilter makes use of a query to filter search results. The initial application of a QueryFilter will be just as slow as if you added the filter query as a :must clause to the actual query. However, after caching, subsequent use of the QueryFilter will be much faster.

A good example of where you might use a QueryFilter is to restrict a search to only published articles in a CMS (Content Management System). You would create the filter like this:

published_filter = QueryFilter.new(TermQuery.new(:state, "published"))

Remember that to take full advantage of the Filter properties you should only create this filter once and keep a handle to it. Don’t create a new QueryFilter every time the search method is invoked.

Writing Your Own Filter

Writing your own filter turns out to be pretty easy. All you need to do is implement a bits method, which takes an IndexReader and returns a BitVector. The best way to explain this is with an example. Let’s build a RangeFilter that works for floats that haven’t been padded to fixed width:

  0 require 'rubygems'
  1 require 'ferret'
  2 
  3 class FloatRangeFilter
  4   attr_accessor :field, :upper, :lower, :upper_op, :lower_op
  5 
  6   def initialize(field, options)
  7     @field = field
  8     @upper = options[:<] || options[:<=]
  9     @lower = options[:>] || options[:>=]
 10     if @upper.nil? and @lower.nil?
 11       raise ArgumentError, "Must specify a bound"
 12     end
 13     @upper_op = options[:<].nil? ? :<= : :< 
 14     @lower_op = options[:>].nil? ? :>= : :> 
 15   end
 16 
 17   def bits(index_reader)
 18     bit_vector = Ferret::Utils::BitVector.new
 19     term_doc_enum = index_reader.term_docs
 20     index_reader.terms(@field).each do |term, freq| 
 21       float = term.to_f
 22       next if @upper and not float.send(@upper_op, @upper) 
 23       next if @lower and not float.send(@lower_op, @lower) 
 24       term_doc_enum.seek(@field, term)
 25       term_doc_enum.each {|doc_id, freq| bit_vector.set(doc_id)} 
 26     end
 27     return bit_vector
 28   end
 29 
 30   def hash
 31     return @field.hash ^ @upper.hash ^ @lower.hash ^
 32            @upper_op.hash ^ @lower_op.hash
 33   end
 34 
 35   def eql?(o)
 36     return (o.instance_of?(FloatRangeFilter) and @field == o.field and
 37             @upper == o.upper and @lower == o.lower and
 38             @upper_op == o.upper_op and @lower_op == o.lower_op)
 39   end
 39 end

You instantiate this by passing a field name and one or two of the optional parameters (:<, :<=, :>, and :>=) used to specify the bounds. These optional parameters should be Floats. The most important method in this class is the bits method. Starting from line 20, it iterates through all the terms in the specified field, converts the term to a Float, and checks that it is in the required range.

There is a little bit of trickiness on lines 22 and 23 where we are checking that the term is within the required range. f.send(@upper_op, @upper) translates either to f < @upper or to f <= @upper, depending on which of the less-than parameters (:< or :<=) were passed. @upper_op gets set on line 13.
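You can see the trick in isolation in plain Ruby:

```ruby
# send invokes the comparison operator chosen at construction time, so
# one line of code covers both the :< and :<= bounds.
upper_op = :<      # would be :<= if the :<= option had been passed
upper    = 100.0

puts 99.9.send(upper_op, upper)    # => true  (99.9 < 100.0)
puts 100.0.send(upper_op, upper)   # => false (:< excludes the bound)
puts 100.0.send(:<=, upper)        # => true  (:<= includes it)
```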

Once we know that the term falls within the required range, the next step is to fill in the bits in the BitVector for all the documents in which that term appears. We do this on line 25 using a TermDocEnum. The final BitVector has a bit set for every document in the index that has a term in the specified field within the required floating-point range.

Using our new custom filter is simple. Simply pass it as the :filter parameter:

filter = FloatRangeFilter.new(:price, :< => 100.0, :>= => 10.0)
searcher.search_each("*", :filter => filter) do |d, s|
  puts "price => #{searcher[d][:price]}"
end

In this example, we would get all products with a price of $10.00 or more and less than $100.00.

:filter_proc, the New Filter

The :filter_proc parameter of the Searcher#search methods is one of the more recent additions to the Ferret arsenal. It enables you to do a lot of things that were impossible with only Filter objects. Basically, you supply a Proc object that gets called for every result in the result set. The Proc object takes three parameters: a document ID, a score, and the Searcher object. So, if you want to filter documents by geographical location, each document would need a latitude and a longitude from which you would measure the distance to a desired location:

  0 require 'rubygems'
  1 require 'ferret'
  2 index = Ferret::I.new()
  3 index << {:latitude => 100.0, :longitude => 100.0, :f => "close"}
  4 index << {:latitude => 120.0, :longitude => 120.0, :f => "too far"}
  5 index << {:latitude => 110.0, :longitude => 110.0, :f => "close"}
  6 index << {:latitude => 120.0, :longitude => 100.0, :f => "close"}
  7 index << {:latitude => 100.0, :longitude => 120.0, :f => "close"} 
  8 
  9 def make_distance_proc(latitude, longitude, limit) 
 10   Proc.new do |doc_id, score, searcher|
 11     distance_2 = (searcher[doc_id][:latitude].to_f - latitude) ** 2 +
 12                  (searcher[doc_id][:longitude].to_f - longitude) ** 2
 13     limit_2 = limit ** 2
 14     next limit_2 >= distance_2
 15   end
 16 end
 17 
 18 filter_proc =  make_distance_proc(100.0, 100.0, 20.0)
 19 index.search_each("*", :filter_proc => filter_proc) do |doc_id, score|
 20   puts "location is #{index[doc_id][:f]}"
 21 end

The first seven lines are just setting up the index with test data. The make_distance_proc method on line 9 creates a Proc that will check whether a document falls within limit units of the location specified by the latitude and longitude parameters. We simply pass this Proc to the search_each method via the :filter_proc parameter.
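The geometry inside the Proc can be checked on its own. Here it is extracted into a plain method (within_limit? is our name, not Ferret's), using the same squared-distance comparison, which avoids a Math.sqrt call per document:

```ruby
# Compare squared distances: d**2 <= limit**2 is equivalent to
# d <= limit for non-negative values, and skips the square root.
def within_limit?(doc, latitude, longitude, limit)
  distance_2 = (doc[:latitude] - latitude) ** 2 +
               (doc[:longitude] - longitude) ** 2
  distance_2 <= limit ** 2
end

close = {:latitude => 110.0, :longitude => 110.0}
far   = {:latitude => 120.0, :longitude => 120.0}
puts within_limit?(close, 100.0, 100.0, 20.0)  # => true  (d**2 = 200 <= 400)
puts within_limit?(far,   100.0, 100.0, 20.0)  # => false (d**2 = 800 >  400)
```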

Although it is called :filter_proc, you aren’t restricted to using this parameter for filtering search results. One nifty thing you can do with a :filter_proc is group results from the result set:

  0 require 'rubygems'
  1 require 'ferret'
  2 index = Ferret::I.new()
  3 index << {:value => 1, :data => "one"}
  4 index << {:value => 2, :data => "2"}
  5 index << {:value => 3, :data => "3.0"}
  6 index << {:value => 1, :data => "1.0"}
  7 index << {:value => 3, :data => "three"}
  8 index << {:value => 2, :data => "2.0"}
  9 index << {:value => 1, :data => "1"} 
 10 
 11 results = {}
 12 group_by_proc = lambda do |doc_id, score, searcher| 
 13   doc = searcher[doc_id]
 14   (results[doc[:value]]||=[]) << doc[:data]
 15   next true
 16 end
 17 
 18 index.search("*", :filter_proc => group_by_proc)
 19 puts results.inspect

Again, the first nine lines just set up the index with test data. The group_by_proc created on line 12 is the interesting part, grouping documents by the :value field and adding the :data field to the results Hash. Obviously, this is just a silly example to demonstrate how the :filter_proc works. This is easily extensible to much more interesting problems.
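The grouping idiom on line 14 is worth a closer look; it works the same way outside of Ferret:

```ruby
# (hash[key] ||= []) << value creates the bucket on first use and
# appends on every subsequent use.
results = {}
[[1, "one"], [2, "2"], [3, "3.0"], [1, "1.0"]].each do |value, data|
  (results[value] ||= []) << data
end
puts results.inspect  # => {1=>["one", "1.0"], 2=>["2"], 3=>["3.0"]}
```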

Sorting Search Results

By default, documents are sorted by relevance and then by document ID if scores are equal. But what if we want to sort the result set by the value in one of the fields (e.g., price)? One way to do this is to retrieve the entire result set and make use of Ruby’s Array#sort method. However, this would take too long for large result sets, not to mention use up a lot of unnecessary memory. Searcher provides a :sort parameter for easy sorting. The easiest way to specify a sort is to pass a sort string. A sort string is a comma-separated list of field names with an optional DESC modifier to reverse the sort for that field. The type of the field is automatically detected and the field sorted accordingly. So Float fields will be sorted by Float value, and Integer fields will be sorted by Integer value. SCORE and DOC_ID can be used in place of field names to sort by relevance and internal document ID, respectively. Here are some examples:

index.search(query, :sort => "title, year DESC")
index.search(query, :sort => "SCORE DESC, DOC_ID DESC")
index.search(query, :sort => "SCORE, rating DESC")

Although this will do the job most of the time, you can be a little more explicit in describing how a result set is sorted by using the Sort API. You will also need to use the Sort API to take full advantage of sort caching. There are two classes in the Sort API: Sort and SortField.

SortField

A SortField describes how a particular field should be sorted. To create a SortField, you need to supply a field name and a sort type. You can also optionally reverse the sort. Table 4-2 shows the available sort types. Note that sort types are identified by Symbols.

Table 4-2. Sort types

  :auto
      The default type used when we supply a string sort descriptor. Ferret will look at the first term in the field’s index to detect its type. It will sort the field either by integer, float, or string depending on that first term’s type. Be careful when using :auto to sort fields that have numbers in them. If, for example, you are sorting a field with television show titles, “24” would probably be the first term in the index, making Ferret think that the field is an integer field.

  :integer
      Converts every term in the field to an integer and sorts by those integers.

  :float
      Converts every term in the field to a float and sorts by those floats.

  :string
      Performs a locale-sensitive sort on the field. You need to make sure you have your locale set correctly for this to work. If the locale is set to ASCII or ISO-8859-1 and the field is encoded in UTF-8, the field will be incorrectly sorted.

  :byte
      Sorts terms by the order they appear in the index. This will work perfectly for ASCII data and is a lot faster than a string sort.

  :doc_id
      Sorts documents by their internal document ID. For this type of SortField, a field name is not necessary.

  :score
      Sorts documents by their relevance. This is how documents are sorted when no sort is specified. For this type of SortField, a field name is not necessary.

The SortField class also has four constant SortField objects:

  • SortField::SCORE

  • SortField::DOC_ID

  • SortField::SCORE_REV

  • SortField::DOC_ID_REV

With these constants available, you generally won’t ever need to create a SortField with the type :score or :doc_id. Here are some examples of how to create SortFields:

title_sort = SortField.new(:title, :type => :string)
path_sort = SortField.new(:path, :type => :byte)
rating_sort = SortField.new(:rating, :type => :float, :reverse => true)

Sort

The Sort object is used to hold SortFields in order of precedence to sort a result set. It is relatively straightforward to use. It also allows you to completely reverse all SortFields in one go (so already reversed fields will be reversed back to normal). Here are a couple of examples:

title_sort = SortField.new(:title, :type => :string)
path_sort = SortField.new(:path, :type => :byte)
rating_sort = SortField.new(:rating, :type => :float, :reverse => true)

sort = Sort.new([title_sort, rating_sort, SortField::SCORE])
top_docs = index.search(query, :sort => sort)

# reverse all sort-fields.
sort = Sort.new([path_sort, SortField::DOC_ID_REV], true)
top_docs = index.search(query, :sort => sort)

The Sort class also has two constants: Sort::RELEVANCE and Sort::INDEX_ORDER. Sort::RELEVANCE will order documents by score, as is done by default in Ferret. Sort::INDEX_ORDER sorts a result set in the order in which the documents were added to the index.

Sorting by Date

Possibly one of the most common sorts to perform is a sort by date. We discussed how to store date fields for sorting in the “Date Fields” section in Chapter 2. If you have stored the date field correctly (in YYYYMMDD format), it is very simple to sort by this field. The best sort type to use is :byte because it will be the fastest to create the index and otherwise performs just as well as an integer sort. Using :auto, Ferret will sort the field by integer, which will be fine as well, so it is no problem using the sort string descriptor (e.g., “updated_on, created_on DESC”). Here is how you would explicitly create a Sort to sort a date field:

updated_on = SortField.new(:updated_on, :type => :byte)
created_on = SortField.new(:created_on, :type => :byte, :reverse => true)
sort = Sort.new([updated_on, created_on, SortField::DOC_ID])
index.search(query, :sort => sort)
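The reason YYYYMMDD sorts correctly under a byte-wise comparison is simply that the format is fixed-width with the most significant digits first, so lexical order equals chronological order:

```ruby
# Fixed-width, big-endian date strings sort chronologically as bytes.
dates = ["20061111", "20060425", "19770725"]
puts dates.sort.inspect  # => ["19770725", "20060425", "20061111"]
```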

Highlighting Query Results

Query highlighting, like excerpting, is one of the newer features in Ferret, added in version 0.10. Highlighting takes a query and returns the data from a document field with all of the matches in the field highlighted. Excerpting, on the other hand, takes excerpts from the field, preferably with matching terms, and highlights the terms in those excerpts. Both Ferret::Search::Searcher and Ferret::Index::Index classes have a highlight method. In this section, we’ll look at Index#highlight because it allows us to pass string queries instead of having to build Query objects (see Table 4-3). Otherwise, both methods are essentially the same. To use the highlight method, you must supply a query and the document ID of the document you wish to highlight. A number of other parameters can be used to describe exactly how you want to highlight the field.

Table 4-3. Index#highlight parameters

  :field
      Defaults to @options[:default_field]. The highlighter only works on one field at a time, so you need to specify which field you want to highlight. If you want to highlight multiple fields, you’ll need to call this method multiple times.

  :excerpt_length
      Defaults to 150 bytes. This parameter specifies the length of excerpt to show. The algorithm for extracting excerpts attempts to fit as many matched terms into each excerpt as possible. If you’d simply like the complete field back with all matches highlighted, set this parameter to :all.

  :num_excerpts
      Specifies the number of excerpts you wish to retrieve. This defaults to 2, unless :excerpt_length is set to :all, in which case :num_excerpts is automatically set to 1.

  :pre_tag
      To highlight matches, you need to specify short strings to place before and after matches. :pre_tag defaults to <b>, which is fine when printing HTML, but if you are printing results to the console, we recommend using something like \033[36m.

  :post_tag
      Defaults to </b>. This tag should close whatever you specified in :pre_tag. Try \033[m for console applications.

  :ellipsis
      Defaults, funnily enough, to .... This is the string that is appended at the beginning and end of excerpts where the excerpts break in the middle of a field. Alternatively, you may want to use the HTML entity &#8230; or the UTF-8 string \342\200\246.

The highlight method returns an array of strings, the strings being the extracted excerpts. Example 4-1 demonstrates the flexibility of Ferret’s highlighting. We store the optional parameters in a hash to avoid specifying them for each call to the highlight method. We also use a StemmingAnalyzer to demonstrate that phrases don’t need to be exact to match. Don’t worry about how this works just yet. You’ll learn more about analysis in the next chapter.

Example 4-1. Query highlighter
require 'rubygems'
require 'ferret'

class MyAnalyzer < Ferret::Analysis::StandardAnalyzer
  def token_stream(field, input)
    Ferret::Analysis::StemFilter.new(super)
  end
end

index = Ferret::I.new(:analyzer => MyAnalyzer.new)

index << {
  :title => "Mark Twain Excerpts",
  :content => <<-EOF
 If it had not been for him, with his incendiary "Early to bed and
 early to rise," and all that sort of foolishness, I wouldn't have
 been so harried and worried and raked out of bed at such unseemly
 hours when I was young. The late Franklin was well enough in his
 way; but it would have looked more dignified in him to have gone on
 making candles and letting other people get up when they wanted to.
 - Letter from Mark Twain, San Francisco Alta California, July 25, 1869 

 When one receives a letter from a great man for the first time in
 his life, it is a large event to him, as all of you know by your own
 experience. You never can receive letters enough from famous men
 afterward to obliterate that one, or dim the memory of the pleasant
  surprise it was, and the gratification it gave you.
   - Mark Twain's Speeches, "Unconscious Plagiarism"
EOF
}

options = {
  :field => :content,
  :pre_tag => "\033[36m",
  :post_tag => "\033[m",
  :ellipsis => " \342\200\246 "
}
query = '"Early <> Bed" "receive letter"~1 Twain early'

puts "_" * 60 + "\n\t*** Extract two excerpts ***\n\n"
puts index.highlight(query, 0, options)

puts "_" * 60 + "\n\t*** Extract four smaller excerpts ***\n\n"
options[:num_excerpts] = 4
options[:excerpt_length] = 50
puts index.highlight(query, 0, options)

puts "_" * 60 + "\n\t*** Highlight the entire field ***\n\n"
options[:excerpt_length] = :all
puts index.highlight(query, 0, options)

You’ll notice here that the second example that’s supposed to extract four excerpts of length 50 bytes actually extracts two excerpts of 50 bytes and one of 100 bytes. The excerpting algorithm works by attempting to place the excerpts so that the maximum number of matched terms will be shown. If it can concatenate two or more excerpts without reducing the number of matched terms shown, it will.

Summary

Now that we’ve covered Ferret’s search API, you should know how and when to use Queries and Filters, the pros and cons of each, and when to take advantage of Ferret’s QueryParser. You’ve learned how to sort your result sets and what to do about sorting performance problems. You should now be able to design the search feature of your application to best suit the needs of your users, keeping in mind the resources used by different queries, filters, and sorts.
