Filtering Search Results

We’ve already mentioned Filters in our discussion of ConstantScoreQuery and FilteredQuery. Filters are used to apply extra constraints to a result set. For example, suppose we want to restrict our search to documents that were created during the last month. We have two options: add a RangeQuery clause to our query, or apply a RangeFilter. The main advantage of using a Filter over a Query is that no score is computed, so a Filter can be a lot faster. In addition, Filters cache their results, so subsequent uses of the same Filter are faster still. All caching is done against an instance of an IndexReader, so a new cache needs to be built each time a Filter is used against a different IndexReader.

Filters also make it easy to apply constraints to user input queries. Filters are best used when applying commonly used constraints to a user’s query, such as restricting a search of a blog to only today’s postings or only to postings marked for publication.

There are only two standard Filters that come with Ferret:

RangeFilter
QueryFilter

Using the RangeFilter

RangeFilter takes the same parameters as RangeQuery, as described in the “RangeQuery” section earlier in this chapter. Basically, you need to supply a :field and an upper and/or lower limit for that field. For example, if you want to restrict a search to products that are priced at $50.00 or more and less than $100.00, you would build the filter like this:

price_filter = RangeFilter.new(:price, :>= => "050.00", :< => "100.00")

Note again the way we padded the price values. RangeFilter works only on fields that are correctly lexically sorted, so you need to remember to pad all number fields to a fixed width if you want to filter that field with a RangeFilter.
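To see why the padding matters, here is a small sketch in plain Ruby that needs no index at all. The helper name pad_price and the six-character width are assumptions chosen to match the "050.00" style used above; they are not part of Ferret.

```ruby
# Hypothetical helper: format a price as a fixed-width, zero-padded string
# so that string (lexical) order matches numeric order.
def pad_price(price)
  format("%06.2f", price)
end

# Sort prices by their unpadded string form (character by character).
def sort_unpadded(prices)
  prices.map(&:to_s).sort
end

# Sort prices by their padded string form.
def sort_padded(prices)
  prices.map { |price| pad_price(price) }.sort
end

prices = [9.99, 50.0, 100.0]
puts sort_unpadded(prices).inspect  # "100.0" sorts before "50.0" and "9.99"
puts sort_padded(prices).inspect    # same order as the numbers themselves
```

A width of six only works for prices below 1000.00; a real index would need a width large enough for its largest value.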

Using the QueryFilter

QueryFilter makes use of a query to filter search results. The initial application of a QueryFilter will be just as slow as if you added the filter query as a :must clause to the actual query. However, after caching, subsequent use of the QueryFilter will be much faster.

A good example of where you might use a QueryFilter is to restrict a search to only published articles in a CMS (Content Management System). You would create the filter like this:

published_filter = QueryFilter.new(TermQuery.new(:state, "published"))

Remember that to take full advantage of the Filter properties you should only create this filter once and keep a handle to it. Don’t create a new QueryFilter every time the search method is invoked.

Writing Your Own Filter

Writing your own filter turns out to be pretty easy. All you need to do is implement a bits method, which takes an IndexReader and returns a BitVector. The best way to explain this is with an example. Let’s build a RangeFilter that works for floats that haven’t been padded to fixed width:

  0 require 'rubygems'
  1 require 'ferret'
  2 
  3 class FloatRangeFilter
  4   attr_accessor :field, :upper, :lower, :upper_op, :lower_op
  5 
  6   def initialize(field, options)
  7     @field = field
  8     @upper = options[:<] || options[:<=]
  9     @lower = options[:>] || options[:>=]
 10     if @upper.nil? and @lower.nil?
  11       raise ArgumentError, "Must specify a bound"
 12     end
 13     @upper_op = options[:<].nil? ? :<= : :< 
 14     @lower_op = options[:>].nil? ? :>= : :> 
 15   end
 16 
 17   def bits(index_reader)
 18     bit_vector = Ferret::Utils::BitVector.new
 19     term_doc_enum = index_reader.term_docs
 20     index_reader.terms(@field).each do |term, freq| 
 21       float = term.to_f
 22       next if @upper and not float.send(@upper_op, @upper) 
 23       next if @lower and not float.send(@lower_op, @lower) 
 24       term_doc_enum.seek(@field, term)
 25       term_doc_enum.each {|doc_id, freq| bit_vector.set(doc_id)} 
 26     end
 27     return bit_vector
 28   end
 29 
 30   def hash
 31     return @field.hash ^ @upper.hash ^ @lower.hash ^
 32            @upper_op.hash ^ @lower_op.hash
 33   end
 34 
 35   def eql?(o)
 36     return (o.instance_of?(FloatRangeFilter) and @field == o.field and
 37             @upper == o.upper and @lower == o.lower and
 38             @upper_op == o.upper_op and @lower_op == o.lower_op)
 39   end
  40 end

You instantiate this by passing a field name and one or two of the optional parameters (:<, :<=, :>, and :>=) used to specify the bounds. These optional parameters should be Floats. The most important method in this class is the bits method. Starting from line 20, it iterates through all the terms in the specified field, converts the term to a Float, and checks that it is in the required range.

There is a little bit of trickiness on lines 22 and 23, where we check that the term is within the required range. float.send(@upper_op, @upper) translates either to float < @upper or to float <= @upper, depending on which of the less-than parameters (:< or :<=) was passed. @upper_op gets set on line 13.
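The send trick can be tried in isolation. The following is a minimal sketch in plain Ruby; the method name within_upper? is made up for illustration and does not appear in the filter above.

```ruby
# Pick the comparison operator at runtime and apply it with Object#send.
def within_upper?(value, bound, inclusive)
  op = inclusive ? :<= : :<
  value.send(op, bound)  # equivalent to value <= bound or value < bound
end

puts within_upper?(100.0, 100.0, true)   # inclusive bound accepts the limit
puts within_upper?(100.0, 100.0, false)  # exclusive bound rejects the limit
```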

Once we know that the term falls within the required range, the next step is to fill in the bits in the BitVector for all the documents in which that term appears. We do this on line 25 using a TermDocEnum. The final BitVector has a bit set for every document in the index that has a term in the specified field within the required floating-point range.
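The shape of that loop can be sketched without Ferret. In the example below, the PRICE_TERMS hash is a hypothetical stand-in for what an IndexReader and TermDocEnum would supply, and a plain Integer plays the role of the BitVector; it is an illustration of the idea, not how Ferret stores its bits.

```ruby
# Stand-in for the term dictionary of one field: term text => ids of the
# documents that contain that term.
PRICE_TERMS = {
  "9.5"    => [0, 3],
  "25"     => [1],
  "100.0"  => [2],
  "250.75" => [4]
}

# Build a bit set (a plain Integer) with one bit set per matching document.
# The range is lower-inclusive and upper-exclusive.
def range_bits(terms, lower, upper)
  bits = 0
  terms.each do |term, doc_ids|
    value = term.to_f
    next unless value >= lower && value < upper
    doc_ids.each { |doc_id| bits |= (1 << doc_id) }
  end
  bits
end

bits = range_bits(PRICE_TERMS, 5.0, 100.0)
matching = (0..4).select { |doc_id| bits[doc_id] == 1 }
puts bits.to_s(2)      # binary view of the bit set
puts matching.inspect  # ids of the documents whose bit is set
```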

Using our new custom filter is simple: pass it as the :filter parameter.

filter = FloatRangeFilter.new(:price, :< => 100.0, :>= => 10.0)
searcher.search_each("*", :filter => filter) do |d, s|
  puts "price => #{searcher[d][:price]}"
end

In this example, we would get all products with a price of $10.00 or more and less than $100.00.

Filter Caching Explained

You could easily stop after implementing the bits method and everything would work as expected. However, to make good use of Filter caching, you should implement hash and eql? methods. Whenever the bits method of a filter is called with an IndexReader, the BitVector that is returned is cached for that IndexReader. So, the next time the filter is used with the same IndexReader, the cached BitVector is used. The cache is stored in a Hash, so you should implement hash and eql? methods in your custom Filter.
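The effect of hash and eql? on a Hash-backed cache can be shown in plain Ruby. The two classes below are hypothetical and exist only to model the behavior; the cache here is an ordinary Ruby Hash, not Ferret's internal one.

```ruby
# A filter-like class that relies on Ruby's default, identity-based hash.
class PlainFilter
  attr_reader :field

  def initialize(field)
    @field = field
  end
end

# The same class with value-based hash and eql? methods.
class KeyedFilter < PlainFilter
  def hash
    @field.hash
  end

  def eql?(other)
    other.instance_of?(KeyedFilter) && @field == other.field
  end
end

# Use two separately created but equivalent filters as cache keys and
# report how many cache entries result.
def cache_entries(filter_class)
  cache = {}
  2.times do
    filter = filter_class.new(:price)
    cache[filter] ||= :expensive_bit_vector  # placeholder for a BitVector
  end
  cache.size
end

puts cache_entries(PlainFilter)  # each instance gets its own entry
puts cache_entries(KeyedFilter)  # equivalent instances share one entry
```

Without the two methods, every new filter instance misses the cache even when it describes exactly the same constraint.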

While we are on the topic, we should talk about the memory used by filters. If you have a million-document index, each cached BitVector is going to take 1,000,000/8 = 125,000 bytes, or about 125 KB of memory (one bit for each document in the index). Creating too many filters could lead to a memory problem. However, each cached BitVector will be destroyed when its corresponding filter or IndexReader is destroyed, so as long as you don’t keep references to old filters (which would keep them from being garbage-collected), you shouldn’t have a problem.
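The arithmetic is easy to check. This sketch counts only the raw bits, one per document, and ignores any per-object overhead, which is an assumption; the method name is made up for illustration.

```ruby
# Bytes needed for a bit vector with one bit per document, rounded up
# to a whole byte.
def bit_vector_bytes(doc_count)
  (doc_count + 7) / 8
end

puts bit_vector_bytes(1_000_000)       # bytes for one cached filter
puts bit_vector_bytes(1_000_000) * 50  # bytes for fifty cached filters
```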

:filter_proc, the New Filter

The :filter_proc parameter of the Searcher#search methods is one of the more recent additions to the Ferret arsenal. It enables you to do a lot of things that were impossible with only Filter objects. Basically, you supply a Proc object that gets called for every result in the result set. The Proc object takes three parameters: a document ID, a score, and the Searcher object. For example, to filter documents by geographical location, each document needs a latitude and a longitude, from which you measure the distance to a desired location:

  0 require 'rubygems'
  1 require 'ferret'
  2 index = Ferret::I.new()
  3 index << {:latitude => 100.0, :longitude => 100.0, :f => "close"}
   4 index << {:latitude => 120.0, :longitude => 120.0, :f => "too far"}
  5 index << {:latitude => 110.0, :longitude => 110.0, :f => "close"}
  6 index << {:latitude => 120.0, :longitude => 100.0, :f => "close"}
  7 index << {:latitude => 100.0, :longitude => 120.0, :f => "close"} 
  8 
  9 def make_distance_proc(latitude, longitude, limit) 
 10   Proc.new do |doc_id, score, searcher|
 11     distance_2 = (searcher[doc_id][:latitude].to_f - latitude) ** 2 +
 12                  (searcher[doc_id][:longitude].to_f - longitude) ** 2
 13     limit_2 = limit ** 2
 14     next limit_2 >= distance_2
 15   end
 16 end
 17 
 18 filter_proc =  make_distance_proc(100.0, 100.0, 20.0)
 19 index.search_each("*", :filter_proc => filter_proc) do |doc_id, score|
 20   puts "location is #{index[doc_id][:f]}"
  21 end

Lines 0 through 7 just set up the index with test data. The make_distance_proc method on line 9 creates a Proc that checks whether a document falls within limit of the location specified by the latitude and longitude parameters. The example treats latitude and longitude as flat coordinates and compares squared distances, so limit is measured in the same units as the coordinates rather than in kilometers. We simply pass this Proc to the search_each method via the :filter_proc parameter.
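The distance check itself can be exercised without an index. In this sketch the DOCS hash is a hypothetical stand-in for the searcher (anything that answers [] with a document's fields will do), and nearby_ids is a made-up helper that calls the Proc once per document, roughly as a search would for each hit.

```ruby
# Stand-in for a searcher: doc id => stored fields. Values are strings
# here, which is why the Proc converts them with to_f.
DOCS = {
  0 => {:latitude => "100.0", :longitude => "100.0"},
  1 => {:latitude => "120.0", :longitude => "120.0"},
  2 => {:latitude => "110.0", :longitude => "110.0"}
}

# Same logic as the listing above: compare squared distances so that no
# square root is needed.
def make_distance_proc(latitude, longitude, limit)
  limit_2 = limit ** 2
  Proc.new do |doc_id, score, searcher|
    doc = searcher[doc_id]
    distance_2 = (doc[:latitude].to_f - latitude) ** 2 +
                 (doc[:longitude].to_f - longitude) ** 2
    limit_2 >= distance_2
  end
end

# Hypothetical driver: keep the ids for which the Proc returns true.
def nearby_ids(docs, latitude, longitude, limit)
  keep = make_distance_proc(latitude, longitude, limit)
  docs.keys.select { |doc_id| keep.call(doc_id, 1.0, docs) }
end

puts nearby_ids(DOCS, 100.0, 100.0, 20.0).inspect
```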

Although it is called :filter_proc, you aren’t restricted to using this parameter for filtering search results. One nifty thing you can do with a :filter_proc is group results from the result set:

  0 require 'rubygems'
  1 require 'ferret'
  2 index = Ferret::I.new()
  3 index << {:value => 1, :data => "one"}
  4 index << {:value => 2, :data => "2"}
  5 index << {:value => 3, :data => "3.0"}
  6 index << {:value => 1, :data => "1.0"}
  7 index << {:value => 3, :data => "three"}
  8 index << {:value => 2, :data => "2.0"}
  9 index << {:value => 1, :data => "1"} 
 10 
 11 results = {}
 12 group_by_proc = lambda do |doc_id, score, searcher| 
 13   doc = searcher[doc_id]
 14   (results[doc[:value]]||=[]) << doc[:data]
 15   next true
 16 end
 17 
 18 index.search("*", :filter_proc => group_by_proc)
  19 puts results.inspect

Again, lines 0 through 9 just set up the index with test data. The group_by_proc created on line 12 is the interesting part: it groups documents by the :value field and adds each document’s :data field to the results Hash. Obviously, this is just a silly example to demonstrate how the :filter_proc works, but it is easily extended to much more interesting problems.


Ferret by David Balmain
