We’ve already mentioned Filters in our discussion of ConstantScoreQuery and FilteredQuery. Filters are used to apply extra constraints to a result set. For example, suppose we want to restrict our search to documents that were created during the last month. We have two options: add a RangeQuery clause to our query, or apply a RangeFilter. The main advantage of using a Filter over a Query is that a Filter does not compute scores, so it can be a lot faster. In addition, Filters cache their results, so subsequent uses of the same Filter are faster still. All caching is done against an instance of an IndexReader, so a new cache needs to be built each time a Filter is used against a different IndexReader.
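To make the caching behavior concrete, here is a minimal pure-Ruby sketch. It does not use Ferret at all: CachingFilterSketch, READER_A, READER_B, and PUBLISHED_SKETCH are hypothetical names, and each "reader" is just an Array of documents. It only illustrates the idea of one cached result per reader object, not how Ferret implements it.

```ruby
# Hypothetical stand-in for a filter that caches its result per reader.
# None of these names come from Ferret; they only illustrate the idea.
class CachingFilterSketch
  attr_reader :computations

  def initialize(&matcher)
    @matcher = matcher                 # block deciding whether a document passes
    @cache = {}.compare_by_identity    # one cached result per reader object
    @computations = 0                  # how many times the work was really done
  end

  # Returns the IDs of the documents that pass, computing them only the
  # first time a given reader is seen.
  def bits(reader)
    @cache[reader] ||= begin
      @computations += 1
      reader.each_index.select { |doc_id| @matcher.call(reader[doc_id]) }
    end
  end
end

# Two "readers", modeled here as plain arrays of documents (an assumption).
READER_A = [{ :state => "published" }, { :state => "draft" }]
READER_B = [{ :state => "draft" }, { :state => "published" }]

PUBLISHED_SKETCH = CachingFilterSketch.new { |doc| doc[:state] == "published" }

PUBLISHED_SKETCH.bits(READER_A)   # computed
PUBLISHED_SKETCH.bits(READER_A)   # served from the cache
PUBLISHED_SKETCH.bits(READER_B)   # different reader, so computed again
puts PUBLISHED_SKETCH.computations
```

The counter shows that the second call against the same reader does no work, while a different reader triggers a fresh computation.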
Filters also make it easy to apply constraints to user input queries. Filters are best used when applying commonly used constraints to a user’s query, such as restricting a search of a blog to only today’s postings or only to postings marked for publication.
There are only two standard Filters that come with Ferret: RangeFilter and QueryFilter.
RangeFilter takes the same parameters as RangeQuery, as described in the “RangeQuery” section earlier in this chapter. Basically, you need to supply a :field and an upper and/or lower limit for that field. For example, if you want to restrict a search to products that are priced at $50.00 or more and less than $100.00, you would build the filter like this:

    price_filter = RangeFilter.new(:price, :>= => "050.00", :< => "100.00")

Note again the way we padded the price values. RangeFilter works only on fields whose terms sort correctly in lexical (string) order, so you need to remember to pad all number fields to a fixed width if you want to filter that field with a RangeFilter.
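The reason for the padding is easy to check in plain Ruby. The following sketch assumes a helper named pad_price, which is hypothetical and not part of Ferret; it formats a price to a fixed width so that string order agrees with numeric order.

```ruby
# Hypothetical helper (not part of Ferret): format a price as a fixed-width
# string so that string order matches numeric order.
def pad_price(price, width = 6)
  format("%0#{width}.2f", price)
end

# Unpadded strings compare character by character, so "9.99" sorts
# after "100.00" even though 9.99 is the smaller number.
puts "9.99" < "100.00"                     # false
puts pad_price(9.99) < pad_price(100.0)    # true

# Sorting the padded strings gives the numeric order.
puts [100.0, 9.99, 50.0].map { |p| pad_price(p) }.sort.inspect
```

A width of 6 matches the "050.00" style used in the filter above; a field with larger values would need a larger width applied to every value at index time as well as in the filter bounds.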
QueryFilter makes use of a query to filter search results. The initial application of a QueryFilter will be just as slow as if you added the filter query as a :must clause to the actual query. However, after caching, subsequent use of the QueryFilter will be much faster.
A good example of where you might use a QueryFilter is to restrict a search to only published articles in a CMS (Content Management System). You would create the filter like this:

    published_filter = QueryFilter.new(TermQuery.new(:state, "published"))

Remember that to take full advantage of the Filter’s caching, you should create this filter only once and keep a handle to it. Don’t create a new QueryFilter every time the search method is invoked.
Writing your own filter turns out to be pretty easy. All you need to do is implement a bits method, which takes an IndexReader and returns a BitVector. The best way to explain this is with an example. Let’s build a RangeFilter that works for floats that haven’t been padded to fixed width:

     0  require 'rubygems'
     1  require 'ferret'
     2
     3  class FloatRangeFilter
     4    attr_accessor :field, :upper, :lower, :upper_op, :lower_op
     5
     6    def initialize(field, options)
     7      @field = field
     8      @upper = options[:<] || options[:<=]
     9      @lower = options[:>] || options[:>=]
    10      if @upper.nil? and @lower.nil?
    11        raise ArgumentError, "Must specify a bound"
    12      end
    13      @upper_op = options[:<].nil? ? :<= : :<   # exclusive only if :< was given
    14      @lower_op = options[:>].nil? ? :>= : :>   # exclusive only if :> was given
    15    end
    16
    17    def bits(index_reader)
    18      bit_vector = Ferret::Utils::BitVector.new
    19      term_doc_enum = index_reader.term_docs
    20      index_reader.terms(@field).each do |term, freq|
    21        float = term.to_f
    22        next if @upper and not float.send(@upper_op, @upper)
    23        next if @lower and not float.send(@lower_op, @lower)
    24        term_doc_enum.seek(@field, term)
    25        term_doc_enum.each {|doc_id, freq| bit_vector.set(doc_id)}
    26      end
    27      return bit_vector
    28    end
    29
    30    def hash
    31      return @field.hash ^ @upper.hash ^ @lower.hash ^
    32             @upper_op.hash ^ @lower_op.hash
    33    end
    34
    35    def eql?(o)
    36      return (o.instance_of?(FloatRangeFilter) and @field == o.field and
    37              @upper == o.upper and @lower == o.lower and
    38              @upper_op == o.upper_op and @lower_op == o.lower_op)
    39    end
    40  end

You instantiate this by passing a field name and one or two of the optional parameters (:<, :<=, :>, and :>=) used to specify the bounds. These optional parameters should be Floats. The most important method in this class is the bits method. Starting from line 20, it iterates through all the terms in the specified field, converts the term to a Float, and checks that it is in the required range.
There is a little bit of trickiness on lines 22 and 23 where we are checking that the term is within the required range. float.send(@upper_op, @upper) translates either to float < @upper or to float <= @upper, depending on which of the less-than parameters (:< or :<=) was passed. @upper_op gets set on line 13.
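If the send call looks opaque, the following plain-Ruby sketch shows the same dispatch in isolation. The helper within_bounds? is a hypothetical name introduced only for illustration; it mirrors the two checks on lines 22 and 23 but is not part of the filter or of Ferret.

```ruby
# Hypothetical helper mirroring the filter's two bound checks.
# A nil bound means "no limit on that side".
def within_bounds?(value, lower, lower_op, upper, upper_op)
  return false if upper && !value.send(upper_op, upper)
  return false if lower && !value.send(lower_op, lower)
  true
end

# Comparison operators are ordinary methods, so send can pick one at runtime.
puts 100.0.send(:<, 100.0)     # same as 100.0 < 100.0
puts 100.0.send(:<=, 100.0)    # same as 100.0 <= 100.0

# The same value passes or fails depending on the operator that was stored.
puts within_bounds?(100.0, 10.0, :>=, 100.0, :<)
puts within_bounds?(100.0, 10.0, :>=, 100.0, :<=)
```

Storing the operator as a symbol keeps the loop body identical for inclusive and exclusive bounds, at the cost of one dynamic method call per term.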
Once we know that the term falls within the required range, the next step is to fill in the bits in the BitVector for all the documents in which that term appears. We do this on line 25 using a TermDocEnum. The final BitVector has a bit set for every document in the index that has a term in the specified field within the required floating-point range.
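The term-to-document step can be pictured without an index at all. In the sketch below, PRICE_POSTINGS is a hypothetical in-memory table mapping each term to the IDs of the documents containing it, and a plain Integer stands in for the BitVector. Both are assumptions made for illustration; Ferret’s own classes are not used.

```ruby
# Hypothetical in-memory postings: each term in a :price field mapped to the
# IDs of the documents that contain it. A real index would supply these.
PRICE_POSTINGS = {
  "9.5"   => [0],
  "10"    => [1, 4],
  "99.99" => [2],
  "100"   => [3]
}

# Returns an Integer used as a bit set: bit n is 1 when document n has a
# price term with lower <= price < upper.
def price_bits(postings, lower, upper)
  bits = 0
  postings.each do |term, doc_ids|
    price = term.to_f                       # terms are strings; compare as floats
    next unless price >= lower && price < upper
    doc_ids.each { |doc_id| bits |= (1 << doc_id) }
  end
  bits
end

MATCHING_BITS = price_bits(PRICE_POSTINGS, 10.0, 100.0)
puts MATCHING_BITS.to_s(2)
puts (0..4).select { |doc_id| MATCHING_BITS[doc_id] == 1 }.inspect
```

Note that the terms are unpadded strings such as "10" and "9.5"; converting each one with to_f is what lets the range check work without fixed-width padding.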
Using our new custom filter is simple. Just pass it as the :filter parameter:

    filter = FloatRangeFilter.new(:price, :< => 100.0, :>= => 10.0)
    searcher.search_each("*", :filter => filter) do |d, s|
      puts "price => #{searcher[d][:price]}"
    end

In this example, we would get all products with a price of $10.00 or more and less than $100.00.
The :filter_proc parameter of the Searcher#search methods is one of the more recent additions to the Ferret arsenal. It enables you to do a lot of things that were impossible with only Filter objects. Basically, you supply a Proc object that gets called for every result in the result set. The Proc object takes three parameters: a document ID, a score, and the Searcher object. So, if you want to filter documents by geographical location, each document would need a latitude and a longitude from which you would measure the distance to a desired location:

     0  require 'rubygems'
     1  require 'ferret'
     2  index = Ferret::I.new()
     3  index << {:latitude => 100.0, :longitude => 100.0, :f => "close"}
     4  index << {:latitude => 120.0, :longitude => 120.0, :f => "too far"}
     5  index << {:latitude => 110.0, :longitude => 110.0, :f => "close"}
     6  index << {:latitude => 120.0, :longitude => 100.0, :f => "close"}
     7  index << {:latitude => 100.0, :longitude => 120.0, :f => "close"}
     8
     9  def make_distance_proc(latitude, longitude, limit)
    10    Proc.new do |doc_id, score, searcher|
    11      distance_2 = (searcher[doc_id][:latitude].to_f - latitude) ** 2 +
    12                   (searcher[doc_id][:longitude].to_f - longitude) ** 2
    13      limit_2 = limit ** 2
    14      next limit_2 >= distance_2
    15    end
    16  end
    17
    18  filter_proc = make_distance_proc(100.0, 100.0, 20.0)
    19  index.search_each("*", :filter_proc => filter_proc) do |doc_id, score|
    20    puts "location is #{index[doc_id][:f]}"
    21  end

Lines 0 to 7 just set up the index with test data. The make_distance_proc method on line 9 creates a Proc that checks whether a document falls within limit of the location specified by the latitude and longitude parameters. The example compares squared straight-line distances on the raw coordinate values, so limit is expressed in the same units as the coordinates rather than in kilometers. We simply pass this Proc to the search_each method via the :filter_proc parameter.
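To see which documents the proc keeps, you can call it by hand. The sketch below repeats make_distance_proc so that it runs on its own, and replaces the searcher with FAKE_SEARCHER, a hypothetical Array indexed by document ID that holds the same coordinates as the test data. The score argument is a dummy value because the proc ignores it.

```ruby
# The proc from the listing, repeated so this sketch runs without Ferret.
def make_distance_proc(latitude, longitude, limit)
  Proc.new do |doc_id, score, searcher|
    distance_2 = (searcher[doc_id][:latitude].to_f - latitude) ** 2 +
                 (searcher[doc_id][:longitude].to_f - longitude) ** 2
    limit ** 2 >= distance_2    # true keeps the document, false drops it
  end
end

# Hypothetical stand-in for the searcher: an Array indexed by document ID.
FAKE_SEARCHER = [
  { :latitude => 100.0, :longitude => 100.0 },
  { :latitude => 120.0, :longitude => 120.0 },
  { :latitude => 110.0, :longitude => 110.0 },
  { :latitude => 120.0, :longitude => 100.0 },
  { :latitude => 100.0, :longitude => 120.0 }
]

NEARBY = make_distance_proc(100.0, 100.0, 20.0)

# Call the proc once per document, the way a search would for each hit.
KEPT_IDS = FAKE_SEARCHER.each_index.select do |doc_id|
  NEARBY.call(doc_id, 1.0, FAKE_SEARCHER)
end
puts KEPT_IDS.inspect
```

Document 1 is the only one dropped: its squared distance from (100.0, 100.0) is 800.0, which exceeds the squared limit of 400.0. Documents 3 and 4 sit exactly on the limit and are kept because the comparison is inclusive.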
Although it is called :filter_proc, you aren’t restricted to using this parameter for filtering search results. One nifty thing you can do with a :filter_proc is group results from the result set:

     0  require 'rubygems'
     1  require 'ferret'
     2  index = Ferret::I.new()
     3  index << {:value => 1, :data => "one"}
     4  index << {:value => 2, :data => "2"}
     5  index << {:value => 3, :data => "3.0"}
     6  index << {:value => 1, :data => "1.0"}
     7  index << {:value => 3, :data => "three"}
     8  index << {:value => 2, :data => "2.0"}
     9  index << {:value => 1, :data => "1"}
    10
    11  results = {}
    12  group_by_proc = lambda do |doc_id, score, searcher|
    13    doc = searcher[doc_id]
    14    (results[doc[:value]] ||= []) << doc[:data]
    15    next true
    16  end
    17
    18  index.search("*", :filter_proc => group_by_proc)
    19  puts results.inspect

Again, lines 0 to 9 just set up the index with test data. The group_by_proc created on line 12 is the interesting part: it groups documents by the :value field, adding each document’s :data field to the results Hash. Because the proc always returns true, no document is filtered out. Obviously, this is just a silly example to demonstrate how the :filter_proc works, but it is easily extensible to much more interesting problems.
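The grouping logic itself can be exercised without an index. In this sketch GROUP_DOCS is a hypothetical Array standing in for the searcher, GROUPS plays the role of the results Hash, and the proc is called once per document with a dummy score. Field values are kept as the Ruby objects shown here, which is an assumption of the sketch; a real index may return stored values in a different form.

```ruby
# Hypothetical stand-in for the searcher: an Array indexed by document ID.
GROUP_DOCS = [
  { :value => 1, :data => "one" },
  { :value => 2, :data => "2" },
  { :value => 3, :data => "3.0" },
  { :value => 1, :data => "1.0" },
  { :value => 3, :data => "three" },
  { :value => 2, :data => "2.0" },
  { :value => 1, :data => "1" }
]

GROUPS = {}
GROUP_BY_PROC = lambda do |doc_id, score, searcher|
  doc = searcher[doc_id]
  (GROUPS[doc[:value]] ||= []) << doc[:data]   # append to this value's group
  true   # keep every document; the proc is used only for its side effect
end

# Call the proc once per document, the way a search would for each hit.
GROUP_DOCS.each_index { |doc_id| GROUP_BY_PROC.call(doc_id, 1.0, GROUP_DOCS) }
puts GROUPS.inspect
```

Each key of GROUPS ends up holding the :data values of the documents that share that :value, in the order the documents were visited.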