You’ve now been introduced to all the different types of queries
available in Ferret, and you’ve learned how to build different queries by
hand. Some of it probably seems like a lot of work, and it’s certainly not
something you’d ask a user to do. Luckily, we can leave most of the work
to the Ferret QueryParser. You’ve already seen many examples of the Ferret
Query Language (FQL) in the previous section (“Building Queries”), and
you’ll have noticed that most of the queries you can build in code can
be described much more easily in FQL. In this section, we’ll talk about
setting up the QueryParser, and then we’ll go into more detail about FQL.
The QueryParser has a number of parameters, as shown in Table 4-1.
Table 4-1. QueryParser parameters
Parameter | Default | Short description |
---|---|---|
:default_field | :* | The default field to be searched; it can also be an array of fields. |
:analyzer | StandardAnalyzer | Analyzer used by the query parser to parse query terms. |
:wild_card_downcase | true | Specifies whether wildcard queries should be downcased, since they are not analyzed by the parser. |
:fields | [] | Lets the query parser know which fields are available for searching, particularly when :* is specified as the search field. |
:validate_fields | false | Set to true if you want an exception to be raised when there is an attempt to search a nonexistent field. |
:or_default | true | Use OR as the default Boolean operator. |
:default_slop | 0 | Default slop to use in PhraseQueries. |
:handle_parse_errors | true | QueryParser will quietly handle all parsing errors internally. If you’d like to handle them yourself, set this parameter to false. |
:clean_string | true | QueryParser will quickly review the query string to make sure that quotes and brackets match up and special characters are escaped. |
:max_clauses | 512 | The maximum number of clauses allowed in Boolean queries and the maximum number of terms allowed in multi, prefix, wildcard, or fuzzy queries. |
The first thing you need to think about when setting up the
QueryParser is which analyzer to use. Preferably, you should use the same
analyzer you used to tokenize your documents during indexing. This
analyzer will be used to analyze all terms before they are added to
queries, except in the case of wildcard queries, since they’ll contain
* and ?, which many analyzers won’t accept. Because of this, you’ll
probably need to lowercase the wildcard query if the analyzer you used
was a lowercasing analyzer. The exception to this rule is the use of
wildcard queries on fields that are untokenized, in which case you might
want to leave them as case-sensitive. To specify whether or not wildcard
queries are lowercased, you need to set the parameter
:wild_card_downcase. It is set to true by default.
The next thing you need to worry about is document fields. First
of all, which fields are available to be searched? When the user
specifies the field he wants to search, he can use an * to search all
fields. For this to work, you need to set up the QueryParser so that it
knows which fields are available. Simply set the parameter :fields to an
array of field names. You can get the list of available field names from
an IndexReader:
query = QueryParser.new(:fields => reader.fields,
                        :tokenized_fields => reader.tokenized_fields)
The :fields parameter can be either a Symbol or an Array of Symbols.
You can also set the QueryParser to validate all queries that use these
fields. That is, each time a user selects a field to search, the query
parser will check that the field is present in the @fields attribute,
and if it isn’t, it will raise an exception. So, if your index has a
:title and a :content field and the user tries to search in a :contetn
field (note the misspelling), the QueryParser will raise an exception.
To make the QueryParser validate fields, you need to set the
:validate_fields parameter to true. It is set to false by default.
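The validation behaviour described above can be illustrated with a small plain-Ruby sketch. This is not Ferret's implementation: the check_field helper and the UnknownFieldError class are hypothetical names introduced only for illustration, and the real QueryParser raises its own exception type.

```ruby
# Hypothetical error class, used only in this sketch.
class UnknownFieldError < StandardError; end

# Returns the field as a Symbol. When validate is true, a field that is
# not in known_fields raises; the :* wildcard is always accepted.
def check_field(field, known_fields, validate: false)
  field = field.to_sym
  return field if field == :"*" || known_fields.include?(field)
  raise UnknownFieldError, "unknown field: #{field}" if validate
  field
end

known = [:title, :content]
check_field("title", known, validate: true)    # accepted
check_field("contetn", known)                  # accepted, validation is off
begin
  check_field("contetn", known, validate: true)
rescue UnknownFieldError => e
  puts e.message
end
```

With validation off, the misspelled field simply passes through, which is why a typo in a field name would otherwise tend to produce an empty result set rather than an error.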
Once you have specified which fields are available, you need to
designate which of those fields you want to be searched by default.
Simply set the parameter :default_field to a single field name or an
array of field names. You can even set it to the symbol :*, which will
specify that you want to search all fields by default. :* is in fact
the default value.
Next, you must decide if you want Boolean queries to be OR or AND
by default. This involves setting :or_default to true or false. By
default, it is set to true, but if you want your search to behave more
like a typical web search engine, where every term must match, you
should set it to false.
The QueryParser handles parse errors for you by default. It does this
by trying to parse the query according to the grammar. If that fails, it
tries to parse it into a simple term or Boolean query, ignoring all query
language tokens. If it still can’t do that, it will return an empty
Boolean query. No exception will be thrown and the user will just see an
empty set of results. If you’d like to handle the parse errors yourself,
you can set the parameter :handle_parse_errors to false. You can then let
the user know that the query she entered was invalid.
Also, to make QueryParser more robust, it has a clean_string method
that makes sure brackets and quotes match up and that all special
characters within phrase strings are properly escaped. For example, the
following query:
(city:Braidwood AND shop:(Torpy's OR "Pig & Whistle
will be cleaned up as:
(city:Braidwood AND shop:(Torpy's OR "Pig \& Whistle"))
Perhaps you want to clean the query strings yourself or you would
prefer to have an exception raised if the query can’t be parsed. To do
this, set the :clean_string parameter to false.
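If you do turn :clean_string off and clean queries yourself, the general idea can be sketched in plain Ruby as follows. This is only an approximation of the behaviour described above, not Ferret's actual implementation: the clean_query_sketch name and the SPECIAL_IN_PHRASE character list are assumptions made for this example.

```ruby
# Characters escaped inside a quoted phrase in this sketch (an assumed list).
# | and <> are left alone because phrases use them for alternatives and gaps.
SPECIAL_IN_PHRASE = '&:[]{}!^~=*?+-'.freeze

def clean_query_sketch(str)
  out = +""
  open_brackets = 0
  in_quote = false
  escaped = false
  str.each_char do |ch|
    if escaped                 # copy a character that is already escaped
      out << ch
      escaped = false
      next
    end
    case ch
    when "\\"
      escaped = true
      out << ch
    when '"'
      in_quote = !in_quote
      out << ch
    when "("
      if in_quote
        out << "\\("
      else
        open_brackets += 1
        out << ch
      end
    when ")"
      if in_quote
        out << "\\)"
      elsif open_brackets > 0
        open_brackets -= 1
        out << ch
      end                      # an unmatched ) is silently dropped
    else
      out << "\\" if in_quote && SPECIAL_IN_PHRASE.include?(ch)
      out << ch
    end
  end
  out << '"' if in_quote       # close a dangling phrase
  out << ")" * open_brackets   # close any brackets still open
  out
end

puts clean_query_sketch(%q{(city:Braidwood AND shop:(Torpy's OR "Pig & Whistle})
```

The sketch closes the open phrase first and the open brackets afterwards, so the repaired string nests correctly.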
MultiTermQueries have a :max_terms property. The query parser takes
its default value for :max_terms from its :max_clauses parameter, so
setting :max_clauses changes that default. It also limits the number of
clauses you can add to a BooleanQuery.
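Pulling the parameters above together, a parser configuration might be collected in a single options hash. The option names come from Table 4-1 and the surrounding text; the field names and the chosen values are assumptions for illustration, and the Ferret calls are shown only as comments so that the snippet runs without the library installed.

```ruby
# Option names are taken from Table 4-1; the field names are hypothetical.
parser_options = {
  :default_field       => [:title, :content], # searched when no field is given
  :fields              => [:title, :content, :created_on],
  :validate_fields     => true,   # raise on unknown fields such as :contetn
  :or_default          => false,  # AND terms together, like a web search engine
  :wild_card_downcase  => true,   # suits a lowercasing analyzer
  :handle_parse_errors => false,  # let the application report bad queries
  :clean_string        => true,
  :max_clauses         => 512
}

# With Ferret loaded, the hash would be handed to the constructor, roughly:
#   parser = Ferret::QueryParser.new(parser_options)
#   query  = parser.parse('title:Ruby')
puts parser_options.keys.inspect
```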
The Ferret Query Language allows you to build most of the queries that you can build with Ruby code using just a simple query string. For simple queries, it matches what users have come to expect from years of using different search engines. But FQL allows you to build much more diverse queries than the usual search engine queries allow. FQL aims to be as concise as possible while still being readable and hopefully obvious to most users.
To express the simplest of all queries in FQL, simply type the term you wish to find:
'ferret'
When parsed by the QueryParser, this string will be translated
to a query that will search for the term “ferret” in all fields
specified by the :default_field parameter.
To constrain your search to a field other than the field(s)
specified by :default_field, prefix your search with the field name
followed by a colon. For example, if you want to search the :title
field for the term “Ruby”, you would do so like this:
'title:Ruby'
Searching multiple fields is easy, too. Simply separate field
names with a | character. So, to search the :title and :content fields
for “Ruby”, you would type:
'title|content:Ruby'
You can match all fields with the * character:
'*:Ruby'
That’s all there is to specifying the field to search. If you
want to search for documents that contain the term Ruby in both the
:title and :content fields, you will need to use a Boolean query.
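When query strings are assembled in application code, the field-prefix rules above are easy to wrap in a helper. The following is a minimal sketch; with_fields is a hypothetical name, not part of Ferret.

```ruby
# Prefix an FQL fragment with one or more field names joined by "|".
# Passing no fields leaves the fragment to the parser's default field(s).
def with_fields(fragment, *fields)
  return fragment if fields.empty?
  "#{fields.join('|')}:#{fragment}"
end

puts with_fields("Ruby", :title)            # title:Ruby
puts with_fields("Ruby", :title, :content)  # title|content:Ruby
puts with_fields("Ruby", "*")               # *:Ruby
```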
Most readers have used Boolean queries in search applications
before. The most common syntax makes use of the + and - characters, +
indicating terms that must occur and - indicating terms that must
not occur. So, to search for documents on “Ferret” that
preferably have the term “Ruby” and must not have the term “pet”, you
would type the following query:
'+Ferret Ruby -pet'
+ and - can also be written as “REQ” and “NOT”, respectively. Ferret
also supports the “AND” and “OR” keywords. “AND” has precedence over
“OR”, but this can be overridden with the use of parentheses: ( and ).
So, to search for a chocolate or caramel sundae, you’d type:
'(chocolate OR caramel) AND sundae'
This could also be written as:
'+(chocolate caramel) +sundae'
It’s just a matter of personal preference.
Field constraints can be applied to individual terms or whole Boolean queries wrapped in brackets:
'+flavour:(chocolate caramel) +name:sundae'
Inner field constraints override outer field constraints, so the following is equivalent to the previous query:
'name:(flavour:(chocolate caramel) AND sundae)'
As you would expect, in FQL phrase queries are identified by "
characters. So, to search for the phrase “quick brown fox”, your query
would be just that:
'"quick brown fox"'
But Ferret phrase queries offer a lot more. You can specify a list of options for a term in a phrase. Let’s say we don’t care if the fox is “red”, “orange”, or “brown”. You could search for the following phrase:
'"quick red|orange|brown fox"'
We could even accept absolutely anything in a term’s position. For example, the following would match “quick hungry fox”:
'"quick <> fox"'
In the “PhraseQuery” section earlier in this chapter, we also
discussed sloppy phrase queries. The phrase slop can be indicated using
the ~ character followed by an integer slop value. For example, the
following query would match the phrase “quick brown and white fox”:
'"quick fox"~3'
As with other types of query, phrase queries can have field constraints applied to them:
'content|title:"quick fox"~3'
Range queries can be specified in a couple of different ways.
The [] and {} brackets represent inclusive and exclusive limits,
respectively. This syntax is inherited from the Apache Lucene query
syntax. Let’s say you want all documents created on or after the 25th of
April, 2006, and before the 11th of November, 2006 (but not on that
day). You would specify the query like this:
'created_on:[20060425 20061111}'
In FQL, you can also express range queries that are bounded on only
one side. The open bounds are identified by the > and < tokens. For
example, if you want all documents created on or after the 25th of July,
2006, you would write the query like this:
'created_on:[20060725>'
To find all documents created before that date, you could type this:
'created_on:<20060725}'
Alternatively, you can use the >, <, >=, and <= tokens to specify
singly bounded range queries. The previous two queries would be,
respectively:
'created_on:>= 20060725'
'created_on:< 20060725'
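If the dates come from Ruby Date objects, the range syntax above can be generated rather than typed by hand. The sketch below assumes dates are indexed as YYYYMMDD strings, as in the examples; fql_range is a hypothetical helper name, not part of Ferret.

```ruby
require 'date'

# Build an FQL range fragment. With both bounds, use the bracket syntax
# ([ ] inclusive, { } exclusive); with one bound, use >=, >, <= or <.
def fql_range(field, from: nil, to: nil, include_from: true, include_to: false)
  raise ArgumentError, "need at least one bound" if from.nil? && to.nil?
  lower = from && from.strftime("%Y%m%d")
  upper = to && to.strftime("%Y%m%d")
  if lower && upper
    open_bracket  = include_from ? "[" : "{"
    close_bracket = include_to ? "]" : "}"
    "#{field}:#{open_bracket}#{lower} #{upper}#{close_bracket}"
  elsif lower
    "#{field}:#{include_from ? '>=' : '>'} #{lower}"
  else
    "#{field}:#{include_to ? '<=' : '<'} #{upper}"
  end
end

puts fql_range(:created_on, from: Date.new(2006, 4, 25), to: Date.new(2006, 11, 11))
puts fql_range(:created_on, from: Date.new(2006, 7, 25))
puts fql_range(:created_on, to: Date.new(2006, 7, 25))
```

Zero-padded YYYYMMDD strings sort in the same order as the dates they represent, which is what lets a term range stand in for a date range here.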
Wildcard queries in Ferret make use of the * and ? characters. Just
to reiterate what we covered in the “WildcardQuery” section earlier in
this chapter, * will match any number of characters, whereas ? will match
a single character only. So, the following query will match all documents
with the terms “lend”, “legend”, or “lead” in the :content field:
'content:l*e?d'
We can also use wildcard query syntax to create a few other
types of queries. For example, to create a MatchAllQuery, you would type:
'*'
Note that it makes no difference if we add a field constraint to this query. To find all documents with a price field, you might be tempted to type the following:
'price:*'
But this will match all documents. Instead, you need to type the following:
'price:?*'
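To check which terms a wildcard pattern would accept, the semantics described above (* for any number of characters, ? for exactly one) can be mimicked with a regular expression. This is an illustration in plain Ruby, not how Ferret evaluates wildcard queries; wildcard_to_regexp is a hypothetical helper.

```ruby
# Translate a wildcard pattern into an anchored Regexp:
# "*" becomes ".*", "?" becomes ".", everything else is matched literally.
def wildcard_to_regexp(pattern)
  body = pattern.each_char.map do |ch|
    case ch
    when "*" then ".*"
    when "?" then "."
    else Regexp.escape(ch)
    end
  end.join
  Regexp.new("\\A#{body}\\z")
end

re = wildcard_to_regexp("l*e?d")
%w[lend legend lead led].each do |term|
  puts "#{term}: #{re.match?(term)}"
end

# "?*" needs at least one character, whereas "*" also matches the empty string.
puts wildcard_to_regexp("?*").match?("")
puts wildcard_to_regexp("*").match?("")
```

The last two lines show why ?* is the pattern to use when a term must actually be present: it cannot match an empty value.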
You can also create prefix queries using wildcard syntax. Simply
type the prefix and append * to the end. For example:
'category:/programming/ruby/*'
This query will be optimized into a PrefixQuery by the QueryParser.
In the “FuzzyQuery” section earlier in this chapter, we said that
FuzzyQueries are to TermQueries as sloppy PhraseQueries are to standard
PhraseQueries, so it should come as no surprise that FuzzyQueries use
the same syntax as sloppy PhraseQueries. Instead of a “slop” integer,
however, we have a “similarity” float, which must be between 0.0 and
1.0. Another difference is that FuzzyQueries have a default similarity
of 0.5, so you don’t need to specify a similarity value at all. Let’s
say, for example, that we wish to find all documents containing the
commonly misspelled word “mischievous”:
'mischievous~'
Or we could make the query more strict by increasing the similarity value, like this:
'mischievous~0.8'
Remember that FuzzyQueries are expensive queries to use on a large
index, so you may want to set the default prefix length as described at
the end of the “FuzzyQuery” section.
Boosting queries in FQL is a simple matter of appending ^ and a boost
value to the query (see the “Boosting Queries” section earlier in this
chapter). For example, let’s go back to our Boolean search for “Ferret”
that prefers documents with the term “Ruby” and excludes those with the
term “pet”. In this case, “Ferret” is the most important term, so we
should boost it:
'+Ferret^10.0 Ruby^0.1 -pet'
Note that it makes no sense to boost negative clauses in a
Boolean query. We should also note that the boost comes after the slop
in a sloppy PhraseQuery and the similarity in a FuzzyQuery:
'"quick brown fox"~5^10.0 AND date:>=20060601'
'mischievous~0.8^10.0 AND date:>=20060601'
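As a closing illustration, the occurrence prefixes and boost suffix can be combined in a tiny query-string builder. This is a sketch in plain Ruby; fql_clause is a hypothetical helper, and it refuses to boost negative clauses for the reason given above.

```ruby
# Map an occurrence type to its FQL prefix.
OCCUR_PREFIX = { :must => "+", :should => "", :must_not => "-" }.freeze

# Build one clause: prefix, then the term (or phrase, or fuzzy term),
# then an optional ^boost, which always comes last.
def fql_clause(term, occur: :should, boost: nil)
  if boost && occur == :must_not
    raise ArgumentError, "boosting a negative clause has no effect"
  end
  suffix = boost ? "^#{boost}" : ""
  "#{OCCUR_PREFIX.fetch(occur)}#{term}#{suffix}"
end

query = [
  fql_clause("Ferret", occur: :must, boost: 10.0),
  fql_clause("Ruby", boost: 0.1),
  fql_clause("pet", occur: :must_not)
].join(" ")
puts query

# The boost follows the slop of a phrase or the similarity of a fuzzy term.
puts fql_clause('"quick brown fox"~5', boost: 10.0)
puts fql_clause("mischievous~0.8", boost: 10.0)
```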