QueryParser

You’ve now been introduced to all the different types of queries available in Ferret, and you’ve learned how to build them by hand. Some of it probably seems like a lot of work, and it’s certainly not something you’d ask a user to do. Luckily, we can leave most of the work to the Ferret QueryParser. You’ve already seen many examples of the Ferret Query Language (FQL) in the previous section (“Building Queries”), and you’ll have noticed that most of the queries you can build in code can be described much more easily in FQL. In this section, we’ll talk about setting up the QueryParser, and then we’ll go into more detail about FQL.

Setting Up the QueryParser

The QueryParser has a number of parameters, as shown in Table 4-1.

Table 4-1. QueryParser parameters

Parameter | Default | Short description
:default_field | :* | The default field to be searched; it can also be an array.
:analyzer | StandardAnalyzer | Analyzer used by the query parser to parse query terms.
:wild_card_downcase | true | Specifies whether wildcard queries should be downcased or not, since they are not analyzed by the parser.
:fields | [] | Lets the query parser know what fields are available for searching, particularly when :* is specified as the search field.
:validate_fields | false | Set to true if you want an exception to be raised if there is an attempt to search a nonexistent field.
:or_default | true | Use OR as the default Boolean operator.
:default_slop | 0 | Default slop to use in PhraseQueries.
:handle_parse_errors | true | QueryParser will quietly handle all parsing errors internally. If you’d like to handle them yourself, set this parameter to false.
:clean_string | true | QueryParser will quickly review the query string to make sure that quotes and brackets match up and special characters are escaped.
:max_clauses | 512 | The maximum number of clauses allowed in Boolean queries and the maximum number of terms allowed in multi, prefix, wildcard, or fuzzy queries.

The first thing you need to think about when setting up the QueryParser is which analyzer to use. Preferably, you should use the same analyzer you used to tokenize your documents during indexing. This analyzer will be used to analyze all terms before they are added to queries, except in the case of wildcard queries, since they’ll contain * and ?, which many analyzers won’t accept. Because of this, you’ll probably need to lowercase the wildcard query if the analyzer you used was a lowercasing analyzer. The exception to this rule is the use of wildcard queries on fields that are untokenized, in which case you might want to leave them as case-sensitive. To specify whether or not wildcard queries are lowercased, you need to set the parameter :wild_card_downcase. It is set to true by default.

The next thing you need to worry about is document fields. First of all, which fields are available to be searched? When the user specifies the field he wants to search, he can use an * to search all fields. For this to work, you need to set up the QueryParser so that it knows which fields are available. Simply set the parameter :fields to an array of field names. You can get the list of available field names from an IndexReader:

# Build the parser from the fields the index actually contains.
parser = QueryParser.new(:fields => reader.fields,
                         :tokenized_fields => reader.tokenized_fields)

The :fields parameter can be either a Symbol or an Array of Symbols. You can also set the QueryParser to validate all queries that use these fields. That is, each time a user selects a field to search, the query parser will check that that field is present in the @fields attribute, and if it isn’t, it will raise an exception. So, if your index has a :title and a :content field and the user tries to search in a :contetn field (note the misspelling), the QueryParser will raise an exception. To make the QueryParser validate fields, you need to set the :validate_fields parameter to true. It is set to false by default.
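The validation behavior described above can be pictured with a small pure-Ruby sketch. This is not Ferret’s implementation: the helper name check_field! and the choice of ArgumentError are assumptions made purely for illustration.

```ruby
# Hypothetical helper: mimics what :validate_fields => true is described as doing.
def check_field!(field, known_fields)
  # :* means "all fields", so it is always acceptable in this sketch.
  return field if field == :* || known_fields.include?(field)

  raise ArgumentError,
        "unknown field #{field.inspect}; expected one of #{known_fields.inspect}"
end

known = [:title, :content]
check_field!(:title, known)        # passes quietly

begin
  check_field!(:contetn, known)    # misspelled field name, so this raises
rescue ArgumentError => e
  puts e.message
end
```

With validation switched off, a misspelled field would simply find nothing, which is harder to diagnose than an exception.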

Once you have specified which fields are available, you need to designate which of those fields you want to be searched by default. Simply set the parameter :default_field to a single field name or an array of field names. You can even set it to the symbol :*, which will specify that you want to search all fields by default. :* is in fact the default value.

Next, you must decide if you want Boolean queries to be OR or AND by default. This involves setting :or_default to true or false. By default, it is set to true, but if you want to make your search more like a regular search engine, you should set it to false.

The QueryParser handles parse errors for you by default. It does this by trying to parse the query according to the grammar. If that fails, it tries to parse it into a simple term or Boolean query, ignoring all query language tokens. If it still can’t do that, it will return an empty Boolean query. No exception will be thrown and the user will just see an empty set of results. If you’d like to handle the parse errors yourself, you can set the parameter :handle_parse_errors to false. You can then let the user know that the query she entered was invalid.

Also, to make QueryParser more robust, it has a clean_string method that basically makes sure brackets and quotes match up and that all special characters within phrase strings are properly escaped. For example, the following query:

(city:Braidwood AND shop:(Torpy's OR "Pig & Whistle

will be cleaned up as:

(city:Braidwood AND shop:(Torpy's OR "Pig \& Whistle"))

Perhaps you want to clean the query strings yourself or you would prefer to have an exception raised if the query can’t be parsed. To do this, set the :clean_string parameter to false.
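If you do decide to clean query strings yourself, the following pure-Ruby sketch shows the general idea. It is a simplified illustration, not Ferret’s clean_string code: the method name tidy_query and the short list of characters escaped inside phrases are assumptions.

```ruby
# Characters escaped inside a phrase in this sketch (an assumed, partial list).
SPECIAL_IN_PHRASE = %w[& : ( ) [ ] { }].freeze

# Hypothetical helper: balance quotes and parentheses, escape specials in phrases.
def tidy_query(str)
  out = String.new
  depth = 0
  in_phrase = false

  str.each_char do |ch|
    if ch == '"'
      in_phrase = !in_phrase
      out << ch
    elsif in_phrase
      out << "\\" if SPECIAL_IN_PHRASE.include?(ch)
      out << ch
    else
      if ch == "("
        depth += 1
      elsif ch == ")"
        next if depth.zero?        # drop a closing bracket that has no opener
        depth -= 1
      end
      out << ch
    end
  end

  out << '"' if in_phrase          # close an unterminated phrase
  out << ")" * depth               # close any brackets still open
  out
end

puts tidy_query('(city:Braidwood AND shop:(Torpy\'s OR "Pig & Whistle')
```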

Because MultiTermQueries have a :max_terms property, you can set the default value used for :max_terms by the query parser by setting its :max_clauses parameter. This will also affect the maximum number of clauses you can add to a BooleanQuery.
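Pulling the settings from this section together, a parser configuration might look something like the sketch below. The option names are the ones discussed above; the values, the field names, and the constant name PARSER_OPTIONS are only examples. The analyzer is left out because it needs a Ferret object, and the final call is shown as a comment so the snippet runs without Ferret loaded.

```ruby
# Example settings only; adjust the field names to match your own index.
PARSER_OPTIONS = {
  :default_field       => [:title, :content],  # searched when no field is given
  :fields              => [:title, :content, :created_on],
  :validate_fields     => true,   # raise on a misspelled field name
  :or_default          => false,  # AND terms together, like most search engines
  :wild_card_downcase  => true,   # suits a lowercasing analyzer
  :handle_parse_errors => false,  # let parse errors reach our own code
  :clean_string        => true,   # still tidy quotes and brackets first
  :max_clauses         => 1024    # raise the limit on clauses and expanded terms
}.freeze

# With Ferret loaded, the hash would be handed straight to the parser:
#   parser = Ferret::QueryParser.new(PARSER_OPTIONS)
puts PARSER_OPTIONS.keys.inspect
```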

Ferret Query Language

The Ferret Query Language allows you to build most of the queries that you can build with Ruby code using just a simple query string. For simple queries, it matches what users have come to expect from years of using different search engines. But FQL allows you to build much more diverse queries than the usual search engine queries allow. FQL aims to be as concise as possible while still being readable and hopefully obvious to most users.

TermQuery

To express the simplest of all queries in FQL, simply type the term you wish to find:

            'ferret'

When parsed by the QueryParser, this string will be translated to a query that will search for the term “ferret” in all fields specified by the :default_field parameter.

To constrain your search to a field other than the field(s) specified by :default_field, prefix your search with the field name followed by a colon. For example, if you want to search the :title field for the term “Ruby”, you would do so like this:

            'title:Ruby'

Searching multiple fields is easy, too. Simply separate field names with a | character. So, to search :title and :content fields for “Ruby”, you would type:

            'title|content:Ruby'

You can match all fields with the * character:

            '*:Ruby'

That’s all there is to specifying the field to search. If you want to search for documents that contain the term Ruby in both the :title and :content fields, you will need to use a Boolean query.
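To make the field prefix rules concrete, here is a small pure-Ruby sketch of how a single-term query string could be split into field names and a term. The helper split_field_spec is hypothetical and handles only the simple cases shown in this section; the real parser works from a full grammar.

```ruby
# Hypothetical helper: split 'field1|field2:term' into field names and the term.
def split_field_spec(query, default_fields)
  prefix, term = query.split(":", 2)
  # No colon means no field prefix, so fall back to the default field(s).
  return [Array(default_fields), query] if term.nil?

  [prefix.split("|").map(&:to_sym), term]
end

p split_field_spec("ferret", :*)
p split_field_spec("title:Ruby", :*)
p split_field_spec("title|content:Ruby", :*)
p split_field_spec("*:Ruby", [:title])
```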

BooleanQuery

Most readers have used Boolean queries in search applications before. The most common syntax makes use of the + and - characters, with + indicating terms that must occur and - indicating terms that must not occur. So, to search for documents on “Ferret” that preferably have the term “Ruby” and must not have the term “pet”, you would type the following query:

            '+Ferret Ruby -pet'

+ and - can also be rewritten as “REQ” and “NOT”, respectively. Ferret also supports the “AND” and “OR” keywords. “AND” has precedence over “OR”, but this can be overridden with the use of parentheses: ( and ). So, to search for a chocolate or caramel sundae, you’d type:

            '(chocolate OR caramel) AND sundae'

This could also be written as:

            '+(chocolate caramel) +sundae'

It’s just a matter of personal preference.
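The meaning of required, optional, and prohibited terms can be sketched in plain Ruby. The helpers below are hypothetical and only understand flat queries made of +term, -term, and bare terms. The lowercasing and the rule that a query with no required terms needs at least one optional match are assumptions about typical behavior, not a description of Ferret’s internals.

```ruby
# Hypothetical helper: sort the terms of a flat query into three groups.
def classify_clauses(query)
  clauses = { :must => [], :should => [], :must_not => [] }
  query.split.each do |token|
    case token[0]
    when "+" then clauses[:must]     << token[1..-1].downcase
    when "-" then clauses[:must_not] << token[1..-1].downcase
    else          clauses[:should]   << token.downcase
    end
  end
  clauses
end

# Does a document (given as a list of its terms) satisfy the clauses?
def matches?(doc_terms, clauses)
  return false if clauses[:must_not].any? { |t| doc_terms.include?(t) }
  return false unless clauses[:must].all? { |t| doc_terms.include?(t) }

  # With no required terms, at least one optional term has to be present.
  clauses[:must].any? || clauses[:should].any? { |t| doc_terms.include?(t) }
end

clauses = classify_clauses("+Ferret Ruby -pet")
puts matches?(%w[ferret ruby search], clauses)
puts matches?(%w[ferret pet care], clauses)
```

In a real index the optional term “Ruby” would also raise the score of the documents that contain it; this sketch only answers whether a document matches at all.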

Field constraints can be applied to individual terms or whole Boolean queries wrapped in brackets:

            '+flavour:(chocolate caramel) +name:sundae'

Inner field constraints override outer field constraints, so the following is equivalent to the previous query:

            'name:(flavour:(chocolate caramel) AND sundae)'

PhraseQuery

As you would expect, in FQL phrase queries are identified by " characters. So, to search for the phrase “quick brown fox”, your query would be just that:

            '"quick brown fox"'

But Ferret phrase queries offer a lot more. You can specify a list of options for a term in a phrase. Let’s say we don’t care if the fox is “red”, “orange”, or “brown”. You could search for the following phrase:

            '"quick red|orange|brown fox"'

We could even accept absolutely anything in a term’s position. For example, the following would match “quick hungry fox”:

            '"quick <> fox"'
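The alternatives and the <> placeholder can be pictured as a list of acceptable terms for each position in the phrase. The pure-Ruby sketch below matches such a phrase against an already tokenized document. The helper names are hypothetical, and exact adjacency is assumed (no slop).

```ruby
# Each phrase position becomes a list of acceptable terms; nil accepts anything.
def parse_phrase(phrase)
  phrase.split.map { |slot| slot == "<>" ? nil : slot.split("|") }
end

# True if some run of consecutive tokens satisfies every position of the phrase.
def phrase_match?(tokens, phrase)
  slots = parse_phrase(phrase)
  tokens.each_cons(slots.length).any? do |window|
    window.zip(slots).all? do |token, options|
      options.nil? || options.include?(token)
    end
  end
end

puts phrase_match?(%w[the quick orange fox], "quick red|orange|brown fox")
puts phrase_match?(%w[a quick hungry fox], "quick <> fox")
puts phrase_match?(%w[a quick fox], "quick <> fox")
```

Note that in this sketch the placeholder still occupies a position, so “quick fox” with nothing in between does not match.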

In the “PhraseQuery” section earlier in this chapter, we also discussed sloppy phrase queries. The phrase slop can be indicated using the ~ character followed by an integer slop value. For example, the following query would match the phrase “quick brown and white fox”:

            '"quick fox"~3'

As with other types of query, phrase queries can have field constraints applied to them:

            'content|title:"quick fox"~3'
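To see why a slop of 3 is enough for “quick brown and white fox”, consider the simplified model sketched below. It only looks at matches where the terms appear in their original order, and measures how far the terms have drifted from their expected relative positions. The helper names are hypothetical, and Ferret’s actual slop calculation may differ in detail (for example, for reordered terms).

```ruby
# Smallest slop needed for the phrase terms to match in order, or nil if they
# cannot. Simplified model: out-of-order matches are not considered.
def min_slop(tokens, phrase_terms)
  positions = phrase_terms.map do |term|
    tokens.each_index.select { |i| tokens[i] == term }
  end
  return nil if positions.any?(&:empty?)

  first, *rest = positions
  best = nil
  first.product(*rest).each do |combo|
    next unless combo.each_cons(2).all? { |a, b| a < b }   # keep term order
    # Subtract each term's offset in the phrase; an exact match gives equal values.
    shifted = combo.each_with_index.map { |pos, offset| pos - offset }
    width = shifted.max - shifted.min
    best = width if best.nil? || width < best
  end
  best
end

def sloppy_match?(tokens, phrase_terms, slop)
  needed = min_slop(tokens, phrase_terms)
  !needed.nil? && needed <= slop
end

doc = %w[quick brown and white fox]
puts min_slop(doc, %w[quick fox])
puts sloppy_match?(doc, %w[quick fox], 3)
puts sloppy_match?(doc, %w[quick fox], 2)
```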

RangeQuery

Range queries can be specified in a couple of different ways. The [] and {} brackets represent inclusive and exclusive limits, respectively. This syntax is inherited from the Apache Lucene query syntax. Let’s say you want all documents created on or after the 25th of April, 2006, and before the 11th of November (but not on that day). You would specify the query like this:

            'created_on:[20060425 20061111}'

In FQL, you can also express range queries that are bounded on only one side. The open bound is identified by the > or < token. For example, if you want all documents created on or after the 25th of July, 2006, you would write the query like this:

            'created_on:[20060725>'

To find all documents created before that date, you could type this:

            'created_on:<20060725}'

Alternatively, you can use the >, <, >=, and <= tokens to specify singly bounded range queries. The previous two queries would be, respectively:

            'created_on:>= 20060725'
            'created_on:< 20060725'
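The different range notations describe the same small set of facts: an optional lower bound, an optional upper bound, and whether each bound is inclusive. The pure-Ruby sketch below makes that explicit. The helpers parse_range and in_range? are hypothetical, and values are compared as strings, which is an assumption that works for fixed-width values such as YYYYMMDD dates.

```ruby
# Hypothetical helper: turn the range part of a query into bounds and flags.
def parse_range(spec)
  case spec.strip
  when /\A([\[{])\s*(\S+)\s+(\S+)\s*([\]}])\z/       # [a b]  {a b}  [a b}  {a b]
    { :lower => $2, :upper => $3,
      :include_lower => $1 == "[", :include_upper => $4 == "]" }
  when /\A([\[{])\s*(\S+?)\s*>\z/                    # [a>  {a>
    { :lower => $2, :upper => nil,
      :include_lower => $1 == "[", :include_upper => false }
  when /\A<\s*(\S+?)\s*([\]}])\z/                    # <b]  <b}
    { :lower => nil, :upper => $1,
      :include_lower => false, :include_upper => $2 == "]" }
  when /\A(>=|<=|>|<)\s*(\S+)\z/                     # >= a   > a   <= b   < b
    op, value = $1, $2
    if op.start_with?(">")
      { :lower => value, :upper => nil,
        :include_lower => op == ">=", :include_upper => false }
    else
      { :lower => nil, :upper => value,
        :include_lower => false, :include_upper => op == "<=" }
    end
  end
end

# String comparison, so this relies on fixed-width values.
def in_range?(range, value)
  lo, hi = range[:lower], range[:upper]
  above = lo.nil? || (range[:include_lower] ? value >= lo : value > lo)
  below = hi.nil? || (range[:include_upper] ? value <= hi : value < hi)
  above && below
end

range = parse_range("[20060425 20061111}")
puts in_range?(range, "20060425")
puts in_range?(range, "20061111")
```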

WildcardQuery

Wildcard queries in Ferret make use of the * and ? characters. Just to reiterate what we covered in the “WildcardQuery” section earlier in this chapter, * will match any number of characters, whereas ? will match a single character only. So, the following query will match all documents with the terms “lend”, “legend”, or “lead” in the :content field:

            'content:l*e?d'
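One way to check which terms a wildcard pattern accepts is to translate it into a regular expression. The helper wildcard_to_regexp below is a hypothetical pure-Ruby illustration of the matching rules, not the way Ferret evaluates wildcard queries against its term index.

```ruby
# Hypothetical helper: * becomes "any run of characters", ? becomes "one character".
def wildcard_to_regexp(pattern)
  body = pattern.each_char.map do |ch|
    case ch
    when "*" then ".*"
    when "?" then "."
    else Regexp.escape(ch)
    end
  end.join
  Regexp.new("\\A#{body}\\z")   # anchor so the whole term has to match
end

matcher = wildcard_to_regexp("l*e?d")
%w[lend legend lead led blend].each do |term|
  puts "#{term}: #{matcher.match?(term)}"
end
```

“led” fails because ? insists on exactly one character between the “e” and the “d”, and “blend” fails because the pattern has to match the whole term from its first character.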

We can also use wildcard query syntax to create a few other types of queries. For example, to create a MatchAllQuery, you would type:

            '*'

Note that it makes no difference if we add a field constraint to this query. To find all documents with a price field, you might be tempted to type the following:

            'price:*'

But this will match all documents. Instead, you need to type the following:

            'price:?*'

You can also create prefix queries using wildcard syntax. Simply type the prefix and append * to the end. For example:

            'category:/programming/ruby/*'

This query will be optimized into a PrefixQuery by the QueryParser.

FuzzyQuery

In the “FuzzyQuery” section earlier in this chapter, we said that FuzzyQueries are to TermQueries as sloppy PhraseQueries are to standard PhraseQueries, so it should come as no surprise that FuzzyQueries use the same syntax as sloppy PhraseQueries. Instead of a “slop” integer, however, we have a “similarity” float, which must be between 0.0 and 1.0. Another difference is that FuzzyQueries have a default similarity of 0.5, so you don’t need to specify a similarity value at all. Let’s say, for example, that we wish to find all documents containing the commonly misspelled word “mischievous”:

            'mischievous~'

Or we could make the query more strict by increasing the similarity value, like this:

            'mischievous~0.8'
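To get a feel for what a similarity of 0.8 means, the sketch below scores two terms using edit distance. The formula, 1 minus the edit distance divided by the length of the shorter term, is an assumption modeled on the usual Lucene-style calculation. Ferret’s exact scoring, including its handling of the prefix length, may differ.

```ruby
# Classic dynamic-programming edit distance (insert, delete, substitute).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed formula: 1 - distance / length of the shorter term.
def similarity(a, b)
  1.0 - levenshtein(a, b).to_f / [a.length, b.length].min
end

%w[mischevious mischievious mysterious].each do |candidate|
  puts format("%-13s %.3f", candidate, similarity("mischievous", candidate))
end
```

Under this model, a candidate term is accepted when its score is at least the similarity given after the ~ character.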

Remember that FuzzyQueries are expensive queries to use on a large index, so you may want to set the default prefix length as described at the end of the “FuzzyQuery” section.

Boosting a query in FQL

Boosting a query in FQL is a simple matter of appending ^ and a boost value to it (see the “Boosting Queries” section earlier in this chapter). For example, let’s go back to our Boolean search for “Ferret” where the results preferably included the term “Ruby” but not the term “pet”. In this case, “Ferret” is the most important term, so we should boost it:

            '+Ferret^10.0 Ruby^0.1 -pet'

Note that it makes no sense to boost negative clauses in a Boolean query. We should also note that the boost comes after the slop in a sloppy PhraseQuery and after the similarity in a FuzzyQuery:

            '"quick brown fox"~5^10.0 AND date:>=20060601'
            'mischievous~0.8^10.0 AND date:>=20060601'
