Query highlighting, like excerpting, is one of the newer features in Ferret, added in version 0.10. Highlighting takes a query and returns the data from a document field with all of the matches in the field highlighted. Excerpting, on the other hand, takes excerpts from the field, preferably with matching terms, and highlights the terms in those excerpts. Both the Ferret::Search::Searcher and Ferret::Index::Index classes have a highlight method. In this section, we’ll look at Index#highlight because it allows us to pass string queries instead of having to build Query objects. Otherwise, both methods are essentially the same. To use the highlight method, you must supply a query and the document ID of the document you wish to highlight. A number of other parameters, listed in Table 4-3, can be used to describe exactly how you want to highlight the field.
Table 4-3. Index#highlight parameters
Parameter | Description |
---|---|
:field | Defaults to @options[:default_field]. The highlighter only works on one field at a time, so you need to specify which field you want to highlight. If you want to highlight multiple fields, you'll need to call this method multiple times. |
:excerpt_length | Defaults to 150 bytes. This parameter specifies the length of excerpt to show. The algorithm for extracting excerpts attempts to fit as many matched terms into each excerpt as possible. If you’d simply like the complete field back with all matches highlighted, set this parameter to :all. |
:num_excerpts | Specifies the number of excerpts you wish to retrieve. This defaults to 2, unless :excerpt_length is set to :all, in which case :num_excerpts is automatically set to 1. |
:pre_tag | To highlight matches, you need to specify short strings to place before and after matches. :pre_tag defaults to <b>, which is fine when printing HTML, but if you are printing results to the console, we recommend using something like \033[36m. |
:post_tag | Defaults to </b>. This tag should close whatever you specified in :pre_tag. Try \033[m for console applications. |
:ellipsis | Defaults, funnily enough, to ... . This is the string added to the beginning and end of excerpts where they break in the middle of a field. Alternatively, you may want to use the HTML entity &hellip; or the UTF-8 string \342\200\246. |
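To make the defaults in Table 4-3 concrete, here is a minimal pure-Ruby sketch of how they might combine. It does not use Ferret at all: HIGHLIGHT_DEFAULTS, resolve_highlight_options, and naive_highlight are hypothetical names invented for illustration, and the regular-expression matching is only a stand-in for Ferret's query-based matching on analyzed tokens.

```ruby
# Defaults as listed in Table 4-3. :field is left out because its default
# depends on the index's own :default_field option.
HIGHLIGHT_DEFAULTS = {
  :excerpt_length => 150,
  :num_excerpts   => 2,
  :pre_tag        => "<b>",
  :post_tag       => "</b>",
  :ellipsis       => "..."
}

# Hypothetical helper (not part of Ferret): merge caller options over the
# defaults and apply the rule that :all implies a single excerpt.
def resolve_highlight_options(options = {})
  resolved = HIGHLIGHT_DEFAULTS.merge(options)
  resolved[:num_excerpts] = 1 if resolved[:excerpt_length] == :all
  resolved
end

# Hypothetical helper (not part of Ferret): wrap every case-insensitive,
# whole-word occurrence of the given terms in the pre and post tags.
def naive_highlight(text, terms, options = {})
  opts = resolve_highlight_options(options)
  pattern = Regexp.union(terms.map { |t| /\b#{Regexp.escape(t)}\b/i })
  text.gsub(pattern) { |match| "#{opts[:pre_tag]}#{match}#{opts[:post_tag]}" }
end

puts naive_highlight("Early to bed and early to rise", ["early", "bed"])
# prints: <b>Early</b> to <b>bed</b> and <b>early</b> to rise
```

Passing :pre_tag => "\033[36m" and :post_tag => "\033[m" to the same helper would color the matches on a terminal instead of producing HTML.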
The highlight method returns an array of strings, one per extracted excerpt. Example 4-1 demonstrates the flexibility of Ferret’s highlighting. We store the optional parameters in a hash to avoid specifying them for each call to the highlight method. We also use a stemming analyzer (MyAnalyzer, which passes the StandardAnalyzer token stream through a StemFilter) to demonstrate that phrases don’t need to be exact to match. Don’t worry about how this works just yet. You’ll learn more about analysis in the next chapter.
Example 4-1. Query highlighter

    require 'rubygems'
    require 'ferret'

    class MyAnalyzer < Ferret::Analysis::StandardAnalyzer
      def token_stream(field, input)
        Ferret::Analysis::StemFilter.new(super)
      end
    end

    index = Ferret::I.new(:analyzer => MyAnalyzer.new)

    index << {
      :title => "Mark Twain Excerpts",
      :content => <<-EOF
    If it had not been for him, with his incendiary "Early to bed and
    early to rise," and all that sort of foolishness, I wouldn't have
    been so harried and worried and raked out of bed at such unseemly
    hours when I was young. The late Franklin was well enough in his
    way; but it would have looked more dignified in him to have gone on
    making candles and letting other people get up when they wanted to.
    - Letter from Mark Twain, San Francisco Alta California, July 25, 1869
    When one receives a letter from a great man for the first time in
    his life, it is a large event to him, as all of you know by your own
    experience. You never can receive letters enough from famous men
    afterward to obliterate that one, or dim the memory of the pleasant
    surprise it was, and the gratification it gave you.
    - Mark Twain's Speeches, "Unconscious Plagiarism"
      EOF
    }

    options = {
      :field => :content,
      :pre_tag => "\033[36m",
      :post_tag => "\033[m",
      :ellipsis => "\342\200\246"
    }

    query = '"Early <> Bed" "receive letter"~1 Twain early'

    puts "_" * 60 + "\n\t*** Extract two excerpts ***\n\n"
    puts index.highlight(query, 0, options)

    puts "_" * 60 + "\n\t*** Extract four smaller excerpts ***\n\n"
    options[:num_excerpts] = 4
    options[:excerpt_length] = 50
    puts index.highlight(query, 0, options)

    puts "_" * 60 + "\n\t*** Highlight the entire field ***\n\n"
    options[:excerpt_length] = :all
    puts index.highlight(query, 0, options)

You’ll notice that the second call, which asks for four excerpts of 50 bytes each, actually returns two excerpts of 50 bytes and one of 100 bytes. The excerpting algorithm works by attempting to place the excerpts so that the maximum number of matched terms will be shown. If it can concatenate two or more excerpts without reducing the number of matched terms shown, it will.
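The concatenation step can be pictured with a small pure-Ruby sketch. This is not Ferret's implementation: merge_adjacent_excerpts is a hypothetical helper, the byte offsets are made up, and the sketch assumes the excerpt positions have already been chosen. It only shows how two neighboring 50-byte windows could collapse into one 100-byte excerpt.

```ruby
# Hypothetical sketch, not Ferret's algorithm. Each excerpt is a
# [start, length] pair of byte offsets into the field.
def merge_adjacent_excerpts(excerpts)
  sorted = excerpts.sort_by { |start, _length| start }
  sorted.each_with_object([]) do |(start, length), merged|
    last = merged.last
    if last && start <= last[0] + last[1]
      # This window touches or overlaps the previous one, so extend it.
      new_end = [last[0] + last[1], start + length].max
      last[1] = new_end - last[0]
    else
      # Push a fresh pair so the caller's arrays are never mutated.
      merged << [start, length]
    end
  end
end

# Four 50-byte windows (offsets invented for illustration); the second and
# third sit side by side.
windows = [[0, 50], [200, 50], [250, 50], [420, 50]]
p merge_adjacent_excerpts(windows)
# prints: [[0, 50], [200, 100], [420, 50]]
```

Merging costs nothing in matched terms, because the combined excerpt covers exactly the same bytes, and it avoids printing an ellipsis between two fragments that are contiguous in the field.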