TokenStream
A TokenStream takes a field and turns it into a list of tokens. To implement a TokenStream, you need to supply two methods. TokenStream#next should return Tokens in order, followed by nil when there are no more tokens left in the field. TokenStream#text= is used to set the text that the TokenStream will analyze. In Ferret, there are two types of TokenStreams: Tokenizers and TokenFilters.
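The two-method contract described above can be sketched in plain Ruby. This is only an illustration, not Ferret's implementation: SketchToken and SketchWhiteSpaceStream are hypothetical names, the token fields mirror the ones used elsewhere in this section (text, start, end, pos_inc), and the offsets here are character offsets, which is an assumption of the sketch.

```ruby
# Hypothetical stand-in for a token object, so the sketch runs without Ferret.
SketchToken = Struct.new(:text, :start, :end, :pos_inc)

# A toy stream that splits on whitespace and follows the two-method contract:
# text= sets the input, next returns tokens in order and then nil.
class SketchWhiteSpaceStream
  def initialize(text = "")
    self.text = text
  end

  # Set the text to analyze and rewind to the beginning.
  def text=(text)
    @text = text
    @pos = 0
  end

  # Return the next token, or nil once the text is exhausted.
  def next
    match = /\S+/.match(@text, @pos)
    return nil unless match
    @pos = match.end(0)
    SketchToken.new(match[0], match.begin(0), match.end(0), 1)
  end
end

stream = SketchWhiteSpaceStream.new("one two")
while (t = stream.next)
  puts "#{t.start}-#{t.end}: #{t.text}"
end
# prints:
# 0-3: one
# 4-7: two
```

Because text= rewinds the position, the same stream object can be reused for another piece of text by assigning to it again.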
In the next two sections, we’ll make use of the following test code to test each TokenStream, printing the tokens in a table:
def test_token_stream(token_stream)
  puts "\033[32mStart | End | PosInc | Text\033[m"
  while t = token_stream.next
    puts "%5d |%4d |%5d | %s" % [t.start, t.end, t.pos_inc, t.text]
  end
end
Tokenizer
Tokenizers take the raw text data from a field and turn it into a list of Tokens. Ferret comes with a number of tokenizer implementations, including:
WhiteSpaceTokenizer (and AsciiWhiteSpaceTokenizer)
LetterTokenizer (and AsciiLetterTokenizer)
StandardTokenizer (and AsciiStandardTokenizer)
RegExpTokenizer
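To get a feel for how the tokenizer families listed above differ, here is a rough approximation in plain Ruby. It does not call Ferret. The three helper methods are hypothetical, and the regular expressions are assumptions about what "whitespace-delimited" and "runs of letters" mean, so the real tokenizers may differ in detail.

```ruby
# Rough stand-ins for illustration only; these are not Ferret's tokenizers.

# Split on whitespace, keeping punctuation attached to the words.
def whitespace_tokens(text)
  text.scan(/\S+/)
end

# Keep only runs of ASCII letters.
def ascii_letter_tokens(text)
  text.scan(/[A-Za-z]+/)
end

# Keep runs of letters, including non-ASCII ones (Unicode letter class).
def letter_tokens(text)
  text.scan(/\p{L}+/)
end

sample = "Don't stop, caf\u00e9!"
p whitespace_tokens(sample)
p ascii_letter_tokens(sample)
p letter_tokens(sample)
```

In this sketch the whitespace version keeps "stop," with its comma, both letter versions split "Don't" at the apostrophe, and the ASCII-only version cuts the last word short at the accented character.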
Where an ASCII tokenizer exists, the non-ASCII tokenizer is locale-sensitive. That means that the tokenizer will recognize letters, numbers, and whitespace as specified by your locale. If your locale is set to UTF-8, then the tokenizer will recognize UTF-8 characters. This means you need to make sure that the data you are feeding Ferret is in the correct encoding according to your locale; otherwise, you could wind up running into some strange errors. The ASCII tokenizers ...