Analysis is the foundation of any search library. It is the process of taking an input field and breaking it up into tokens to be added to the inverted index. So, why did we wait until now to cover this important subject? Most of the time, Ferret’s standard analyzer will do exactly what you need it to do. However, when it doesn’t, Ferret’s analysis API is very easy to extend to your needs. To understand the analysis API, you need to know about three classes:
Token is the basic datatype in analysis. It is essentially a simple data structure with four attributes:
The text attribute is a String holding the token’s text. Ferret allows tokens of up to 255 bytes; anything longer is truncated to that length.
The start and end offsets hold the byte positions of the start and end of the token in the original field; the end offset is the position of the byte immediately after the last byte in the token. For example, in the string “The Old Man and the Sea”, the “Old” token has a start offset of 4 and an end offset of 7. The difference between the two offsets is usually equal to the length of the token’s text, but not always. For example, Ferret’s standard analyzer strips possessives (’s). In the field “Jamie’s Kitchen”, the first token will be “Jamie”, but its start and end offsets will be 0 and 7, respectively, so that they also encompass the possessive “’s”. This makes ...
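To make the offset arithmetic concrete, here is a minimal sketch in plain Ruby that reproduces the two examples above. It does not use Ferret at all: `SketchToken`, `sketch_tokenize`, and `MAX_TOKEN_BYTES` are hypothetical names invented for illustration, and the whitespace splitting and possessive stripping are deliberate simplifications, not a description of how Ferret's standard analyzer is implemented. The example uses an ASCII apostrophe so that "Jamie's" is exactly 7 bytes.

```ruby
# Hypothetical stand-in for illustration only; not Ferret's Token class.
SketchToken = Struct.new(:text, :start_offset, :end_offset)

# Byte limit on token text, as stated in the text above.
MAX_TOKEN_BYTES = 255

# Splits a field on whitespace and strips a trailing possessive ('s)
# from each token's text, while the byte offsets keep covering the
# whole original word. Truncation is a plain byte slice, which is a
# simplification (it could split a multi-byte character).
def sketch_tokenize(field)
  tokens = []
  field.scan(/\S+/) do
    match = Regexp.last_match
    word = match[0]
    start_offset = match.pre_match.bytesize    # bytes before the word
    end_offset = start_offset + word.bytesize  # byte just past the word
    text = word.sub(/'s\z/, "").byteslice(0, MAX_TOKEN_BYTES)
    tokens << SketchToken.new(text, start_offset, end_offset)
  end
  tokens
end

old = sketch_tokenize("The Old Man and the Sea")[1]
puts "#{old.text} #{old.start_offset} #{old.end_offset}"       # Old 4 7

jamie = sketch_tokenize("Jamie's Kitchen")[0]
puts "#{jamie.text} #{jamie.start_offset} #{jamie.end_offset}" # Jamie 0 7
```

In the second case the text is 5 bytes long while the offsets span 7 bytes, which is exactly the mismatch between text length and offset difference described above.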