Ferret by David Balmain

Chapter 5. Analysis

Analysis is the foundation of any search library. It is the process of taking an input field and breaking it up into tokens to be added to the inverted index. So, why did we wait until now to cover this important subject? Most of the time, Ferret’s standard analyzer will do exactly what you need it to do. However, when it doesn’t, Ferret’s analysis API is very easy to extend to your needs. To understand the analysis API, you need to know about three classes:

  • Token

  • TokenStream

  • Analyzer

Token

The Token is the basic datatype in analysis. It is essentially a Struct with four attributes:

  • Text

  • Start offset

  • End offset

  • Position increment

The text attribute is a String holding the token’s text. Ferret allows tokens up to 255 bytes long; any text longer than that is truncated to that length.
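As a rough illustration of the four attributes and the 255-byte limit, here is a minimal sketch in plain Ruby. It does not use Ferret itself: `SketchToken`, `MAX_TOKEN_BYTES`, and `make_token` are hypothetical names invented for this example, and cutting at a raw byte boundary is an assumption about how the truncation behaves.

```ruby
# Hypothetical stand-in for a token; this is NOT Ferret's own Token class.
SketchToken = Struct.new(:text, :start_offset, :end_offset, :pos_inc)

# Limit stated in the text above. The constant name is an assumption.
MAX_TOKEN_BYTES = 255

# Build a token, truncating the text to at most MAX_TOKEN_BYTES bytes.
# Assumption: truncation is a plain byte cut, so a multibyte character
# at the boundary could be split.
def make_token(text, start_offset, end_offset, pos_inc = 1)
  SketchToken.new(text.byteslice(0, MAX_TOKEN_BYTES),
                  start_offset, end_offset, pos_inc)
end

short = make_token("ferret", 0, 6)
long  = make_token("x" * 300, 0, 300)

puts short.text           # the text is kept unchanged
puts long.text.bytesize   # never more than MAX_TOKEN_BYTES
```

Note that in this sketch the offsets of the long token still describe the full 300-byte span in the field; only the stored text is shortened.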

The start and end offsets hold the byte positions of the start and end of the token in the original field, with the end offset pointing at the byte immediately after the last byte in the token. For example, in the string “The Old Man and the Sea”, the “Old” token has a start offset of 4 and an end offset of 7. The difference between the start offset and the end offset is usually equal to the length of the token’s text, but not always. For example, Ferret’s standard analyzer strips possessives (’s). In the field “Jamie’s Kitchen”, the first token will be “Jamie”, but its start and end offsets will be 0 and 7, respectively, so they also encompass the possessive “’s”. This makes ...
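The offset arithmetic in the two examples above can be checked with plain Ruby string operations. This is only a sketch: `byte_span` is a hypothetical helper, not part of Ferret, and the possessive example assumes a single-byte ASCII apostrophe so that “Jamie's” occupies 7 bytes.

```ruby
# Hypothetical helper: byte offsets [start, end) of the first occurrence
# of `needle` in `field`, with end being one past the last byte.
def byte_span(field, needle)
  char_index = field.index(needle)
  return nil if char_index.nil?
  # Convert the character index to a byte offset.
  start_offset = field[0, char_index].bytesize
  [start_offset, start_offset + needle.bytesize]
end

# "Old" starts after the 4 bytes of "The " and is 3 bytes long.
old_span = byte_span("The Old Man and the Sea", "Old")

# Possessive case: the token text is "Jamie", but the offsets cover
# the whole original span "Jamie's" (ASCII apostrophe assumed).
possessive_span = byte_span("Jamie's Kitchen", "Jamie's")
token_text = "Jamie"

p old_span          # [4, 7]
p possessive_span   # [0, 7]
# Span length (7) differs from the token text length (5).
p possessive_span[1] - possessive_span[0], token_text.bytesize
```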
