Tokenizers

The tokenizer in an analyzer receives the character stream output by the character filters and splits it into a token stream, which becomes the input to the token filters. Elasticsearch supports three types of tokenizer:
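To make the flow of data through an analyzer concrete, here is a minimal Python sketch of the three stages. It is not Elasticsearch code: the `analyze` function and the sample filters are hypothetical stand-ins chosen only to illustrate the order in which the stages run.

```python
def analyze(text, char_filters, tokenizer, token_filters):
    """Hypothetical pipeline: char filters -> tokenizer -> token filters."""
    # 1. Character filters transform the raw character stream.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. The tokenizer splits the character stream into a token stream.
    tokens = tokenizer(text)
    # 3. Token filters transform the token stream.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Illustrative stand-ins (assumptions, not built-in Elasticsearch components).
def replace_ampersand(text):
    return text.replace("&", "and")


def split_on_whitespace(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


result = analyze(
    "Tom & Jerry",
    char_filters=[replace_ampersand],
    tokenizer=split_on_whitespace,
    token_filters=[lowercase],
)
print(result)  # ['tom', 'and', 'jerry']
```

The tokenizer sits in the middle: it never sees the original text, only what the character filters hand over, and the token filters never see anything except the tokens it emits.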

  • Word-oriented tokenizer: This splits the character stream into individual words.
  • Partial word tokenizer: This splits the character stream into fragments, each a sequence of characters whose length falls within a given range.
  • Structured text tokenizer: This splits the character stream into known structured tokens such as keywords, email addresses, and zip codes.
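The difference between the three types is easiest to see on small inputs. The following Python sketch emulates one simple tokenizer of each kind. The function names are hypothetical, and the logic is a simplified approximation loosely modeled on the `whitespace`, `ngram`, and `keyword` tokenizers, not the Elasticsearch implementation.

```python
def whitespace_tokenize(text):
    """Word-oriented: split the text into words on runs of whitespace."""
    return text.split()


def ngram_tokenize(text, min_gram=1, max_gram=2):
    """Partial word: emit every substring whose length lies in
    [min_gram, max_gram], ordered by start position and then by length."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


def keyword_tokenize(text):
    """Structured text: keep the whole input as a single token."""
    return [text]


print(whitespace_tokenize("The quick  fox"))   # ['The', 'quick', 'fox']
print(ngram_tokenize("fox", 2, 3))             # ['fo', 'fox', 'ox']
print(keyword_tokenize("user@example.com"))    # ['user@example.com']
```

Note how the same kind of input is treated very differently: the word-oriented sketch produces whole words, the partial word sketch produces overlapping fragments suited to partial matching, and the structured text sketch leaves a value such as an email address intact.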

We'll give an example for each built-in tokenizer and compile the results into the following tables. Let's first take a look at the Word-oriented ...

