Elasticsearch stores data in a structured form that is easy to access and search. To make analysis easier and the data more searchable, it processes each string at index time in the following steps:
- Tidy up the received string (sanitizing). This is done by a character filter, which runs before tokenization. It can remove unnecessary characters or transform certain characters as needed.
- Tokenize the string into terms for building an inverted index. This is done by tokenizers. Elasticsearch provides several tokenizer types, each of which splits the string into terms (tokens) in its own way.
- Normalize the data and ...
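
The first two steps above can be sketched in plain Python. This is a toy illustration of the idea, not Elasticsearch's implementation or API: the function names (`char_filter`, `tokenize`, `build_inverted_index`), the sample documents, and the specific rules (stripping HTML-like tags, mapping `&` to `and`, splitting on non-word characters) are all assumptions chosen for the example.

```python
import re
from collections import defaultdict


def char_filter(text):
    """Hypothetical character filter: sanitize the raw string before tokenizing."""
    # Remove HTML-like tags such as <b> and </b>.
    text = re.sub(r"<[^>]+>", "", text)
    # Transform a specific character into a word.
    return text.replace("&", "and")


def tokenize(text):
    """Hypothetical tokenizer: split on runs of non-word characters."""
    return [token for token in re.split(r"\W+", text) if token]


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(char_filter(text)):
            index[term].add(doc_id)
    return index


# Sample documents (made up for this sketch).
docs = {1: "<b>Quick</b> brown fox", 2: "Fox & hound"}
index = build_inverted_index(docs)
```

Because this sketch applies no normalization, `fox` (document 1) and `Fox` (document 2) end up as separate terms in the index, so a lookup for one does not find the other.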