Mastering ElasticSearch

Book description

Written for intermediate users, this tutorial helps you use the power of Apache Lucene and ElasticSearch to optimize your information retrieval. From design to implementation to management, it’s the all-inclusive guide.

  • Learn about Apache Lucene and ElasticSearch design and architecture to fully understand how this great search engine works
  • Design, configure, and distribute your index with a deep understanding of the workings behind it
  • Learn about the advanced features of ElasticSearch from an easy-to-read book with detailed examples that help you understand and use them

In Detail

ElasticSearch is a fast, distributed, scalable search engine written in Java. It leverages Apache Lucene capabilities to provide a new level of control over how you index and search even the largest sets of data.

"Mastering ElasticSearch" covers the intermediate and advanced functionality of ElasticSearch. It explains not only how ElasticSearch works but also guides you through its internals, such as caches, the Apache Lucene library, monitoring capabilities, and the Java API. In addition, you'll see the practical usage of ElasticSearch configuration parameters and the monitoring API, along with easy-to-use examples that show how to extend ElasticSearch by writing your own plugins.

"Mastering ElasticSearch" starts by showing you how Apache Lucene works and what the ElasticSearch architecture looks like. It covers advanced querying capabilities, index configuration control, index distribution, and ElasticSearch administration and troubleshooting. Finally, you'll see how to improve the user’s search experience, use the provided Java API, and develop your own custom plugins.

It will help you learn how Apache Lucene works in terms of both querying and indexing. You'll also learn how to use different scoring models, rescore documents using other queries, and alter how the index is written by using custom postings. You'll find out what segment merging is and how to configure it to your needs. You'll optimize your queries by modifying them to use filters, and you'll see why this is important. The book describes in detail how to use the shard allocation mechanisms present in ElasticSearch, such as forced awareness.

"Mastering ElasticSearch" will open your eyes to the practical use of the statistics and information APIs available at the index, node, and cluster levels, so you are not surprised by what ElasticSearch does while you are not looking. You'll also see how to troubleshoot by understanding how the Java garbage collector works, how to control I/O throttling, and how to see what threads are being executed at any given moment. If user spelling mistakes are making you lose sleep at night, don't worry anymore: the book will show you how to configure and use the ElasticSearch spell checker and improve the relevance of your queries. Last but not least, you'll see how to use the ElasticSearch Java API to access the ElasticSearch cluster from your JVM-based application, and you'll extend ElasticSearch by writing your own custom plugins.

If you are looking for a book that will allow you to easily extend your basic knowledge of ElasticSearch, or you want to go deeper into the world of full-text search using ElasticSearch, then this book is for you.

Table of contents

  1. Mastering ElasticSearch
    1. Table of Contents
    2. Mastering ElasticSearch
    3. Credits
    4. About the Authors
    5. About the Reviewers
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Introduction to ElasticSearch
      1. Introducing Apache Lucene
        1. Getting familiar with Lucene
        2. Overall architecture
        3. Analyzing your data
          1. Indexing and querying
        4. Lucene query language
          1. Understanding the basics
          2. Querying fields
          3. Term modifiers
          4. Handling special characters
      2. Introducing ElasticSearch
        1. Basic concepts
          1. Index
          2. Document
          3. Mapping
          4. Type
          5. Node
          6. Cluster
          7. Shard
          8. Replica
          9. Gateway
        2. Key concepts behind ElasticSearch architecture
        3. Working of ElasticSearch
          1. The bootstrap process
          2. Failure detection
          3. Communicating with ElasticSearch
            1. Indexing data
            2. Querying data
            3. Index configuration
            4. Administration and monitoring
      3. Summary
    9. 2. Power User Query DSL
      1. Default Apache Lucene scoring explained
        1. When a document is matched
        2. The TF/IDF scoring formula
          1. The Lucene conceptual formula
          2. The Lucene practical formula
        3. The ElasticSearch point of view
      2. Query rewrite explained
        1. Prefix query as an example
        2. Getting back to Apache Lucene
        3. Query rewrite properties
      3. Rescore
        1. Understanding rescore
        2. Example Data
        3. Query
        4. Structure of the rescore query
        5. Rescore parameters
        6. To sum up
      4. Bulk Operations
        1. MultiGet
        2. MultiSearch
      5. Sorting data
        1. Sorting with multivalued fields
        2. Sorting with multivalued geo fields
        3. Sorting with nested objects
      6. Update API
        1. Simple field update
        2. Conditional modifications using scripting
        3. Creating and deleting documents using the Update API
      7. Using filters to optimize your queries
        1. Filters and caching
          1. Not all filters are cached by default
          2. Changing ElasticSearch caching behavior
          3. Why bother naming the key for the cache?
          4. When to change the ElasticSearch filter caching behavior
        2. The terms lookup filter
          1. How does it work?
          2. Performance considerations
          3. Loading terms from inner objects
          4. Terms lookup filter cache settings
      8. Filter and scopes in ElasticSearch faceting mechanism
        1. Example data
        2. Faceting and filtering
        3. Filter as a part of the query
        4. The Facet filter
        5. Global scope
      9. Summary
    10. 3. Low-level Index Control
      1. Altering Apache Lucene scoring
        1. Available similarity models
        2. Setting per-field similarity
      2. Similarity model configuration
        1. Choosing the default similarity model
        2. Configuring the chosen similarity models
          1. Configuring TF/IDF similarity
          2. Configuring Okapi BM25 similarity
          3. Configuring DFR similarity
          4. Configuring IB similarity
      3. Using codecs
        1. Simple use cases
        2. Let's see how it works
        3. Available posting formats
        4. Configuring the codec behavior
          1. Default codec properties
          2. Direct codec properties
          3. Memory codec properties
          4. Pulsing codec properties
          5. Bloom filter-based codec properties
      4. NRT, flush, refresh, and transaction log
        1. Updating index and committing changes
          1. Changing the default refresh time
        2. The transaction log
          1. The transaction log configuration
        3. Near Real Time GET
      5. Looking deeper into data handling
        1. Input is not always analyzed
        2. Example usage
        3. Changing the analyzer during indexing
        4. Changing the analyzer during searching
        5. The pitfall and default analysis
      6. Segment merging under control
        1. Choosing the right merge policy
          1. The tiered merge policy
          2. The log byte size merge policy
          3. The log doc merge policy
        2. Merge policies configuration
          1. The tiered merge policy
          2. The log byte size merge policy
          3. The log doc merge policy
        3. Scheduling
          1. The concurrent merge scheduler
          2. The serial merge scheduler
          3. Setting the desired merge scheduler
      7. Summary
    11. 4. Index Distribution Architecture
      1. Choosing the right amount of shards and replicas
        1. Sharding and over allocation
        2. A positive example of over allocation
        3. Multiple shards versus multiple indices
        4. Replicas
      2. Routing explained
        1. Shards and data
        2. Let's test routing
          1. Indexing with routing
        3. Indexing with routing
          1. Querying
        4. Aliases
        5. Multiple routing values
      3. Altering the default shard allocation behavior
        1. Introducing ShardAllocator
        2. The even_shard ShardAllocator
        3. The balanced ShardAllocator
        4. The custom ShardAllocator
        5. Deciders
          1. SameShardAllocationDecider
          2. ShardsLimitAllocationDecider
          3. FilterAllocationDecider
          4. ReplicaAfterPrimaryActiveAllocationDecider
          5. ClusterRebalanceAllocationDecider
          6. ConcurrentRebalanceAllocationDecider
          7. DisableAllocationDecider
          8. AwarenessAllocationDecider
          9. ThrottlingAllocationDecider
          10. RebalanceOnlyWhenActiveAllocationDecider
          11. DiskThresholdDecider
      4. Adjusting shard allocation
        1. Allocation awareness
          1. Forcing allocation awareness
        2. Filtering
          1. But what those properties mean?
        3. Runtime allocation updating
          1. Index-level updates
          2. Cluster-level updates
        4. Defining total shards allowed per node
          1. Inclusion
          2. Requirements
          3. Exclusion
        5. Additional shard allocation properties
      5. Query execution preference
        1. Introducing the preference parameter
      6. Using our knowledge
        1. Assumptions
          1. Data volume and queries specification
        2. Configuration
          1. Node-level configuration
          2. Indices configuration
          3. The directories layout
          4. Gateway configuration
          5. Recovery
          6. Discovery
          7. Logging slow queries
          8. Logging garbage collector work
          9. Memory setup
          10. One more thing
        3. Changes are coming
          1. Reindexing
          2. Routing
          3. Multiple Indices
      7. Summary
    12. 5. ElasticSearch Administration
      1. Choosing the right directory implementation – the store module
        1. Store type
          1. The simple file system store
          2. The new IO filesystem store
          3. The MMap filesystem store
          4. The memory store
            1. Additional properties
          5. The default store type
      2. Discovery configuration
        1. Zen discovery
          1. Multicast
          2. Unicast
          3. Minimum master nodes
          4. Zen discovery fault detection
        2. Amazon EC2 discovery
          1. EC2 plugin's installation
            1. EC2 plugin's configuration
            2. Optional EC2 discovery configuration options
            3. EC2 nodes scanning configuration
          2. Gateway and recovery configuration
          3. Gateway recovery process
          4. Configuration properties
          5. Expectations on nodes
        3. Local gateway
          1. Backing up the local gateway
        4. Recovery configuration
          1. Cluster-level recovery configuration
          2. Index-level recovery settings
      3. Segments statistics
        1. Introducing the segments API
          1. The response
        2. Visualizing segments information
      4. Understanding ElasticSearch caching
        1. The filter cache
          1. Filter cache types
          2. Index-level filter cache configuration
          3. Node-level filter cache configuration
        2. The field data cache
          1. Index-level field data cache configuration
          2. Node-level field data cache configuration
          3. Filtering
            1. Adding field data filtering information
            2. Filtering by term frequency
            3. Filtering by regex
            4. Filtering by regex and term frequency
            5. The filtering example
        3. Clearing the caches
          1. Index, indices, and all caches clearing
          2. Clearing specific caches
          3. Clearing fields-related caches
      5. Summary
    13. 6. Fighting with Fire
      1. Knowing the garbage collector
        1. Java memory
          1. The life cycle of Java object and garbage collections
        2. Dealing with garbage collection problems
          1. Turning on logging of garbage collection work
          2. Using JStat
          3. Creating memory dumps
          4. More information on garbage collector work
          5. Adjusting garbage collector work in ElasticSearch
            1. Using standard startup script
            2. Service wrapper
        3. Avoiding swapping on Unix-like systems
      2. When it is too much for I/O – throttling explained
        1. Controlling I/O throttling
        2. Configuration
          1. Throttling type
          2. Maximum throughput per second
          3. Node throttling defaults
          4. Configuration example
      3. Speeding up queries using warmers
        1. Reason for using warmers
        2. Manipulating warmers
          1. Using the PUT Warmer API
          2. Adding warmers during index creation
          3. Adding warmers to templates
          4. Retrieving warmers
          5. Deleting warmers
          6. Disabling warmers
        3. Testing the warmers
          1. Querying without warmers present
          2. Querying with warmer present
      4. Very hot threads
        1. Hot Threads API usage clarification
        2. Hot Threads API response
      5. Real-life scenarios
        1. Slower and slower performance
        2. Heterogeneous environment and load imbalance
        3. My server is under fire
      6. Summary
    14. 7. Improving the User Search Experience
      1. Correcting user spelling mistakes
        1. Test data
        2. Getting into technical details
          1. Suggesters
          2. Using the _suggest REST endpoint
            1. Understanding the REST endpoint suggester response
          3. Including suggestions requests in a query
            1. Suggester response
          4. The term suggester
            1. Configuration
              1. Common term suggester options
              2. Additional term suggester options
          5. The phrase suggester
            1. The usage example
            2. Configuration
              1. Basic configuration
              2. Configuring smoothing models
                1. Stupid backoff
                2. Laplace
                3. Linear interpolation
              3. Configuring candidate generators
                1. Direct generators
                2. Configuring direct generators
        3. Completion suggester
          1. The logic behind completion suggester
          2. Using completion suggester
            1. Indexing data
            2. Querying data
            3. Custom weights
            4. Additional parameters
      2. Improving query relevance
        1. The data
        2. The quest for improving relevance
          1. The standard query
          2. The Multi match query
          3. Phrases comes into play
          4. Let's throw the garbage away
          5. And now we boost
          6. Making a misspelling-proof search
          7. Drill downs with faceting
      3. Summary
    15. 8. ElasticSearch Java APIs
      1. Introducing the ElasticSearch Java API
      2. The code
      3. Connecting to your cluster
        1. Becoming the ElasticSearch node
        2. Using the transport connection method
        3. Choosing the right connection method
      4. Anatomy of the API
      5. CRUD operations
        1. Fetching documents
          1. Handling errors
        2. Indexing documents
        3. Updating documents
        4. Deleting documents
      6. Querying ElasticSearch
        1. Preparing a query
        2. Building queries
          1. Using the match all documents query
          2. The match query
          3. Using the geo shape query
        3. Paging
        4. Sorting
        5. Filtering
        6. Faceting
        7. Highlighting
        8. Suggestions
        9. Counting
        10. Scrolling
      7. Performing multiple actions
        1. Bulk
        2. The delete by query
        3. Multi GET
        4. Multi Search
      8. Percolator
        1. ElasticSearch 1.0 and higher
      9. The explain API
      10. Building JSON queries and documents
      11. The administration API
        1. The cluster administration API
          1. The cluster and indices health API
          2. The cluster state API
          3. The update settings API
          4. The reroute API
          5. The nodes information API
          6. The node statistics API
          7. The nodes hot threads API
          8. The nodes shutdown API
          9. The search shards API
        2. The Indices administration API
          1. The index existence API
          2. The Type existence API
          3. The indices stats API
          4. Index status
          5. Segments information API
          6. Creating an index API
          7. Deleting an index
          8. Closing an index
          9. Opening an index
          10. The Refresh API
          11. The Flush API
          12. The Optimize API
          13. The put mapping API
          14. The delete mapping API
          15. The gateway snapshot API
          16. The aliases API
          17. The get aliases API
          18. The aliases exists API
          19. The clear cache API
          20. The update settings API
          21. The analyze API
          22. The put template API
          23. The delete template API
          24. The validate query API
          25. The put warmer API
          26. The delete warmer API
      12. Summary
    16. 9. Developing ElasticSearch Plugins
      1. Creating the Apache Maven project structure
        1. Understanding the basics
        2. Structure of the Maven Java project
        3. The idea of POM
        4. Running the build process
        5. Introducing the assembly Maven plugin
      2. Creating a custom river plugin
        1. Implementation details
          1. Implementing the URLChecker class
          2. Implementing the JSONRiver class
          3. Implementing the JSONRiverModule class
          4. Implementing the JSONRiverPlugin class
          5. Informing ElasticSearch about the JSONRiver plugin class
        2. Testing our river
          1. Building our river
          2. Installing our river
          3. Initializing our river
          4. Checking if our JSON river works
      3. Creating custom analysis plugin
        1. Implementation details
          1. Implementing TokenFilter
          2. Implementing the TokenFilter factory
          3. Implementing custom analyzer
          4. Implementing analyzer provider
          5. Implementing analysis binder
          6. Implementing analyzer indices component
          7. Implementing analyzer module
          8. Implementing analyzer plugin
          9. Informing ElasticSearch about our custom analyzer
        2. Testing our custom analysis plugin
          1. Building our custom analysis plugin
          2. Installing the custom analysis plugin
          3. Checking if our analysis plugin works
      4. Summary
    17. Index

Product information

  • Title: Mastering ElasticSearch
  • Author(s):
  • Release date: October 2013
  • Publisher(s): Packt Publishing
  • ISBN: 9781783281435