Elasticsearch: The Definitive Guide

by Clinton Gormley, Zachary Tong

January 2015

Intermediate to advanced

724 pages

13h 21m

English

O'Reilly Media, Inc.

Foreword
Preface
Who Should Read This Book; Why We Wrote This Book; Elasticsearch Version; How to Read This Book; Navigating This Book; Online Resources; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
I. Getting Started
1. You Know, for Search…
Installing Elasticsearch; Installing Marvel; Running Elasticsearch; Viewing Marvel and Sense; Talking to Elasticsearch; Java API; RESTful API with JSON over HTTP; Document Oriented; JSON; Finding Your Feet; Let’s Build an Employee Directory; Indexing Employee Documents; Retrieving a Document; Search Lite; Search with Query DSL; More-Complicated Searches; Full-Text Search; Phrase Search; Highlighting Our Searches; Analytics; Tutorial Conclusion; Distributed Nature; Next Steps
2. Life Inside a Cluster
An Empty Cluster; Cluster Health; Add an Index; Add Failover; Scale Horizontally; Then Scale Some More; Coping with Failure
3. Data In, Data Out
What Is a Document?; Document Metadata; _index; _type; _id; Other Metadata; Indexing a Document; Using Our Own ID; Autogenerating IDs; Retrieving a Document; Retrieving Part of a Document; Checking Whether a Document Exists; Updating a Whole Document; Creating a New Document; Deleting a Document; Dealing with Conflicts; Optimistic Concurrency Control; Using Versions from an External System; Partial Updates to Documents; Using Scripts to Make Partial Updates; Updating a Document That May Not Yet Exist; Updates and Conflicts; Retrieving Multiple Documents; Cheaper in Bulk; Don’t Repeat Yourself; How Big Is Too Big?
4. Distributed Document Store
Routing a Document to a Shard; How Primary and Replica Shards Interact; Creating, Indexing, and Deleting a Document; Retrieving a Document; Partial Updates to a Document; Multidocument Patterns; Why the Funny Format?
5. Searching—The Basic Tools
The Empty Search; hits; took; shards; timeout; Multi-index, Multitype; Pagination; Search Lite; The _all Field; More Complicated Queries
6. Mapping and Analysis
Exact Values Versus Full Text; Inverted Index; Analysis and Analyzers; Built-in Analyzers; When Analyzers Are Used; Testing Analyzers; Specifying Analyzers; Mapping; Core Simple Field Types; Viewing the Mapping; Customizing Field Mappings; Updating a Mapping; Testing the Mapping; Complex Core Field Types; Multivalue Fields; Empty Fields; Multilevel Objects; Mapping for Inner Objects; How Inner Objects are Indexed; Arrays of Inner Objects
7. Full-Body Search
Empty Search; Query DSL; Structure of a Query Clause; Combining Multiple Clauses; Queries and Filters; Performance Differences; When to Use Which; Most Important Queries and Filters; term Filter; terms Filter; range Filter; exists and missing Filters; bool Filter; match_all Query; match Query; multi_match Query; bool Query; Combining Queries with Filters; Filtering a Query; Just a Filter; A Query as a Filter; Validating Queries; Understanding Errors; Understanding Queries

8. Sorting and Relevance
Sorting; Sorting by Field Values; Multilevel Sorting; Sorting on Multivalue Fields; String Sorting and Multifields; What Is Relevance?; Understanding the Score; Understanding Why a Document Matched; Fielddata
9. Distributed Search Execution
Query Phase; Fetch Phase; Search Options; preference; timeout; routing; search_type; scan and scroll
10. Index Management
Creating an Index; Deleting an Index; Index Settings; Configuring Analyzers; Custom Analyzers; Creating a Custom Analyzer; Types and Mappings; How Lucene Sees Documents; How Types Are Implemented; Avoiding Type Gotchas; The Root Object; Properties; Metadata: _source Field; Metadata: _all Field; Metadata: Document Identity; Dynamic Mapping; Customizing Dynamic Mapping; date_detection; dynamic_templates; Default Mapping; Reindexing Your Data; Index Aliases and Zero Downtime
11. Inside a Shard
Making Text Searchable; Immutability; Dynamically Updatable Indices; Deletes and Updates; Near Real-Time Search; refresh API; Making Changes Persistent; flush API; Segment Merging; optimize API
II. Search in Depth
12. Structured Search
Finding Exact Values; term Filter with Numbers; term Filter with Text; Internal Filter Operation; Combining Filters; Bool Filter; Nesting Boolean Filters; Finding Multiple Exact Values; Contains, but Does Not Equal; Equals Exactly; Ranges; Ranges on Dates; Ranges on Strings; Dealing with Null Values; exists Filter; missing Filter; exists/missing on Objects; All About Caching; Independent Filter Caching; Controlling Caching; Filter Order
13. Full-Text Search
Term-Based Versus Full-Text; The match Query; Index Some Data; A Single-Word Query; Multiword Queries; Improving Precision; Controlling Precision; Combining Queries; Score Calculation; Controlling Precision; How match Uses bool; Boosting Query Clauses; Controlling Analysis; Default Analyzers; Configuring Analyzers in Practice; Relevance Is Broken!
14. Multifield Search
Multiple Query Strings; Prioritizing Clauses; Single Query String; Know Your Data; Best Fields; dis_max Query; Tuning Best Fields Queries; tie_breaker; multi_match Query; Using Wildcards in Field Names; Boosting Individual Fields; Most Fields; Multifield Mapping; Cross-fields Entity Search; A Naive Approach; Problems with the most_fields Approach; Field-Centric Queries; Problem 1: Matching the Same Word in Multiple Fields; Problem 2: Trimming the Long Tail; Problem 3: Term Frequencies; Solution; Custom _all Fields; cross-fields Queries; Per-Field Boosting; Exact-Value Fields
15. Proximity Matching
Phrase Matching; Term Positions; What Is a Phrase; Mixing It Up; Multivalue Fields; Closer Is Better; Proximity for Relevance; Improving Performance; Rescoring Results; Finding Associated Words; Producing Shingles; Multifields; Searching for Shingles; Performance
16. Partial Matching
Postcodes and Structured Data; prefix Query; wildcard and regexp Queries; Query-Time Search-as-You-Type; Index-Time Optimizations; Ngrams for Partial Matching; Index-Time Search-as-You-Type; Preparing the Index; Querying the Field; Edge n-grams and Postcodes; Ngrams for Compound Words
17. Controlling Relevance
Theory Behind Relevance Scoring; Boolean Model; Term Frequency/Inverse Document Frequency (TF/IDF); Vector Space Model; Lucene’s Practical Scoring Function; Query Normalization Factor; Query Coordination; Index-Time Field-Level Boosting; Query-Time Boosting; Boosting an Index; t.getBoost(); Manipulating Relevance with Query Structure; Not Quite Not; boosting Query; Ignoring TF/IDF; constant_score Query; function_score Query; Boosting by Popularity; modifier; factor; boost_mode; max_boost; Boosting Filtered Subsets; filter Versus query; functions; score_mode; Random Scoring; The Closer, The Better; Understanding the price Clause; Scoring with Scripts; Pluggable Similarity Algorithms; Okapi BM25; Changing Similarities; Configuring BM25; Relevance Tuning Is the Last 10%
III. Dealing with Human Language
18. Getting Started with Languages
Using Language Analyzers; Configuring Language Analyzers; Pitfalls of Mixing Languages; At Index Time; At Query Time; Identifying Language; One Language per Document; Foreign Words; One Language per Field; Mixed-Language Fields; Split into Separate Fields; Analyze Multiple Times; Use n-grams
19. Identifying Words
standard Analyzer; standard Tokenizer; Installing the ICU Plug-in; icu_tokenizer; Tidying Up Input Text; Tokenizing HTML; Tidying Up Punctuation
20. Normalizing Tokens
In That Case; You Have an Accent; Retaining Meaning; Living in a Unicode World; Unicode Case Folding; Unicode Character Folding; Sorting and Collations; Case-Insensitive Sorting; Differences Between Languages; Unicode Collation Algorithm; Unicode Sorting; Specifying a Language; Customizing Collations
21. Reducing Words to Their Root Form
Algorithmic Stemmers; Using an Algorithmic Stemmer; Dictionary Stemmers; Hunspell Stemmer; Installing a Dictionary; Per-Language Settings; Creating a Hunspell Token Filter; Hunspell Dictionary Format; Choosing a Stemmer; Stemmer Performance; Stemmer Quality; Stemmer Degree; Making a Choice; Controlling Stemming; Preventing Stemming; Customizing Stemming; Stemming in situ; Is Stemming in situ a Good Idea
22. Stopwords: Performance Versus Precision
Pros and Cons of Stopwords; Using Stopwords; Stopwords and the Standard Analyzer; Maintaining Positions; Specifying Stopwords; Using the stop Token Filter; Updating Stopwords; Stopwords and Performance; and Operator; minimum_should_match; Divide and Conquer; Controlling Precision; Only High-Frequency Terms; More Control with Common Terms; Stopwords and Phrase Queries; Positions Data; Index Options; Stopwords; common_grams Token Filter; At Index Time; Unigram Queries; Bigram Phrase Queries; Two-Word Phrases; Stopwords and Relevance
23. Synonyms
Using Synonyms; Formatting Synonyms; Expand or contract; Simple Expansion; Simple Contraction; Genre Expansion; Synonyms and The Analysis Chain; Case-Sensitive Synonyms; Multiword Synonyms and Phrase Queries; Use Simple Contraction for Phrase Queries; Synonyms and the query_string Query; Symbol Synonyms
24. Typoes and Mispelings
Fuzziness; Fuzzy Query; Improving Performance; Fuzzy match Query; Scoring Fuzziness; Phonetic Matching
IV. Aggregations
25. High-Level Concepts
Buckets; Metrics; Combining the Two
26. Aggregation Test-Drive
Adding a Metric to the Mix; Buckets Inside Buckets; One Final Modification
27. Building Bar Charts
28. Looking at Time
Returning Empty Buckets; Extended Example; The Sky’s the Limit
29. Scoping Aggregations
Global Bucket
30. Filtering Queries and Aggregations
Filtered Query; Filter Bucket; Post Filter; Recap
31. Sorting Multivalue Buckets
Intrinsic Sorts; Sorting by a Metric; Sorting Based on “Deep” Metrics
32. Approximate Aggregations
Finding Distinct Counts; Understanding the Trade-offs; Optimizing for Speed; Calculating Percentiles; Percentile Metric; Percentile Ranks; Understanding the Trade-offs
33. Significant Terms
significant_terms Demo; Recommending Based on Popularity; Recommending Based on Statistics
34. Controlling Memory Use and Latency
Fielddata; Aggregations and Analysis; High-Cardinality Memory Implications; Limiting Memory Usage; Fielddata Size; Monitoring fielddata; Circuit Breaker; Fielddata Filtering; Doc Values; Enabling Doc Values; Preloading Fielddata; Eagerly Loading Fielddata; Global Ordinals; Index Warmers; Preventing Combinatorial Explosions; Depth-First Versus Breadth-First
35. Closing Thoughts
V. Geolocation
36. Geo-Points
Lat/Lon Formats; Filtering by Geo-Point; geo_bounding_box Filter; Optimizing Bounding Boxes; geo_distance Filter; Faster Geo-Distance Calculations; geo_distance_range Filter; Caching geo-filters; Reducing Memory Usage; Sorting by Distance; Scoring by Distance
37. Geohashes
Mapping Geohashes; geohash_cell Filter
38. Geo-aggregations
geo_distance Aggregation; geohash_grid Aggregation; geo_bounds Aggregation
39. Geo-shapes
Mapping geo-shapes; precision; distance_error_pct; Indexing geo-shapes; Querying geo-shapes; Querying with Indexed Shapes; Geo-shape Filters and Caching
VI. Modeling Your Data
40. Handling Relationships
Application-side Joins; Denormalizing Your Data; Field Collapsing; Denormalization and Concurrency; Renaming Files and Directories; Solving Concurrency Issues; Global Locking; Document Locking; Tree Locking
41. Nested Objects
Nested Object Mapping; Querying a Nested Object; Sorting by Nested Fields; Nested Aggregations; reverse_nested Aggregation; When to Use Nested Objects
42. Parent-Child Relationship
Parent-Child Mapping; Indexing Parents and Children; Finding Parents by Their Children; min_children and max_children; Finding Children by Their Parents; Children Aggregation; Grandparents and Grandchildren; Practical Considerations; Memory Use; Global Ordinals and Latency; Multigenerations and Concluding Thoughts
43. Designing for Scale
The Unit of Scale; Shard Overallocation; Kagillion Shards; Capacity Planning; Replica Shards; Balancing Load with Replicas; Multiple Indices; Time-Based Data; Index per Time Frame; Index Templates; Retiring Data; Migrate Old Indices; Optimize Indices; Closing Old Indices; Archiving Old Indices; User-Based Data; Shared Index; Faking Index per User with Aliases; One Big User; Scale Is Not Infinite
VII. Administration, Monitoring, and Deployment
44. Monitoring
Marvel for Monitoring; Cluster Health; Drilling Deeper: Finding Problematic Indices; Blocking for Status Changes; Monitoring Individual Nodes; indices Section; OS and Process Sections; JVM Section; Threadpool Section; FS and Network Sections; Circuit Breaker; Cluster Stats; Index Stats; Pending Tasks; cat API
45. Production Deployment
Hardware; Memory; CPUs; Disks; Network; General Considerations; Java Virtual Machine; Transport Client Versus Node Client; Configuration Management; Important Configuration Changes; Assign Names; Paths; Minimum Master Nodes; Recovery Settings; Prefer Unicast over Multicast; Don’t Touch These Settings!; Garbage Collector; Threadpools; Heap: Sizing and Swapping; Give Half Your Memory to Lucene; Don’t Cross 32 GB!; Swapping Is the Death of Performance; File Descriptors and MMap; Revisit This List Before Production
46. Post-Deployment
Changing Settings Dynamically; Logging; Slowlog; Indexing Performance Tips; Test Performance Scientifically; Using and Sizing Bulk Requests; Storage; Segments and Merging; Other; Rolling Restarts; Backing Up Your Cluster; Creating the Repository; Snapshotting All Open Indices; Snapshotting Particular Indices; Listing Information About Snapshots; Deleting Snapshots; Monitoring Snapshot Progress; Canceling a Snapshot; Restoring from a Snapshot; Monitoring Restore Operations; Canceling a Restore; Clusters Are Living, Breathing Creatures
Index

Content preview from Elasticsearch: The Definitive Guide

Chapter 19. Identifying Words

A word in English is relatively simple to spot: words are separated by whitespace or (some) punctuation. Even in English, though, there can be controversy: is you’re one word or two? What about o’clock, cooperate, half-baked, or eyewitness?
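To make the ambiguity concrete, here is a minimal Python sketch (not taken from the book) that tokenizes the same sentence under two naive rules. The function names and the sample sentence are illustrative assumptions, and neither rule reflects how any Elasticsearch analyzer works.

```python
import re


def split_on_whitespace(text):
    """Treat every run of non-whitespace characters as one word."""
    return text.split()


def split_on_non_letters(text):
    """Treat every run of ASCII letters as one word, so apostrophes
    and hyphens act as separators."""
    return re.findall(r"[A-Za-z]+", text)


# Hypothetical sample sentence built from the words discussed above.
sentence = "You're a half-baked eyewitness"

whitespace_tokens = split_on_whitespace(sentence)
letter_tokens = split_on_non_letters(sentence)

print(whitespace_tokens)  # ["You're", 'a', 'half-baked', 'eyewitness']
print(letter_tokens)      # ['You', 're', 'a', 'half', 'baked', 'eyewitness']
```

Both rules are defensible, yet they disagree on whether "You're" and "half-baked" are one word or two, which is exactly the kind of controversy a tokenizer has to settle.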

Languages like German or Dutch combine individual words to create longer compound words like Weißkopfseeadler (white-headed sea eagle). To return Weißkopfseeadler as a result for the query Adler (eagle), we need to understand how to break up compound words into their constituent parts.
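One way to picture the decomposition is a dictionary-based splitter. The following Python sketch is an illustration only: the `decompound` function and the four-entry `VOCABULARY` are hypothetical, and a real decompounder needs a far larger word list plus handling for linking elements between parts.

```python
def decompound(word, vocabulary):
    """Split `word` into parts that all appear in `vocabulary`.

    Tries the longest candidate part first and backtracks if the
    remainder cannot be split. Returns [] when no full split exists.
    """
    word = word.lower()
    if not word:
        return []

    def split_from(start):
        if start == len(word):
            return []  # consumed the whole word: successful split
        for end in range(len(word), start, -1):
            part = word[start:end]
            if part in vocabulary:
                rest = split_from(end)
                if rest is not None:
                    return [part] + rest
        return None  # no vocabulary entry fits at this position

    return split_from(0) or []


# Hypothetical toy vocabulary, just large enough for the example word.
VOCABULARY = {"weiß", "kopf", "see", "adler"}

parts = decompound("Weißkopfseeadler", VOCABULARY)
print(parts)  # ['weiß', 'kopf', 'see', 'adler']

# A query for "Adler" can now match, because "adler" is one of the parts.
print("adler" in parts)  # True
```

Indexing the parts alongside the original compound is what lets a search for Adler find Weißkopfseeadler.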

Asian languages are even more complex: some have no whitespace between words, sentences, or even paragraphs. Some words can be represented by a single character, but the same single character, when placed next to other characters, can form just one part of a longer word with a quite different meaning.
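A short Python sketch shows why whitespace is not enough here. The Chinese sample phrase is an assumption chosen for illustration; it is not from the book.

```python
# Hypothetical sample: "我买东西" means "I buy things".
text = "我买东西"

# Splitting on whitespace finds no boundaries, so the whole
# phrase comes back as a single "word".
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['我买东西']

# Splitting into single characters goes too far the other way:
# 东 alone means "east" and 西 alone means "west", but together
# 东西 is the word for "thing".
character_tokens = list(text)
print(character_tokens)  # ['我', '买', '东', '西']
```

Neither split recovers the actual words (我, 买, 东西), which is why such text needs language-aware segmentation rather than a rule based on separators.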

It should be obvious that there is no silver-bullet analyzer that will miraculously deal with all human languages. Elasticsearch ships with dedicated analyzers for many languages, and more language-specific analyzers are available as plug-ins.

However, not all languages have dedicated analyzers, and sometimes you won’t even be sure which language(s) you are dealing with. For these situations, we need good standard tools that do a reasonable job regardless of language.

standard Analyzer

The standard analyzer is used by default for any full-text analyzed string field. If we were to reimplement the standard analyzer ...

ISBN: 9781449358532
