Apache Solr: A Practical Approach to Enterprise Search

Book description

Build an enterprise search engine using Apache Solr: index and search documents; ingest data from varied sources; apply various text processing techniques; utilize different search capabilities; and customize Solr to retrieve the desired results. Apache Solr: A Practical Approach to Enterprise Search explains each essential concept, backed by practical and industry examples, to help you attain expert-level knowledge.

The book, which assumes a basic knowledge of Java, starts with an introduction to Solr, followed by the steps for setting it up, indexing your first set of documents, and searching them. It then introduces you to information retrieval and its implementation in Apache Solr; this will help you understand your search problem, decide on an approach for building an effective solution, and use various metrics to evaluate the results.

The book next covers schema design and techniques for building a text analysis chain that cleanses, normalizes, and enriches your documents and addresses different types of search queries. It describes popular matching techniques that are generally applied to improve the precision and recall of searches.

You will learn the end-to-end process of data ingestion from varied sources, metadata extraction, pre-processing and transformation of content, various search components, query parsers and other advanced search capabilities.

After covering out-of-the-box features, Solr expert Dikshant Shahi dives into ways you can customize Solr for your business and its specific requirements, along with ways to plug in your own components. Most important, you will learn about implementations for Solr scoring, the factors affecting the document score, and tuning the score for the application at hand. The book explains why textual scoring alone is not sufficient for practical ranking of documents and describes ways to integrate real-world factors that contribute to document ranking.

You'll see how to influence user experience by providing suggestions and recommendations. You'll also see integration of Solr with important related technologies such as OpenNLP and Tika. Additionally, you will learn about scaling Solr using SolrCloud.

This book concludes with coverage of semantic search capabilities, which are crucial for taking the search experience to the next level. By the end of Apache Solr, you will be proficient in designing and developing your search engine.

Table of contents

  1. Cover
  2. Title
  3. Copyright
  4. Dedication
  5. Contents at a Glance
  6. Contents
  7. About the Author
  8. About the Technical Reviewer
  9. Acknowledgments
  10. Introduction
  11. Chapter 1: Apache Solr: An Introduction
    1. Overview
    2. What Makes Apache Solr So Popular
    3. Major Building Blocks
    4. History
    5. What’s New in Solr 5.x
    6. Beyond Search
    7. Solr vs. Other Options
      1. Relational Databases
      2. Elasticsearch
    8. Related Technologies
    9. Summary
    10. Resources
  12. Chapter 2: Solr Setup and Administration
    1. Stand-Alone Server
    2. Prerequisites
    3. Download
    4. Terminology
      1. General Terminology
      2. SolrCloud Terminology
      3. Important Configuration Files
    5. Directory Structure
      1. Solr Installation
      2. Solr Home
    6. Hands-On Exercise
      1. Start Solr
      2. Create a Core
      3. Index Some Data
      4. Search for Results
    7. Solr Script
      1. Starting Solr
      2. Using Solr Help
      3. Stopping Solr
      4. Restarting Solr
      5. Determining Solr Status
      6. Configuring Solr Start
    8. Admin Web Interface
    9. Core Management
      1. Config Sets
      2. Create Configset
      3. Create Core
      4. Core Status
      5. Unload Core
      6. Delete Core
      7. Core Rename
      8. Core Swap
      9. Core Split
      10. Index Backup
      11. Index Restore
    10. Instance Management
      1. Setting Solr Home
      2. Memory Management
      3. Log Management
    11. Common Exceptions
      1. OutOfMemoryError—Java Heap Space
      2. OutOfMemoryError—PermGen Space
      3. TooManyOpenFiles
      4. UnSupportedClassVersionException
    12. Summary
  13. Chapter 3: Information Retrieval
    1. Introduction to Information Retrieval
    2. Search Engines
    3. Data and Its Categorization
      1. Structured
      2. Unstructured
      3. Semistructured
    4. Content Extraction
    5. Text Processing
      1. Cleansing and Normalization
      2. Enrichment
      3. Metadata Generation
    6. Inverted Index
    7. Retrieval Models
      1. Boolean Model
      2. Vector Space Model
      3. Probabilistic Model
      4. Language Model
    8. Information Retrieval Process
      1. Plan
      2. Execute
      3. Evaluate
    9. Summary
  14. Chapter 4: Schema Design and Text Analysis
    1. Schema Design
      1. Documents
      2. schema.xml File
      3. Fields
      4. fieldType
      5. copyField
      6. Define the Unique Key
      7. Dynamic Fields
      8. defaultSearchField
      9. solrQueryParser
      10. Similarity
    2. Text Analysis
      1. Tokens
      2. Terms
      3. Analyzers
      4. Analysis Phases
      5. Analysis Tools
      6. Analyzer Components
      7. Common Text Analysis Techniques
    3. Going Schemaless
      1. What Makes Solr Schemaless
      2. Configuration
      3. Limitations
    4. REST API for Managing Schema
      1. Configuration
      2. REST Endpoints
      3. Other Managed Resources
    5. solrconfig.xml File
    6. Frequently Asked Questions
      1. How do I handle the exception indicating that the _version_ field must exist in the schema?
      2. Why is my schema change not reflected in Solr?
      3. I have created a core in Solr 5.0, but schema.xml is missing. Where can I find it?
    7. Summary
  15. Chapter 5: Indexing Data
    1. Indexing Tools
      1. Post Script
      2. SimplePostTool
      3. curl
      4. SolrJ Java Library
      5. Other Libraries
    2. Indexing Process
      1. UpdateRequestHandler
      2. UpdateRequestProcessorChain
      3. UpdateRequestProcessor vs. Analyzer/Tokenizer
    3. Indexing Operations
      1. XML Documents
      2. JSON Documents
      3. CSV Documents
      4. Index Rich Documents
    4. DataImportHandler
    5. Document Preprocessing
      1. Language Detection
      2. Generate Unique ID
      3. Deduplication
      4. Document Expiration
    6. Indexing Performance
    7. Custom Components
      1. Custom UpdateRequestProcessor
    8. Frequently Occurring Problems
      1. Copying Multiple Fields to a Single-Valued Field
    9. Summary
  16. Chapter 6: Searching Data
    1. Search Basics
    2. Prerequisites
    3. Solr Search Process
      1. SearchHandler
      2. SearchComponent
      3. QueryParser
      4. QueryResponseWriter
    4. Solr Query
      1. Default Query
      2. Phrase Query
      3. Proximity Query
      4. Fuzzy Query
      5. Wildcard Query
      6. Range Query
      7. Function Query
      8. Filter Query
      9. Query Boosting
      10. Global Query Parameters
    5. Query Parsers
      1. Standard Query Parser
      2. DisMax Query Parser
      3. eDisMax Query Parser
    6. JSON Request API
    7. Customizing Solr
      1. Custom SearchComponent
      2. Sample Component
    8. Frequently Asked Questions
      1. I have used KeywordTokenizerFactory in fieldType definition but why is my query string getting tokenized on whitespace?
      2. How can I find all the documents that contain no value?
      3. How can I apply negative boost on terms?
      4. Which are the special characters in a query string? How should they be handled?
    9. Summary
  17. Chapter 7: Searching Data: Part 2
    1. Local Parameters
      1. Syntax
      2. Example
    2. Result Grouping
      1. Prerequisites
      2. Request Parameters
      3. Example
    3. Statistics
      1. Request Parameters
      2. Supported Methods
      3. LocalParams
      4. Example
    4. Faceting
      1. Prerequisites
      2. Syntax
      3. Example
      4. Faceting Types
    5. Reranking Query
      1. Request Parameters
      2. Example
    6. Join Query
      1. Limitations
      2. Example
    7. Block Join
      1. Prerequisites
      2. Example
    8. Function Query
      1. Prerequisites
      2. Usage
      3. Function Categories
      4. Example
      5. Caution
      6. Custom Function Query
    9. Referencing an External File
      1. Usage
    10. Summary
  18. Chapter 8: Solr Scoring
    1. Introduction to Solr Scoring
    2. Default Scoring
      1. Implementation
      2. Scoring Factors
      3. Scoring Formula
      4. Limitations
    3. Explain Query
    4. Alternative Scoring Models
      1. BM25Similarity
      2. DFRSimilarity
      3. Other Similarity Measures
    5. Per Field Similarity
    6. Custom Similarity
    7. Summary
  19. Chapter 9: Additional Features
    1. Sponsored Search
      1. Usage
    2. Spell-Checking
      1. Generic Parameters
      2. Implementations
      3. How It Works
      4. Usage
    3. Autocomplete
      1. Traditional Approach
      2. SuggestComponent
    4. Document Similarity
      1. Prerequisites
      2. Implementations
    5. Summary
  20. Chapter 10: Traditional Scaling and SolrCloud
    1. Stand-Alone Mode
    2. Sharding
    3. Master-Slave Architecture
      1. Master
      2. Slave
    4. Shards with Master-Slave
    5. SolrCloud
      1. Understanding the Terminology
      2. Starting SolrCloud
      3. Restarting a Node
      4. Creating a Collection
      5. Uploading to ZooKeeper
      6. Deleting a Collection
      7. Indexing a Document
      8. Load Balancing
      9. Document Routing
      10. Working with a Transaction Log
      11. Performing a Shard Health Check
      12. Querying Results
      13. Performing a Recovery
      14. Shard Splitting
      15. Adding a Replica
    6. ZooKeeper
    7. Frequently Asked Questions
      1. Why is the size of my data/tlog directory growing drastically? How can I handle that?
      2. Can I totally disable transaction logs? What would be the impact?
      3. I have recently migrated from traditional architecture to SolrCloud. Is there anything that I should be careful of and not do in SolrCloud?
      4. I am migrating to SolrCloud, but it fails to upload the configurations to ZooKeeper. What could be the reason?
    8. Summary
  21. Chapter 11: Semantic Search
    1. Limitations of Keyword Systems
    2. Semantic Search
    3. Tools
      1. OpenNLP
      2. Apache UIMA
      3. Apache Stanbol
    4. Techniques Applied
    5. Part-of-Speech Tagging
      1. Solr Plug-in for POS Tagging
    6. Named-Entity Extraction
      1. Using Rules and Regex
      2. Using a Dictionary or Gazetteer
      3. Using a Trained Model
    7. Semantic Enrichment
      1. Synonym Expansion
      2. WordNet
      3. Solr Plug-in for Synonym Expansion
    8. Summary
  22. Index

Product information

  • Title: Apache Solr: A Practical Approach to Enterprise Search
  • Author(s): Dikshant Shahi
  • Release date: December 2015
  • Publisher(s): Apress
  • ISBN: 9781484210703