Scaling Big Data with Hadoop and Solr - Second Edition

Book Description

Understand, design, build, and optimize your big data search engine with Hadoop and Apache Solr

In Detail

Together, Apache Hadoop and Apache Solr help organizations resolve the problem of information extraction from big data by providing excellent distributed faceted search capabilities.

This book will help you learn everything you need to know to build a distributed enterprise search platform and to optimize it so that it makes maximum use of the available resources. Starting with the basics of Apache Hadoop and Solr, the book moves on to advanced search optimization topics, with real-world use cases and sample Java code.

This is a step-by-step guide that will teach you how to build a high-performance enterprise search platform while scaling data with Hadoop and Solr in an effortless manner.

What You Will Learn

  • Understand Apache Hadoop, its ecosystem, and Apache Solr
  • Explore industry-based architectures for big data enterprise search, along with their applicability and benefits
  • Integrate Apache Solr with big data technologies such as Cassandra to enable better scalability and high availability for big data
  • Optimize the performance of your big data search platform as your data scales
  • Write MapReduce tasks to index your data
  • Configure your Hadoop instance to handle real-world big data problems
  • Work with Hadoop and Solr using real-world examples to benefit from their practical usage
  • Use Apache Solr as a NoSQL database

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Scaling Big Data with Hadoop and Solr Second Edition
    1. Table of Contents
    2. Scaling Big Data with Hadoop and Solr Second Edition
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Processing Big Data Using Hadoop and MapReduce
      1. Apache Hadoop's ecosystem
        1. Core components
        2. Understanding Hadoop's ecosystem
      2. Configuring Apache Hadoop
        1. Prerequisites
        2. Setting up ssh without passphrase
        3. Configuring Hadoop
      3. Running Hadoop
      4. Setting up a Hadoop cluster
      5. Common problems and their solutions
      6. Summary
    9. 2. Understanding Apache Solr
      1. Setting up Apache Solr
        1. Prerequisites for setting up Apache Solr
        2. Running Apache Solr on jetty
        3. Running Solr on other J2EE containers
        4. Hello World with Apache Solr!
          1. Understanding Solr administration
          2. Solr navigation
        5. Common problems and solutions
      2. The Apache Solr architecture
      3. Configuring Solr
        1. Understanding the Solr structure
        2. Defining the Solr schema
          1. Solr fields
          2. Dynamic fields in Solr
          3. Copying the fields
          4. Dealing with field types
          5. Additional metadata configuration
          6. Other important elements of the Solr schema
        3. Configuration files of Apache Solr
          1. Working with solr.xml and Solr core
          2. Instance configuration with solrconfig.xml
          3. Understanding the Solr plugin
          4. Other configuration
      4. Loading data in Apache Solr
        1. Extracting request handler – Solr Cell
        2. Understanding data import handlers
        3. Interacting with Solr through SolrJ
        4. Working with rich documents (Apache Tika)
      5. Querying for information in Solr
      6. Summary
    10. 3. Enabling Distributed Search using Apache Solr
      1. Understanding a distributed search
        1. Distributed search patterns
        2. Apache Solr and distributed search
      2. Working with SolrCloud
        1. Why ZooKeeper?
        2. The SolrCloud architecture
        3. Building an enterprise distributed search using SolrCloud
          1. Setting up SolrCloud for development
          2. Setting up SolrCloud for production
          3. Adding a document to SolrCloud
          4. Creating shards, collections, and replicas in SolrCloud
        4. Common problems and resolutions
      3. Sharding algorithm and fault tolerance
        1. Document Routing and Sharding
        2. Shard splitting
        3. Load balancing and fault tolerance in SolrCloud
      4. Apache Solr and Big Data – integration with MongoDB
        1. What is NoSQL and how is it related to Big Data?
        2. MongoDB at glance
        3. Installing MongoDB
        4. Creating Solr indexes from MongoDB
      5. Summary
    11. 4. Big Data Search Using Hadoop and Its Ecosystem
      1. Understanding NoSQL
      2. Working with the Solr HDFS connector
      3. Big data search using Katta
        1. How Katta works?
        2. Setting up the Katta cluster
        3. Creating Katta indexes
      4. Using Solr 1045 Patch – map-side indexing
      5. Using Solr 1301 Patch – reduce-side indexing
      6. Distributed search using Apache Blur
        1. Setting up Apache Blur with Hadoop
      7. Apache Solr and Cassandra
        1. Working with Cassandra and Solr
          1. Single node configuration
          2. Integrating with multinode Cassandra
      8. Scaling Solr through Storm
        1. Getting along with Apache Storm
      9. Advanced analytics with Solr
        1. Integrating Solr and R
      10. Summary
    12. 5. Scaling Search Performance
      1. Understanding the limits
      2. Optimizing search schema
        1. Specifying default search field
        2. Configuring search schema fields
        3. Stop words
        4. Stemming
      3. Index optimization
        1. Limiting indexing buffer size
        2. When to commit changes?
        3. Optimizing index merge
          1. Optimize option for index merging
        4. Optimizing the container
        5. Optimizing concurrent clients
        6. Optimizing Java virtual memory
      4. Optimizing search runtime
        1. Optimizing through search query
          1. Filter queries
        2. Optimizing the Solr cache
          1. The filter cache
          2. The query result cache
          3. The document cache
          4. The field value cache
          5. The lazy field loading
        3. Optimizing Hadoop
      5. Monitoring Solr instance
        1. Using SolrMeter
      6. Summary
    13. A. Use Cases for Big Data Search
      1. E-Commerce websites
      2. Log management for banking
        1. The problem
        2. How can it be tackled?
        3. High-level design
    14. Index