Elasticsearch for Hadoop

Book Description

Integrate Elasticsearch into Hadoop to effectively visualize and analyze your data

About This Book

  • Build production-ready analytics applications by integrating the Hadoop ecosystem with Elasticsearch
  • Learn complex Elasticsearch queries and develop real-time monitoring Kibana dashboards to visualize your data
  • Use Elasticsearch and Kibana to search data in Hadoop easily with this comprehensive, step-by-step guide

Who This Book Is For

This book is targeted at Java developers with basic knowledge of Hadoop. No prior Elasticsearch experience is expected.

What You Will Learn

  • Set up the Elasticsearch-Hadoop environment
  • Import HDFS data into Elasticsearch with MapReduce jobs
  • Perform full-text search and aggregations efficiently using Elasticsearch
  • Visualize data and create interactive dashboards using Kibana
  • Check and detect anomalies in streaming data using Storm and Elasticsearch
  • Inject and classify real-time streaming data into Elasticsearch
  • Get production-ready for Elasticsearch-Hadoop based projects
  • Integrate with Hadoop ecosystem tools such as Pig, Storm, Hive, and Spark

In Detail

The Hadoop ecosystem is a de facto standard for processing terabytes and petabytes of data. Lucene-based Elasticsearch is becoming an industry standard for its full-text search and aggregation capabilities. Elasticsearch-Hadoop bridges Elasticsearch and the Hadoop ecosystem so that you get the best of both worlds. Combined with Kibana, this stack makes it easy to quickly draw insights from the massive amounts of data in your Hadoop ecosystem.

In this book, you'll learn to use Elasticsearch, Kibana and Elasticsearch-Hadoop effectively to analyze and understand your HDFS and streaming data.

You begin with an in-depth look at setting up Hadoop, Elasticsearch, Marvel, and Kibana. Right after this, you will learn to import Hadoop data into Elasticsearch by writing a MapReduce job in a real-world example. This is followed by a comprehensive look at Elasticsearch essentials, such as full-text search analysis, queries, filters, and aggregations, after which you will learn to create various visualizations and interactive dashboards using Kibana. Classifying your real-world streaming data and identifying trends in it using Storm and Elasticsearch are some of the other topics covered. You will also gain insight into the key concepts of Elasticsearch and Elasticsearch-Hadoop in distributed mode, and into advanced configurations, along with some common configuration presets you may need for your production deployments. You will get a go-to-production checklist and a high-level view of cluster administration for post-production. Towards the end, you will learn to integrate Elasticsearch with other Hadoop ecosystem tools, such as Pig, Hive, and Spark.

Style and approach

A concise yet comprehensive approach has been adopted with real-time examples to help you grasp the concepts easily.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Elasticsearch for Hadoop
    1. Table of Contents
    2. Elasticsearch for Hadoop
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    8. 1. Setting Up Environment
      1. Setting up Hadoop for Elasticsearch
        1. Setting up Java
        2. Setting up a dedicated user
        3. Installing SSH and setting up the certificate
        4. Downloading Hadoop
        5. Setting up environment variables
        6. Configuring Hadoop
          1. Configuring core-site.xml
          2. Configuring hdfs-site.xml
          3. Configuring yarn-site.xml
          4. Configuring mapred-site.xml
          5. The format distributed filesystem
        7. Starting Hadoop daemons
      2. Setting up Elasticsearch
        1. Downloading Elasticsearch
        2. Configuring Elasticsearch
        3. Installing Elasticsearch's Head plugin
        4. Installing the Marvel plugin
        5. Running and testing
      3. Running the WordCount example
        1. Getting the examples and building the job JAR file
        2. Importing the test file to HDFS
        3. Running our first job
      4. Exploring data in Head and Marvel
        1. Viewing data in Head
        2. Using the Marvel dashboard
          1. Exploring the data in Sense
      5. Summary
    9. 2. Getting Started with ES-Hadoop
      1. Understanding the WordCount program
        1. Understanding Mapper
        2. Understanding the reducer
        3. Understanding the driver
        4. Using the old API – org.apache.hadoop.mapred
      2. Going real — network monitoring data
        1. Getting and understanding the data
        2. Knowing the problems
        3. Solution approaches
          1. Approach 1 – Preaggregate the results
          2. Approach 2 – Aggregate the results at query-time
      3. Writing the NetworkLogsMapper job
        1. Writing the mapper class
        2. Writing Driver
        3. Building the job
        4. Getting the data into HDFS
        5. Running the job
        6. Viewing the Top N results
      4. Getting data from Elasticsearch to HDFS
        1. Understanding the Twitter dataset
          1. Trying it yourself
        2. Creating the MapReduce job to import data from Elasticsearch to HDFS
          1. Writing the Tweets2Hdfs mapper
          2. Running the example
          3. Testing the job execution output
      5. Summary
    10. 3. Understanding Elasticsearch
      1. Knowing Search and Elasticsearch
        1. The paradigm mismatch
          1. Index
          2. Type
          3. Document
          4. Field
      2. Talking to Elasticsearch
        1. CRUD with Elasticsearch
          1. Creating the document request
            1. The GET request
            2. The Update request
            3. The Delete request
            4. Creating the index
        2. Mappings
          1. Data types
          2. Create mapping API
          3. Index templates
      3. Controlling the indexing process
        1. What is an inverted index?
        2. The input data analysis
          1. Removing stop words
          2. Case insensitive
          3. Stemming
          4. Synonyms
          5. Analyzers
      4. Elastic searching
        1. Writing search queries
          1. The URI search
          2. Matching all queries
          3. The term query
          4. The boolean query
          5. The match query
          6. The range query
          7. The wildcard query
          8. Filters
            1. The exists filter
            2. The geo distance filter
      5. Aggregations
        1. Executing the aggregation queries
          1. The terms aggregation
          2. Histograms
          3. The range aggregation
          4. The geo distance
        2. Sub-aggregations
          1. Try it yourself
      6. Summary
    11. 4. Visualizing Big Data Using Kibana
      1. Setting up and getting started
        1. Setting up Kibana
        2. Setting up datasets
          1. Try it out
        3. Getting started with Kibana
      2. Discovering data
        1. Visualizing the data
          1. The pie chart
          2. The stacked bar chart
          3. The date histogram with the stacked bar chart
          4. The area chart
          5. The split pie chart
          6. The sun burst chart
          7. The geographical chart
          8. Trying it out
        2. Creating dynamic dashboards
          1. Migrating the dashboards
      3. Summary
    12. 5. Real-Time Analytics
      1. Getting started with the Twitter Trend Analyser
        1. What are we trying to do?
        2. Setting up Apache Storm
      2. Injecting streaming data into Storm
        1. Writing a Storm spout
        2. Writing Storm bolts
        3. Creating a Storm topology
        4. Building and running a Storm job
      3. Analyzing trends
        1. Significant terms aggregation
        2. Viewing trends in Kibana
      4. Classifying tweets using percolators
        1. Percolator
        2. Building a percolator query effectively
        3. Classifying tweets
      5. Summary
    13. 6. ES-Hadoop in Production
      1. Elasticsearch in a distributed environment
        1. Elasticsearch clusters and nodes
          1. Node types
            1. The master node
            2. The data node
            3. The client node
            4. Tribe nodes
          2. Node discovery
            1. Multicast discovery
            2. Unicast discovery
        2. Data inside clusters
          1. Shards
          2. Replicas
          3. Shard allocation
      2. The ES-Hadoop architecture
        1. Dynamic parallelism
          1. Writing to Elasticsearch
          2. Reads from Elasticsearch
          3. Failure handling
        2. Data colocation
      3. Configuring the environment for production
        1. Hardware
          1. Memory
          2. CPU
          3. Disks
          4. Network
        2. Setting up the cluster
          1. The recommended cluster topology
          2. Set names
          3. Paths
          4. Memory configurations
          5. The split-brain problem
          6. Recovery configurations
        3. Configuration presets
          1. Rapid indexing
          2. Lightening a full text search
          3. Faster aggregations
        4. Bonus – the production deployment checklist
      4. Administration of clusters
        1. Monitoring the cluster health
        2. Snapshot and restore
          1. Backing up your data
          2. Restoring your data
      5. Summary
    14. 7. Integrating with the Hadoop Ecosystem
      1. Pigging out Elasticsearch
        1. Setting up Apache Pig for Elasticsearch
        2. Importing data to Elasticsearch
          1. Writing from the JSON source
          2. Type conversions
        3. Reading data from Elasticsearch
      2. SQLizing Elasticsearch with Hive
        1. Setting up Apache Hive
        2. Importing data to Elasticsearch
          1. Writing from the JSON source
          2. Type conversions
        3. Reading data from Elasticsearch
      3. Cascading with Elasticsearch
        1. Importing data to Elasticsearch
          1. Writing a cascading job
          2. Running the job
        2. Reading data from Elasticsearch
          1. Writing a reader job
        3. Using Lingual with Elasticsearch
      4. Giving Spark to Elasticsearch
        1. Setting up Spark
        2. Importing data to Elasticsearch
          1. Using SparkSQL
        3. Reading data from Elasticsearch
          1. Using SparkSQL
      5. ES-Hadoop on YARN
      6. Summary
    15. A. Configurations
      1. Basic configurations
        1. es.resource
        2. es.resource.read
        3. es.resource.write
        4. es.nodes
        5. es.port
      2. Write and query configurations
        1. es.query
        2. es.input.json
        3. es.write.operation
        4. es.update.script
        5. es.update.script.lang
        6. es.update.script.params
        7. es.update.script.params.json
        8. es.batch.size.bytes
        9. es.batch.size.entries
        10. es.batch.write.refresh
        11. es.batch.write.retry.count
        12. es.batch.write.retry.wait
        13. es.ser.reader.value.class
        14. es.ser.writer.value.class
        15. es.update.retry.on.conflict
      3. Mapping configurations
        1. es.mapping.id
        2. es.mapping.parent
        3. es.mapping.version
        4. es.mapping.version.type
        5. es.mapping.routing
        6. es.mapping.ttl
        7. es.mapping.timestamp
        8. es.mapping.date.rich
        9. es.mapping.include
        10. es.mapping.exclude
      4. Index configurations
        1. es.index.auto.create
        2. es.index.read.missing.as.empty
        3. es.field.read.empty.as.null
        4. es.field.read.validate.presence
      5. Network configurations
        1. es.nodes.discovery
        2. es.nodes.client.only
        3. es.http.timeout
        4. es.http.retries
        5. es.scroll.keepalive
        6. es.scroll.size
        7. es.action.heart.beat.lead
      6. Authentication configurations
        1. es.net.http.auth.user
        2. es.net.http.auth.pass
      7. SSL configurations
        1. es.net.ssl
        2. es.net.ssl.keystore.location
        3. es.net.ssl.keystore.pass
        4. es.net.ssl.keystore.type
        5. es.net.ssl.truststore.location
        6. es.net.ssl.truststore.pass
        7. es.net.ssl.cert.allow.self.signed
        8. es.net.ssl.protocol
        9. es.scroll.size
      8. Proxy configurations
        1. es.net.proxy.http.host
        2. es.net.proxy.http.port
        3. es.net.proxy.http.user
        4. es.net.proxy.http.pass
        5. es.net.proxy.http.use.system.props
        6. es.net.proxy.socks.host
        7. es.net.proxy.socks.port
        8. es.net.proxy.socks.user
        9. es.net.proxy.socks.pass
        10. es.net.proxy.socks.use.system.props
    16. Index