Mastering Apache Storm

Book Description

Master the intricacies of Apache Storm and develop real-time stream processing applications with ease

About This Book

  • Exploit the various real-time processing functionalities offered by Apache Storm such as parallelism, data partitioning, and more
  • Integrate Storm with other Big Data technologies like Hadoop, HBase, and Apache Kafka
  • An easy-to-understand guide to effortlessly create distributed applications with Storm

Who This Book Is For

If you are a Java developer who wants to enter the world of real-time stream processing with Apache Storm, then this book is for you. No previous experience with Storm is required, as this book starts from the basics. After finishing this book, you will be able to develop Storm applications of moderate complexity.

What You Will Learn

  • Understand the core concepts of Apache Storm and real-time processing
  • Follow the steps to deploy multiple nodes of Storm Cluster
  • Create Trident topologies to support various message-processing semantics
  • Share your cluster effectively using Storm scheduling
  • Integrate Apache Storm with other Big Data technologies such as Hadoop, HBase, Kafka, and more
  • Monitor the health of your Storm cluster

In Detail

Apache Storm is a real-time Big Data processing framework that processes large amounts of data reliably, guaranteeing that every message will be processed. Storm lets you scale your processing as your data grows, making it an excellent platform for solving Big Data problems. This extensive guide takes you from the basics through to the advanced topics of Storm.

The book begins with a detailed introduction to real-time processing and where Storm fits in. You’ll learn how to deploy Storm on clusters by writing a basic Storm Hello World example. Next, we’ll introduce you to Trident, and you’ll get a clear understanding of how to develop and deploy a Trident topology. We cover topics such as monitoring, Storm parallelism, the scheduler, and log processing in an easy-to-understand manner. You will also learn how to integrate Storm with other well-known Big Data technologies such as HBase, Redis, Kafka, and Hadoop to realize the full potential of Storm.

With real-world examples and clear explanations, this book gives you a thorough grounding in Apache Storm. You will be able to use this knowledge to develop efficient, distributed real-time applications that meet your business needs.

Style and approach

This easy-to-follow guide is full of examples and real-world applications to help you get an in-depth understanding of Apache Storm. It covers the basics thoroughly and also delves into intermediate and more advanced concepts of application development with Apache Storm.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Real-Time Processing and Storm Introduction
    1. Apache Storm
    2. Features of Storm
    3. Storm components
      1. Nimbus
      2. Supervisor nodes
      3. The ZooKeeper cluster
    4. The Storm data model
      1. Definition of a Storm topology
      2. Operation modes in Storm
    5. Programming languages
    6. Summary
  3. Storm Deployment, Topology Development, and Topology Options
    1. Storm prerequisites
      1. Installing Java SDK 7
      2. Deployment of the ZooKeeper cluster
    2. Setting up the Storm cluster
    3. Developing the hello world example
    4. The different options of the Storm topology
      1. Deactivate
      2. Activate
      3. Rebalance
      4. Kill
      5. Dynamic log level settings
    5. Walkthrough of the Storm UI
      1. Cluster Summary section
      2. Nimbus Summary section
      3. Supervisor Summary section
      4. Nimbus Configuration section
      5. Topology Summary section
    6. Dynamic log level settings
      1. Updating the log level from the Storm UI
      2. Updating the log level from the Storm CLI
    7. Summary
  4. Storm Parallelism and Data Partitioning
    1. Parallelism of a topology
      1. Worker process
      2. Executor
      3. Task
      4. Configure parallelism at the code level
      5. Worker process, executor, and task distribution
    2. Rebalance the parallelism of a topology
      1. Rebalance the parallelism of a SampleStormClusterTopology topology
    3. Different types of stream grouping in the Storm cluster
      1. Shuffle grouping
      2. Field grouping
      3. All grouping
      4. Global grouping
      5. Direct grouping
      6. Local or shuffle grouping
      7. None grouping
      8. Custom grouping
    4. Guaranteed message processing
    5. Tick tuple
    6. Summary
  5. Trident Introduction
    1. Trident introduction
    2. Understanding Trident's data model
    3. Writing Trident functions, filters, and projections
      1. Trident function
      2. Trident filter
      3. Trident projection
    4. Trident repartitioning operations
      1. Utilizing shuffle operation
      2. Utilizing partitionBy operation
      3. Utilizing global operation
      4. Utilizing broadcast operation
      5. Utilizing batchGlobal operation
      6. Utilizing partition operation
    5. Trident aggregator
      1. partitionAggregate
      2. aggregate
        1. ReducerAggregator
        2. Aggregator
        3. CombinerAggregator
      3. persistentAggregate
      4. Aggregator chaining
    6. Utilizing the groupBy operation
    7. When to use Trident
    8. Summary
  6. Trident Topology and Uses
    1. Trident groupBy operation
      1. groupBy before partitionAggregate
      2. groupBy before aggregate
    2. Non-transactional topology
    3. Trident hello world topology
    4. Trident state
    5. Distributed RPC
    6. When to use Trident
    7. Summary
  7. Storm Scheduler
    1. Introduction to Storm scheduler
    2. Default scheduler
    3. Isolation scheduler
    4. Resource-aware scheduler
      1. Component-level configuration
      2. Memory usage example
      3. CPU usage example
      4. Worker-level configuration
      5. Node-level configuration
      6. Global component configuration
    5. Custom scheduler
      1. Configuration changes in the supervisor node
      2. Configuration setting at component level
      3. Writing a custom supervisor class
      4. Converting component IDs to executors
      5. Converting supervisors to slots
      6. Registering a CustomScheduler class
    6. Summary
  8. Monitoring of Storm Cluster
    1. Cluster statistics using the Nimbus thrift client
      1. Fetching information with Nimbus thrift
    2. Monitoring the Storm cluster using JMX
    3. Monitoring the Storm cluster using Ganglia
    4. Summary
  9. Integration of Storm and Kafka
    1. Introduction to Kafka
    2. Kafka architecture
      1. Producer
      2. Replication
      3. Consumer
      4. Broker
      5. Data retention
    3. Installation of Kafka brokers
      1. Setting up a single node Kafka cluster
      2. Setting up a three node Kafka cluster
        1. Multiple Kafka brokers on a single node
    4. Share ZooKeeper between Storm and Kafka
    5. Kafka producers and publishing data into Kafka
    6. Kafka Storm integration
    7. Deploy the Kafka topology on Storm cluster
    8. Summary
  10. Storm and Hadoop Integration
    1. Introduction to Hadoop
      1. Hadoop Common
      2. Hadoop Distributed File System
        1. Namenode
        2. Datanode
        3. HDFS client
        4. Secondary namenode
      3. YARN
        1. ResourceManager (RM)
        2. NodeManager (NM)
        3. ApplicationMaster (AM)
    2. Installation of Hadoop
      1. Setting passwordless SSH
      2. Getting the Hadoop bundle and setting up environment variables
      3. Setting up HDFS
      4. Setting up YARN
    3. Write Storm topology to persist data into HDFS
    4. Integration of Storm with Hadoop
    5. Setting up Storm-YARN
    6. Storm-Starter topologies on Storm-YARN
    7. Summary
  11. Storm Integration with Redis, Elasticsearch, and HBase
    1. Integrating Storm with HBase
    2. Integrating Storm with Redis
    3. Integrating Storm with Elasticsearch
    4. Integrating Storm with Esper
    5. Summary
  12. Apache Log Processing with Storm
    1. Apache log processing elements
    2. Producing Apache log in Kafka using Logstash
      1. Installation of Logstash
        1. What is Logstash?
        2. Why are we using Logstash?
        3. Installation of Logstash
        4. Configuration of Logstash
      2. Why are we using Kafka between Logstash and Storm?
    3. Splitting the Apache log line
    4. Identifying country, operating system type, and browser type from the log file
    5. Calculate the search keyword
    6. Persisting the process data
    7. Kafka spout and define topology
    8. Deploy topology
    9. MySQL queries
      1. Calculate the page hit from each country
      2. Calculate the count for each browser
      3. Calculate the count for each operating system
    10. Summary
  13. Twitter Tweet Collection and Machine Learning
    1. Exploring machine learning
    2. Twitter sentiment analysis
      1. Using Kafka producer to store the tweets in a Kafka cluster
    3. Kafka spout, sentiments bolt, and HDFS bolt
    4. Summary