Learning Storm

Book Description

Create real-time stream processing applications with Apache Storm

In Detail

Starting with the very basics of Storm, you will learn how to set up Storm on a single machine and move on to deploying Storm on your cluster. You will understand how Kafka can be integrated with Storm using the Kafka spout.

You will then explore Trident, Storm's high-level abstraction, to perform stateful stream processing with support for exactly-once message processing. You will move on to integrating Hadoop with Storm. Next, you will learn how to integrate Storm with other well-known Big Data technologies such as HBase, Redis, and Kafka to realize the full potential of Storm.

Finally, you will work through in-depth case studies on Apache log processing and machine learning with Storm, and through these case studies, you will discover the range of problems Storm can address.

What You Will Learn

  • Learn the core concepts of Apache Storm and real-time processing
  • Deploy Storm in local and clustered modes
  • Design and develop Storm topologies to solve real-world problems
  • Read data from external sources such as Apache Kafka for processing in Storm and store the output in HBase and Redis
  • Create Trident topologies to support various message-processing semantics
  • Monitor the health of a Storm cluster

Table of Contents

  1. Learning Storm
    1. Table of Contents
    2. Learning Storm
    3. Credits
    4. About the Authors
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Setting Up Storm on a Single Machine
      1. Features of Storm
      2. Storm components
        1. Nimbus
        2. Supervisor nodes
        3. The ZooKeeper cluster
      3. The Storm data model
        1. Definition of a Storm topology
        2. Operation modes
        3. Setting up your development environment
          1. Installing Java SDK 6
          2. Installing Maven
          3. Installing Git – distributed version control
          4. Installing the STS IDE
        4. Developing a sample topology
        5. Setting up ZooKeeper
        6. Setting up Storm on a single development machine
        7. Deploying the sample topology on a single-node cluster
      4. Summary
    9. 2. Setting Up a Storm Cluster
      1. Setting up a distributed Storm cluster
      2. Deploying a topology on a remote Storm cluster
        1. Deploying the sample topology on the remote cluster
      3. Configuring the parallelism of a topology
        1. The worker process
        2. The executor
        3. Tasks
        4. Configuring parallelism at the code level
        5. Distributing worker processes, executors, and tasks in the sample topology
      4. Rebalancing the parallelism of a topology
        1. Rebalancing the parallelism of the sample topology
      5. Stream grouping
        1. Shuffle grouping
        2. Fields grouping
        3. All grouping
        4. Global grouping
        5. Direct grouping
        6. Local or shuffle grouping
        7. Custom grouping
      6. Guaranteed message processing
      7. Summary
    10. 3. Monitoring the Storm Cluster
      1. Starting to use the Storm UI
      2. Monitoring a topology using the Storm UI
      3. Cluster statistics using the Nimbus thrift client
        1. Fetching information with the Nimbus thrift client
      4. Summary
    11. 4. Storm and Kafka Integration
      1. The Kafka architecture
        1. The producer
        2. Replication
        3. Consumers
        4. Brokers
        5. Data retention
      2. Setting up Kafka
        1. Setting up a single-node Kafka cluster
        2. Setting up a three-node Kafka cluster
          1. Running multiple Kafka brokers on a single node
      3. A sample Kafka producer
      4. Integrating Kafka with Storm
      5. Summary
    12. 5. Exploring High-level Abstraction in Storm with Trident
      1. Introducing Trident
      2. Understanding Trident's data model
      3. Writing Trident functions, filters, and projections
        1. Trident functions
        2. Trident filters
        3. Trident projections
      4. Trident repartitioning operations
        1. The shuffle operation
        2. The partitionBy operation
        3. The global operation
        4. The broadcast operation
        5. The batchGlobal operation
        6. The partition operation
      5. Trident aggregators
        1. The partition aggregate
        2. The aggregate
          1. The ReducerAggregator interface
          2. The Aggregator interface
          3. The CombinerAggregator interface
        3. The persistent aggregate
        4. Aggregator chaining
      6. Utilizing the groupBy operation
      7. A non-transactional topology
      8. A sample Trident topology
      9. Maintaining the topology state with Trident
      10. A transactional topology
      11. The opaque transactional topology
      12. Distributed RPC
      13. When to use Trident
      14. Summary
    13. 6. Integration of Storm with Batch Processing Tools
      1. Exploring Apache Hadoop
        1. Understanding HDFS
        2. Understanding YARN
      2. Installing Apache Hadoop
        1. Setting up password-less SSH
        2. Getting the Hadoop bundle and setting up environment variables
        3. Setting up HDFS
        4. Setting up YARN
      3. Integration of Storm with Hadoop
        1. Setting up Storm-YARN
      4. Deploying Storm-Starter topologies on Storm-YARN
      5. Summary
    14. 7. Integrating Storm with JMX, Ganglia, HBase, and Redis
      1. Monitoring the Storm cluster using JMX
      2. Monitoring the Storm cluster using Ganglia
      3. Integrating Storm with HBase
      4. Integrating Storm with Redis
      5. Summary
    15. 8. Log Processing with Storm
      1. Server log-processing elements
      2. Producing the Apache log in Kafka
      3. Splitting the server log line
      4. Identifying the country, the operating system type, and the browser type from the logfile
      5. Extracting the searched keyword
      6. Persisting the process data
      7. Defining a topology and the Kafka spout
      8. Deploying a topology
      9. MySQL queries
        1. Calculating the page hits from each country
        2. Calculating the count for each browser
        3. Calculating the count for each operating system
      10. Summary
    16. 9. Machine Learning
      1. Exploring machine learning
      2. Using Trident-ML
      3. The use case – clustering synthetic control data
      4. Producing a training dataset into Kafka
      5. Building a Trident topology to build the clustering model
      6. Summary
    17. Index

Product Information

  • Title: Learning Storm
  • Author(s): Ankit Jain, Anand Nalya
  • Release date: August 2014
  • Publisher(s): Packt Publishing
  • ISBN: 9781783981328