Real-time Analytics with Storm and Cassandra

Book Description

Solve real-time analytics problems effectively using Storm and Cassandra

In Detail

This book teaches you how to use Storm for real-time data processing and how to use Cassandra to keep your applications highly available.

The book starts off with the basics of Storm and its components along with setting up the environment for the execution of a Storm topology in local and distributed mode. Moving on, you will explore the Storm and Zookeeper configurations, understand the Storm UI, set up Storm clusters, and monitor Storm clusters using various tools. You will then add NoSQL persistence to Storm and set up a Cassandra cluster. You will do all this while being guided by the best practices for Storm and Cassandra applications. Next, you will learn about data partitioning and consistent hashing in Cassandra through examples and also see high availability features and replication in Cassandra. Finally, you'll learn about different methods that you can use to manage and maintain Cassandra and Storm.

What You Will Learn

  • Integrate Storm applications with RabbitMQ for real-time analysis and processing of messages
  • Monitor highly distributed applications using Nagios
  • Integrate the Cassandra data store with Storm
  • Develop and maintain distributed Storm applications in conjunction with Cassandra and an in-memory cache (memcached)
  • Build a Trident topology that enables real-time computing with Storm
  • Tune performance for Storm topologies based on the SLA and requirements of the application
  • Use Esper with the Storm framework for rapid development of applications

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Real-time Analytics with Storm and Cassandra
    1. Table of Contents
    2. Real-time Analytics with Storm and Cassandra
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Let's Understand Storm
      1. Distributed computing problems
        1. Real-time business solution for credit or debit card fraud detection
        2. Aircraft Communications Addressing and Reporting system
        3. Healthcare
        4. Other applications
      2. Solutions for complex distributed use cases
        1. The Hadoop solution
        2. A custom solution
        3. Licensed proprietary solutions
        4. Other real-time processing tools
      3. A high-level view of various components of Storm
      4. Delving into the internals of Storm
      5. Quiz time
      6. Summary
    9. 2. Getting Started with Your First Topology
      1. Prerequisites for setting up Storm
      2. Components of a Storm topology
        1. Spouts
        2. Bolts
        3. Streams
        4. Tuples – the data model in Storm
      3. Executing a sample Storm topology – local mode
        1. WordCount topology from the Storm-starter project
      4. Executing the topology in the distributed mode
        1. Set up Zookeeper (V 3.3.5) for Storm
        2. Setting up Storm in the distributed mode
        3. Launching Storm daemons
      5. Executing the topology from Command Prompt
        1. Tweaking the WordCount topology to customize it
      6. Quiz time
      7. Summary
    10. 3. Understanding Storm Internals by Examples
      1. Customizing Storm spouts
        1. Creating FileSpout
          1. Tweaking WordCount topology to use FileSpout
          2. The SocketSpout class
      2. Anchoring and acking
        1. The unreliable topology
      3. Stream groupings
        1. Local or shuffle grouping
        2. Fields grouping
        3. All grouping
        4. Global grouping
        5. Custom grouping
        6. Direct grouping
      4. Quiz time
      5. Summary
    11. 4. Storm in a Clustered Mode
      1. The Storm cluster setup
      2. Zookeeper configurations
        1. Cleaning up Zookeeper
      3. Storm configurations
        1. Storm logging configurations
        2. The Storm UI
          1. Section 1
          2. Section 2
          3. Section 3
          4. Section 4
          5. The visualization section
      4. Storm monitoring tools
      5. Quiz time
      6. Summary
    12. 5. Storm High Availability and Failover
      1. An overview of RabbitMQ
      2. Installing the RabbitMQ cluster
        1. Prerequisites for the setup of RabbitMQ
        2. Setting up a RabbitMQ server
        3. Testing the RabbitMQ server
          1. Creating a RabbitMQ cluster
          2. Enabling the RabbitMQ UI
          3. Creating mirror queues for high availability
      3. Integrating Storm with RabbitMQ
        1. Creating a RabbitMQ feeder component
        2. Wiring the topology for the AMQP spout
      4. Building high availability of components
        1. High availability of the Storm cluster
        2. Guaranteed processing of the Storm cluster
      5. The Storm isolation scheduler
      6. Quiz time
      7. Summary
    13. 6. Adding NoSQL Persistence to Storm
      1. The advantages of Cassandra
      2. Columnar database fundamentals
        1. Types of column families
        2. Types of columns
      3. Setting up the Cassandra cluster
        1. Installing Cassandra
      4. Multiple data centers
        1. Prerequisites for setting up multiple data centers
        2. Installing Cassandra data centers
      5. Introduction to CQLSH
      6. Introduction to CLI
      7. Using different client APIs to access Cassandra
      8. Storm topology wired to the Cassandra store
      9. The best practices for Storm/Cassandra applications
      10. Quiz time
      11. Summary
    14. 7. Cassandra Partitioning, High Availability, and Consistency
      1. Consistent hashing
        1. One or more node goes down
        2. One or more node comes back up
      2. Replication in Cassandra and strategies
      3. Cassandra consistency
        1. Write consistency
        2. Read consistency
        3. Consistency maintenance features
      4. Quiz time
      5. Summary
    15. 8. Cassandra Management and Maintenance
      1. Cassandra – gossip protocol
        1. Bootstrapping
        2. Failure scenario handling – detection and recovery
      2. Cassandra cluster scaling – adding a new node
      3. Cassandra cluster – replacing a dead node
      4. The replication factor
      5. The nodetool commands
      6. Cassandra fault tolerance
      7. Cassandra monitoring systems
        1. JMX monitoring
        2. Datastax OpsCenter
      8. Quiz time
      9. Summary
    16. 9. Storm Management and Maintenance
      1. Scaling the Storm cluster – adding new supervisor nodes
      2. Scaling the Storm cluster and rebalancing the topology
        1. Rebalancing using the GUI
        2. Rebalancing using the CLI
      3. Setting up workers and parallelism to enhance processing
        1. Scenario 1
        2. Scenario 2
        3. Scenario 3
      4. Storm troubleshooting
        1. The Storm UI
        2. Storm logs
      5. Quiz time
      6. Summary
    17. 10. Advance Concepts in Storm
      1. Building a Trident topology
      2. Understanding the Trident API
        1. Local partition manipulation operation
          1. Functions
          2. Filters
          3. partitionAggregate
            1. Sum aggregate
            2. CombinerAggregator
            3. ReducerAggregator
            4. Aggregator
        2. Operations related to stream repartitioning
        3. Data aggregations over the streams
        4. Grouping over a field in a stream
        5. Merge and join
      3. Examples and illustrations
      4. Quiz time
      5. Summary
    18. 11. Distributed Cache and CEP with Storm
      1. The need for distributed caching in Storm
      2. Introduction to memcached
        1. Setting up memcache
        2. Building a topology with a cache
      3. Introduction to the complex event processing engine
        1. Esper
        2. Getting started with Esper
        3. Integrating Esper with Storm
      4. Quiz time
      5. Summary
    19. A. Quiz Answers
      1. Chapter 1
      2. Chapter 2
      3. Chapter 3
      4. Chapter 4
      5. Chapter 5
      6. Chapter 6
      7. Chapter 7
      8. Chapter 8
      9. Chapter 9
      10. Chapter 10
      11. Chapter 11
    20. Index