Practical Real-time Data Processing and Analytics

Book description

A practical guide to help you tackle different real-time data processing and analytics problems using the best tools for each scenario

About This Book

  • Learn about the various challenges in real-time data processing and use the right tools to overcome them
  • This book covers popular tools and frameworks such as Apache Spark, Apache Flink, and Apache Storm for solving distributed processing problems
  • A practical guide filled with examples, tips, and tricks to help you perform efficient Big Data processing in real-time

Who This Book Is For

If you are a Java developer who would like to be equipped with all the tools required to devise an end-to-end practical solution on real-time data streaming, then this book is for you. Basic knowledge of real-time processing would be helpful, and knowing the fundamentals of Maven, Shell, and Eclipse would be great.

What You Will Learn

  • Get an introduction to the established real-time stack
  • Understand the key integration of all the components
  • Get a thorough understanding of the basic building blocks for designing real-time solutions
  • Add search and visualization capabilities to your real-time solution
  • Get conceptually and practically acquainted with real-time analytics
  • Be well equipped to apply the knowledge and create your own solutions

In Detail

With the rise of Big Data, there is an increasing need to process large amounts of data continuously, with a shorter turnaround time. Real-time data processing involves continuous input, processing and output of data, with the condition that the time required for processing is as short as possible.

This book covers the majority of the existing and evolving open source technology stack for real-time processing and analytics. You will get to know about all the real-time solution aspects, from the source to the presentation to persistence. Through this practical book, you'll be equipped with a clear understanding of how to solve challenges on your own.

We'll cover topics such as how to set up components, basic executions, integrations, advanced use cases, alerts, and monitoring. You'll be exposed to the popular tools used in real-time processing today, such as Apache Spark, Apache Flink, and Apache Storm. Finally, you will put your knowledge to use by implementing the techniques in a real-world use case.

By the end of this book, you will have a solid understanding of all the aspects of real-time data processing and analytics, and will know how to deploy the solutions in production environments in the best possible manner.

Style and Approach

In this practical guide to real-time analytics, each chapter begins with a high-level introduction to the topic, followed by a hands-on implementation that shows the concept working in practice. The book is written in a DIY style, with plenty of practical use cases, well-explained code examples, and relevant screenshots and diagrams.

Table of contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Errata
      3. Piracy
      4. Questions
  2. Introducing Real-Time Analytics
    1. What is big data?
    2. Big data infrastructure
    3. Real-time analytics – the myth and the reality
    4. Near real-time solution – an architecture that works
      1. NRT – The Storm solution
      2. NRT – The Spark solution
    5. Lambda architecture – analytics possibilities
    6. IoT – thoughts and possibilities
      1. Edge analytics
    7. Cloud – considerations for NRT and IoT
    8. Summary
  3. Real Time Applications – The Basic Ingredients
    1. The NRT system and its building blocks
      1. Data collection
      2. Stream processing
      3. Analytical layer – serve it to the end user
    2. NRT – high-level system view
    3. NRT – technology view
      1. Event producer
      2. Collection
      3. Broker
      4. Transformation and processing
      5. Storage
    4. Summary
  4. Understanding and Tailing Data Streams
    1. Understanding data streams
    2. Setting up infrastructure for data ingestion
      1. Apache Kafka
      2. Apache NiFi
      3. Logstash
      4. Fluentd
      5. Flume
    3. Tapping data from source to the processor - expectations and caveats
    4. Comparing and choosing what works best for your use case
    5. Do it yourself
      1. Setting up Elasticsearch
    6. Summary
  5. Setting up the Infrastructure for Storm
    1. Overview of Storm
    2. Storm architecture and its components
      1. Characteristics
      2. Components
      3. Stream grouping
    3. Setting up and configuring Storm
      1. Setting up Zookeeper
        1. Installing
        2. Configuring
          1. Standalone
          2. Cluster
        3. Running
      2. Setting up Apache Storm
        1. Installing
        2. Configuring
        3. Running
    4. Real-time processing job on Storm
      1. Running job
        1. Local
        2. Cluster
    5. Summary
  6. Configuring Apache Spark and Flink
    1. Setting up and a quick execution of Spark
      1. Building from source
      2. Downloading Spark
      3. Running an example
    2. Setting up and a quick execution of Flink
      1. Build Flink source
      2. Download Flink
      3. Running example
    3. Setting up and a quick execution of Apache Beam
      1. Beam model
      2. Running example
      3. MinimalWordCount example walk through
    4. Balancing in Apache Beam
    5. Summary
  7. Integrating Storm with a Data Source
    1. RabbitMQ – messaging that works
    2. RabbitMQ exchanges
      1. Direct exchanges
        1. Fanout exchanges
        2. Topic exchanges
        3. Headers exchanges
      2. RabbitMQ setup
      3. RabbitMQ – publish and subscribe
    3. RabbitMQ – integration with Storm
      1. AMQPSpout
    4. PubNub data stream publisher
    5. String together Storm-RMQ-PubNub sensor data topology
    6. Summary
  8. From Storm to Sink
    1. Setting up and configuring Cassandra
      1. Setting up Cassandra
      2. Configuring Cassandra
    2. Storm and Cassandra topology
    3. Storm and IMDB integration for dimensional data
    4. Integrating the presentation layer with Storm
      1. Setting up Grafana with the Elasticsearch plugin
        1. Downloading Grafana
        2. Configuring Grafana
        3. Installing the Elasticsearch plugin in Grafana
        4. Running Grafana
        5. Adding the Elasticsearch datasource in Grafana
        6. Writing code
        7. Executing code
        8. Visualizing the output on Grafana
    5. Do It Yourself
    6. Summary
  9. Storm Trident
    1. State retention and the need for Trident
      1. Transactional spout
      2. Opaque transactional spout
    2. Basic Storm Trident topology
    3. Trident internals
    4. Trident operations
      1. Functions
      2. map and flatMap
      3. peek
      4. Filters
      5. Windowing
        1. Tumbling window
        2. Sliding window
      6. Aggregation
        1. Aggregate
        2. Partition aggregate
        3. Persistent aggregate
          1. Combiner aggregator
          2. Reducer aggregator
        4. Aggregator
      7. Grouping
      8. Merge and joins
    5. DRPC
    6. Do It Yourself
    7. Summary
  10. Working with Spark
    1. Spark overview
      1. Spark framework and schedulers
    2. Distinct advantages of Spark
      1. When to avoid using Spark
    3. Spark – use cases
    4. Spark architecture - working inside the engine
    5. Spark pragmatic concepts
      1. RDD – the name says it all
    6. Spark 2.x – advent of data frames and datasets
    7. Summary
  11. Working with Spark Operations
    1. Spark – packaging and API
    2. RDD pragmatic exploration
      1. Transformations
      2. Actions
    3. Shared variables – broadcast variables and accumulators
      1. Broadcast variables
      2. Accumulators
    4. Summary
  12. Spark Streaming
    1. Spark Streaming concepts
    2. Spark Streaming - introduction and architecture
    3. Packaging structure of Spark Streaming
      1. Spark Streaming APIs
      2. Spark Streaming operations
    4. Connecting Kafka to Spark Streaming
    5. Summary
  13. Working with Apache Flink
    1. Flink architecture and execution engine
    2. Flink basic components and processes
    3. Integration of source stream to Flink
      1. Integration with Apache Kafka
        1. Example
      2. Integration with RabbitMQ
        1. Running example
    4. Flink processing and computation
      1. DataStream API
      2. DataSet API
    5. Flink persistence
      1. Integration with Cassandra
        1. Running example
    6. FlinkCEP
    7. Pattern API
      1. Detecting pattern
      2. Selecting from patterns
      3. Example
    8. Gelly
      1. Gelly API
        1. Graph representation
        2. Graph creation
        3. Graph transformations
    9. DIY
    10. Summary
  14. Case Study
    1. Introduction
    2. Data modeling
    3. Tools and frameworks
    4. Setting up the infrastructure
    5. Implementing the case study
      1. Building the data simulator
      2. Hazelcast loader
      3. Building Storm topology
        1. Parser bolt
        2. Check distance and alert bolt
        3. Generate alert bolt
        4. Elasticsearch bolt
        5. Complete topology
    6. Running the case study
      1. Load Hazelcast
        1. Generate Vehicle static value
        2. Deploy topology
        3. Start simulator
        4. Visualization using Kibana
    7. Summary

Product information

  • Title: Practical Real-time Data Processing and Analytics
  • Author(s): Shilpi Saxena, Saurabh Gupta
  • Release date: September 2017
  • Publisher(s): Packt Publishing
  • ISBN: 9781787281202