Fast Data Processing Systems with SMACK Stack

Book description

Combine the incredible powers of Spark, Mesos, Akka, Cassandra, and Kafka to build data processing platforms that can take on even the hardest of your data troubles!

About This Book

  • This highly practical guide shows you how to use the best of the big data technologies to solve your response-critical problems

  • Learn the art of building a cheap-yet-effective big data architecture without using complex Greek-letter architectures

  • Use this easy-to-follow guide to build fast data processing systems for your organization

    Who This Book Is For

    If you are a developer, data architect, or data scientist who wants to learn how to integrate the big data stack architecture and how to choose the right technology for each layer, this book is for you.

    What You Will Learn

  • Design and implement a fast data pipeline architecture

  • Think about and solve programming challenges in a functional way with Scala

  • Learn to use Akka, the actor model implementation for the JVM

  • Perform in-memory processing and data analysis with Spark to meet modern business demands

  • Build a powerful and effective cluster infrastructure with Mesos and Docker

  • Manage and consume unstructured and NoSQL data sources with Cassandra

  • Consume and produce messages at massive scale with Kafka

    In Detail

    SMACK is an open source full stack for big data architecture. It is a combination of Spark, Mesos, Akka, Cassandra, and Kafka. This stack is the newest technique developers have begun to use to tackle critical real-time analytics for big data. This highly practical guide will teach you how to integrate these technologies to create a highly efficient data analysis system for fast data processing.

    We’ll start with an introduction to SMACK and show you when to use it. First, you’ll get to grips with functional thinking and problem solving using Scala. Next, you’ll come to understand the Akka architecture. Then you’ll learn how to improve the data structure architecture and optimize resources using Apache Spark.

    Moving forward, you’ll learn how to achieve linear scalability in databases with Apache Cassandra. You’ll get to grips with high-throughput distributed messaging using Apache Kafka. We’ll show you how to build a cheap but effective cluster infrastructure with Apache Mesos. Finally, you will take a deep dive into the different aspects of SMACK using a few case studies.

    By the end of the book, you will be able to integrate all the components of the SMACK stack and use them together to achieve highly effective and fast data processing.

    Style and approach

    With the help of various industry examples, you will learn about the full stack of big data architecture, covering the important aspects of each technology. You will learn how to integrate the technologies to build effective systems rather than getting incomplete information on single technologies, and how various open source technologies can be used to build cheap and fast data processing systems.

    Table of contents

    1. Fast Data Processing Systems with SMACK Stack
      1. Fast Data Processing Systems with SMACK Stack
      2. Credits
      3. About the Author
      4. About the Reviewers
      5. www.PacktPub.com
        1. Why subscribe?
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      7. 1. An Introduction to SMACK
        1. Modern data-processing challenges
        2. The data-processing pipeline architecture
          1. The NoETL manifesto
          2. Lambda architecture
          3. Hadoop
        3. SMACK technologies
          1. Apache Spark
          2. Akka
          3. Apache Cassandra
          4. Apache Kafka
          5. Apache Mesos
        4. Changing the data center operations
          1. From scale-up to scale-out
          2. The open-source predominance
          3. Data store diversification
          4. Data gravity and data locality
          5. DevOps rules
        5. Data expert profiles
          1. Data architects
          2. Data engineers
          3. Data analysts
          4. Data scientists
        6. Is SMACK for me?
        7. Summary
      8. 2. The Model - Scala and Akka
        1. The language - Scala
          1. Kata 1 - The collections hierarchy
            1. Sequence
            2. Map
            3. Set
          2. Kata 2 - Choosing the right collection
            1. Sequence
            2. Map
            3. Set
          3. Kata 3 - Iterating with foreach
          4. Kata 4 - Iterating with for
          5. Kata 5 - Iterators
          6. Kata 6 - Transforming with map
          7. Kata 7 - Flattening
          8. Kata 8 - Filtering
          9. Kata 9 - Subsequences
          10. Kata 10 - Splitting
          11. Kata 11 - Extracting unique elements
          12. Kata 12 - Merging
          13. Kata 13 - Lazy views
          14. Kata 14 - Sorting
          15. Kata 15 - Streams
          16. Kata 16 - Arrays
          17. Kata 17 - ArrayBuffer
          18. Kata 18 - Queues
          19. Kata 19 - Stacks
          20. Kata 20 - Ranges
        2. The model - Akka
          1. The Actor Model in a nutshell
          2. Kata 21 - Actors
            1. The actor system
            2. Actor reference
          3. Kata 22 - Actor communication
          4. Kata 23 - Actor life cycle
          5. Kata 24 - Starting actors
          6. Kata 25 - Stopping actors
          7. Kata 26 - Killing actors
          8. Kata 27 - Shutting down the actor system
          9. Kata 28 - Actor monitoring
          10. Kata 29 - Looking up actors
        3. Summary
      9. 3. The Engine - Apache Spark
        1. Spark in single mode
          1. Downloading Apache Spark
          2. Testing Apache Spark
        2. Spark core concepts
        3. Resilient distributed datasets
          1. Running Spark applications
          2. Initializing the Spark context
          3. Spark applications
          4. Running programs
          5. RDD operation
            1. Transformations
            2. Actions
          6. Persistence (caching)
        4. Spark in cluster mode
          1. Runtime architecture
            1. Driver
              1. Dividing a program into tasks
              2. Scheduling tasks on executors
            2. Executor
            3. Cluster manager
            4. Program execution
            5. Application deployment
          2. Standalone cluster manager
            1. Launching the standalone manager
            2. Submitting our application
            3. Configuring resources
            4. Working in the cluster
        5. Spark Streaming
          1. Spark Streaming architecture
          2. Transformations
            1. Stateless transformations
            2. Stateful transformations
              1. Windowed operations
              2. Update state by key
          3. Output operations
          4. Fault-tolerant Spark Streaming
            1. Checkpointing
          5. Spark Streaming performance
            1. Parallelism level
          6. Window size and batch size
          7. Garbage collector
        6. Summary
      10. 4. The Storage - Apache Cassandra
        1. A bit of history
        2. NoSQL
          1. NoSQL or SQL?
          2. CAP Brewer's theorem
        3. Apache Cassandra installation
          1. Data model
          2. Data storage
          3. Installation
          4. DataStax OpsCenter
          5. Creating a key space
        4. Authentication and authorization (roles)
          1. Setting up a simple authentication and authorization
        5. Backup
          1. Compression
        6. Recovery
          1. Restart node
          2. Printing schema
          3. Logs
          4. Configuring log4j
          5. Log file rotation
          6. User activity log
          7. Transaction log
          8. SQL dump
          9. CQL
            1. CQL commands
          10. DBMS Cluster
            1. Deleting the database
              1. CLI delete commands
              2. CQL shell delete commands
          11. DB and DBMS optimization
          12. Bloom filter
          13. Data cache
          14. Java heap tune up
          15. Java garbage collection tune up
          16. Views, triggers, and stored procedures
          17. Client-server architecture
            1. Drivers
        7. Spark-Cassandra connector
          1. Installing the connector
          2. Establishing the connection
          3. Using the connector
        8. Summary
      11. 5. The Broker - Apache Kafka
        1. Introducing Kafka
          1. Features of Apache Kafka
          2. Born to be fast data
          3. Use cases
        2. Installation
          1. Installing Java
          2. Installing Kafka
          3. Importing Kafka
        3. Cluster
          1. Single node - single broker cluster
            1. Starting Zookeeper
            2. Starting the broker
            3. Creating a topic
            4. Starting a producer
            5. Starting a consumer
          2. Single node - Multiple broker cluster
            1. Starting the brokers
            2. Creating a topic
            3. Starting a producer
            4. Starting a consumer
          3. Multiple node - multiple broker cluster
          4. Broker properties
        4. Architecture
          1. Segment files
          2. Offset
          3. Leaders
          4. Groups
          5. Log compaction
          6. Kafka design
          7. Message compression
          8. Replication
            1. Asynchronous replication
            2. Synchronous replication
        5. Producers
          1. Producer API
          2. Scala producers
            1. Step 1: Import classes
            2. Step 2: Define properties
            3. Step 3: Build and send the message
            4. Step 4: Create the topic
            5. Step 5: Compile the producer
            6. Step 6: Run the producer
            7. Step 7: Run a consumer
          3. Producers with custom partitioning
            1. Step 1: Import classes
            2. Step 2: Define properties
            3. Step 3: Implement the partitioner class
            4. Step 4: Build and send the message
            5. Step 5: Create the topic
            6. Step 6: Compile the programs
            7. Step 7: Run the producer
            8. Step 8: Run a consumer
          4. Producer properties
        6. Consumers
          1. Consumer API
          2. Simple Scala consumers
            1. Step 1: Import classes
            2. Step 2: Define properties
            3. Step 3: Code the SimpleConsumer
            4. Step 4: Create the topic
            5. Step 5: Compile the program
            6. Step 6: Run the producer
            7. Step 7: Run the consumer
          3. Multithread Scala consumers
            1. Step 1: Import classes
            2. Step 2: Define properties
            3. Step 3: Code the MultiThreadConsumer
            4. Step 4: Create the topic
            5. Step 5: Compile the program
            6. Step 6: Run the producer
            7. Step 7: Run the consumer
              1. Consumer properties
        7. Integration
          1. Integration with Apache Spark
        8. Administration
          1. Cluster tools
          2. Adding servers
          3. Kafka topic tools
          4. Cluster mirroring
        9. Summary
      12. 6. The Manager - Apache Mesos
        1. The Apache Mesos architecture
          1. Frameworks
          2. Existing Mesos frameworks
            1. Frameworks for long running applications
            2. Frameworks for scheduling
            3. Frameworks for storage
          3. Attributes and resources
            1. Attributes
            2. Resources
          4. The Apache Mesos API
            1. Messages
            2. The Executor API
            3. Executor Driver API
            4. The Scheduler API
            5. The Scheduler Driver API
        2. Resource allocation
          1. The DRF algorithm
          2. Weighted DRF algorithm
          3. Resource configuration
          4. Resource reservation
            1. Static reservation
            2. Defining roles
            3. Assigning frameworks to roles
            4. Setting policies
            5. Dynamic reservation
            6. The reserve operation
            7. The unreserve operation
            8. HTTP reserve
            9. HTTP unreserve
        3. Running a Mesos cluster on AWS
          1. AWS instance types
            1. AWS instances launching
          2. Installing Mesos on AWS
          3. Downloading Mesos
          4. Building Mesos
            1. Launching several instances
        4. Running a Mesos cluster on a private data center
          1. Mesos installation
            1. Setting up the environment
            2. Start the master
            3. Start the slaves
            4. Process automation
          2. Common Mesos issues
            1. Missing library dependencies
            2. Directory permissions
            3. Missing library
            4. Debugging
            5. Directory structure
            6. Slaves not connecting with masters
            7. Multiple slaves on the same machine
        5. Scheduling and management frameworks
          1. Marathon
            1. Marathon installation
            2. Installing Apache Zookeeper
            3. Running Marathon in local mode
            4. Multi-node Marathon installation
            5. Running a test application from the web UI
            6. Application scaling
            7. Terminating the application
          2. Chronos
            1. Chronos installation
            2. Job scheduling
          3. Chronos and Marathon
            1. Chronos REST API
              1. Listing running jobs
              2. Starting a job manually
              3. Adding a job
              4. Deleting a job
              5. Deleting all the job tasks
            2. Marathon REST API
              1. Listing the running applications
              2. Adding an application
              3. Changing the application configuration
              4. Deleting the application
        6. Apache Aurora
          1. Installing Aurora
        7. Singularity
          1. Singularity installation
            1. The Singularity configuration file
        8. Apache Spark on Apache Mesos
          1. Submitting jobs in client mode
          2. Submitting jobs in cluster mode
          3. Advanced configuration
        9. Apache Cassandra on Apache Mesos
          1. Advanced configuration
        10. Apache Kafka on Apache Mesos
          1. Kafka log management
        11. Summary
      13. 7. Study Case 1 - Spark and Cassandra
        1. Spark Cassandra connector
          1. Requisites
          2. Preparing Cassandra
          3. SparkContext setup
          4. Cassandra and Spark Streaming
          5. Spark Streaming setup
          6. Cassandra setup
          7. Streaming context creation
          8. Stream creation
            1. Kafka Streams
            2. Akka Streams
            3. Enabling Cassandra
            4. Write the Stream to Cassandra
            5. Read the Stream from Cassandra
          9. Saving datasets to Cassandra
            1. Saving a collection of tuples to Cassandra
            2. Saving collections to Cassandra
            3. Modifying collections
          10. Saving objects of Cassandra (user defined types)
          11. Scala options to Cassandra options conversion
          12. Saving RDDs as new tables
          13. Cluster deployment
          14. Spark Cassandra use cases
        2. Study case: The Calliope project
          1. Installing Calliope
          2. CQL3
            1. Read from Cassandra with CQL3
            2. Write to Cassandra with CQL3
          3. Thrift
            1. Read from Cassandra with Thrift
            2. Write to Cassandra with Thrift
          4. Calliope SQL context creation
          5. Calliope SQL Configuration
            1. Loading Cassandra tables programmatically
        3. Summary
      14. 8. Study Case 2 - Connectors
        1. Akka and Cassandra
          1. Writing to Cassandra
          2. Reading from Cassandra
          3. Connecting to Cassandra
          4. Scanning tweets
          5. Testing the scanner
        2. Akka and Spark
        3. Kafka and Akka
        4. Kafka and Cassandra
        5. Summary
      15. 9. Study Case 3 - Mesos and Docker
        1. Mesos frameworks API
          1. Authentication, authorization, and access control
          2. Framework authentication
          3. Authentication configuration
          4. Framework authorization
          5. Access control lists
        2. Spark Mesos run modes
          1. Coarse-grained
          2. Fine-grained
        3. Apache Mesos API
          1. Scheduler HTTP API
            1. Requests
              1. SUBSCRIBE
              2. TEARDOWN
              3. ACCEPT
              4. DECLINE
              5. REVIVE
              6. KILL
              7. SHUTDOWN
              8. ACKNOWLEDGE
              9. RECONCILE
              10. MESSAGE
              11. REQUEST
            2. Responses
              1. SUBSCRIBED
              2. OFFERS
              3. RESCIND
              4. UPDATE
              5. MESSAGE
              6. FAILURE
              7. ERROR
              8. HEARTBEAT
        4. Mesos containerizers
          1. Containers
        5. Docker containerizers
          1. Containers and containerizers
          2. Types of containerizers
          3. Creating containerizers
          4. Mesos containerizer
            1. Launching Mesos containerizer
            2. Architecture of Mesos containerizer
              1. Shared filesystem
              2. PID namespace
              3. Posix disk
          5. Docker containerizers
            1. Docker containerizer setup
            2. Launching the Docker containerizers
          6. Composing containerizers
        6. Summary

    Product information

    • Title: Fast Data Processing Systems with SMACK Stack
    • Author(s): Raúl Estrada
    • Release date: December 2016
    • Publisher(s): Packt Publishing
    • ISBN: 9781786467201