Fast Data Processing Systems with SMACK Stack

Book Description

Combine the power of Spark, Mesos, Akka, Cassandra, and Kafka to build data processing platforms that can take on even your hardest data problems!

About This Book

  • This highly practical guide shows you how to use the best big data technologies to solve your response-critical problems
  • Learn the art of building a cost-effective big data architecture without resorting to complex Greek-letter architectures
  • Use this easy-to-follow guide to build fast data processing systems for your organization

Who This Book Is For

If you are a developer, data architect, or data scientist looking for information on how to integrate the big data stack architecture and how to choose the right technology at every layer, this book is for you.

What You Will Learn

  • Design and implement a fast data pipeline architecture
  • Think about and solve programming challenges in a functional way with Scala
  • Learn to use Akka, the Actor Model implementation for the JVM (see the sketch after this list)
  • Perform in-memory processing and data analysis with Spark to meet modern business demands
  • Build a powerful and effective cluster infrastructure with Mesos and Docker
  • Manage and consume unstructured and NoSQL data sources with Cassandra
  • Produce and consume messages at massive scale with Kafka
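
To give a flavor of the Actor Model mentioned above, here is a minimal, illustrative sketch using the classic Akka actor API; the actor, system, and message names are made up for this example:

    import akka.actor.{Actor, ActorSystem, Props}

    // A tiny actor that reacts to a String message
    class Greeter extends Actor {
      def receive: Receive = {
        case name: String => println(s"Hello, $name")
      }
    }

    object GreeterApp extends App {
      val system = ActorSystem("smack-demo")                    // the actor system hosts all actors
      val greeter = system.actorOf(Props[Greeter], "greeter")   // actor reference
      greeter ! "SMACK"                                          // asynchronous, fire-and-forget message
      system.terminate()  // shut down; a real application would wait for processing to finish
    }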

In Detail

SMACK is an open source full stack for big data architecture. It is a combination of Spark, Mesos, Akka, Cassandra, and Kafka. This stack is the newest approach developers are using to tackle critical real-time analytics for big data. This highly practical guide will teach you how to integrate these technologies to create a highly efficient data analysis system for fast data processing.

We’ll start off with an introduction to SMACK and show you when to use it. First you’ll get to grips with functional thinking and problem solving using Scala. Next you’ll come to understand the Akka architecture. Then you’ll learn how to improve your data architecture and optimize resources using Apache Spark.
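
As an illustration of the kind of in-memory processing Spark enables, here is a minimal word count sketch using the Spark core API; the application name and input path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount extends App {
      // Run locally with as many worker threads as logical cores
      val conf = new SparkConf().setAppName("smack-wordcount").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // Build an RDD from a text file and count the occurrences of each word
      val counts = sc.textFile("input.txt")
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      counts.take(10).foreach(println)
      sc.stop()
    }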

Moving forward, you’ll learn how to achieve linear database scalability with Apache Cassandra. You’ll grasp high-throughput distributed messaging using Apache Kafka. We’ll show you how to build an inexpensive but effective cluster infrastructure with Apache Mesos. Finally, you will deep dive into the different aspects of SMACK using a few case studies.
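
To show what the messaging side looks like in practice, here is a minimal Kafka producer sketch in Scala using the standard Kafka client API; the broker address and topic name are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object SimpleProducer extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")   // Kafka broker to connect to
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      val producer = new KafkaProducer[String, String](props)
      // Send a single message to the "events" topic
      producer.send(new ProducerRecord[String, String]("events", "key-1", "hello SMACK"))
      producer.close()
    }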

By the end of the book, you will be able to integrate all the components of the SMACK stack and use them together to achieve highly effective and fast data processing.

Style and approach

With the help of various industry examples, you will learn about the full stack of big data architecture, covering the important aspects of every technology. Rather than getting incomplete information on individual technologies, you will learn how to integrate them to build effective systems, and you will see how various open source technologies can be combined into inexpensive and fast data processing systems.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Fast Data Processing Systems with SMACK Stack
    1. Fast Data Processing Systems with SMACK Stack
    2. Credits
    3. About the Author
    4. About the Reviewers
    5. www.PacktPub.com
      1. Why subscribe?
    6. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    7. 1. An Introduction to SMACK
      1. Modern data-processing challenges
      2. The data-processing pipeline architecture
        1. The NoETL manifesto
        2. Lambda architecture
        3. Hadoop
      3. SMACK technologies
        1. Apache Spark
        2. Akka
        3. Apache Cassandra
        4. Apache Kafka
        5. Apache Mesos
      4. Changing the data center operations
        1. From scale-up to scale-out
        2. The open-source predominance
        3. Data store diversification
        4. Data gravity and data locality
        5. DevOps rules
      5. Data expert profiles
        1. Data architects
        2. Data engineers
        3. Data analysts
        4. Data scientists
      6. Is SMACK for me?
      7. Summary
    8. 2. The Model - Scala and Akka
      1. The language - Scala
        1. Kata 1 - The collections hierarchy
          1. Sequence
          2. Map
          3. Set
        2. Kata 2 - Choosing the right collection
          1. Sequence
          2. Map
          3. Set
        3. Kata 3 - Iterating with foreach
        4. Kata 4 - Iterating with for
        5. Kata 5 - Iterators
        6. Kata 6 - Transforming with map
        7. Kata 7 - Flattening
        8. Kata 8 - Filtering
        9. Kata 9 - Subsequences
        10. Kata 10 - Splitting
        11. Kata 11 - Extracting unique elements
        12. Kata 12 - Merging
        13. Kata 13 - Lazy views
        14. Kata 14 - Sorting
        15. Kata 15 - Streams
        16. Kata 16 - Arrays
        17. Kata 17 - ArrayBuffer
        18. Kata 18 - Queues
        19. Kata 19 - Stacks
        20. Kata 20 - Ranges
      2. The model - Akka
        1. The Actor Model in a nutshell
        2. Kata 21 - Actors
          1. The actor system
          2. Actor reference
        3. Kata 22 - Actor communication
        4. Kata 23 - Actor life cycle
        5. Kata 24 - Starting actors
        6. Kata 25 - Stopping actors
        7. Kata 26 - Killing actors
        8. Kata 27 - Shutting down the actor system
        9. Kata 28 - Actor monitoring
        10. Kata 29 - Looking up actors
      3. Summary
    9. 3. The Engine - Apache Spark
      1. Spark in single mode
        1. Downloading Apache Spark
        2. Testing Apache Spark
      2. Spark core concepts
      3. Resilient distributed datasets
        1. Running Spark applications
        2. Initializing the Spark context
        3. Spark applications
        4. Running programs
        5. RDD operation
          1. Transformations
          2. Actions
        6. Persistence (caching)
      4. Spark in cluster mode
        1. Runtime architecture
          1. Driver
            1. Dividing a program into tasks
            2. Scheduling tasks on executors
          2. Executor
          3. Cluster manager
          4. Program execution
          5. Application deployment
        2. Standalone cluster manager
          1. Launching the standalone manager
          2. Submitting our application
          3. Configuring resources
          4. Working in the cluster
      5. Spark Streaming
        1. Spark Streaming architecture
        2. Transformations
          1. Stateless transformations
          2. Stateful transformations
            1. Windowed operations
            2. Update state by key
        3. Output operations
        4. Fault-tolerant Spark Streaming
          1. Checkpointing
        5. Spark Streaming performance
          1. Parallelism level
        6. Window size and batch size
        7. Garbage collector
      6. Summary
    10. 4. The Storage - Apache Cassandra
      1. A bit of history
      2. NoSQL
        1. NoSQL or SQL?
        2. CAP Brewer's theorem
      3. Apache Cassandra installation
        1. Data model
        2. Data storage
        3. Installation
        4. DataStax OpsCenter
        5. Creating a key space
      4. Authentication and authorization (roles)
        1. Setting up a simple authentication and authorization
      5. Backup
        1. Compression
      6. Recovery
        1. Restart node
        2. Printing schema
        3. Logs
        4. Configuring log4j
        5. Log file rotation
        6. User activity log
        7. Transaction log
        8. SQL dump
        9. CQL
          1. CQL commands
        10. DBMS Cluster
          1. Deleting the database
            1. CLI delete commands
            2. CQL shell delete commands
        11. DB and DBMS optimization
        12. Bloom filter
        13. Data cache
        14. Java heap tune up
        15. Java garbage collection tune up
        16. Views, triggers, and stored procedures
        17. Client-server architecture
          1. Drivers
      7. Spark-Cassandra connector
        1. Installing the connector
        2. Establishing the connection
        3. Using the connector
      8. Summary
    11. 5. The Broker - Apache Kafka
      1. Introducing Kafka
        1. Features of Apache Kafka
        2. Born to be fast data
        3. Use cases
      2. Installation
        1. Installing Java
        2. Installing Kafka
        3. Importing Kafka
      3. Cluster
        1. Single node - single broker cluster
          1. Starting Zookeeper
          2. Starting the broker
          3. Creating a topic
          4. Starting a producer
          5. Starting a consumer
        2. Single node - multiple broker cluster
          1. Starting the brokers
          2. Creating a topic
          3. Starting a producer
          4. Starting a consumer
        3. Multiple node - multiple broker cluster
        4. Broker properties
      4. Architecture
        1. Segment files
        2. Offset
        3. Leaders
        4. Groups
        5. Log compaction
        6. Kafka design
        7. Message compression
        8. Replication
          1. Asynchronous replication
          2. Synchronous replication
      5. Producers
        1. Producer API
        2. Scala producers
          1. Step 1: Import classes
          2. Step 2: Define properties
          3. Step 3: Build and send the message
          4. Step 4: Create the topic
          5. Step 5: Compile the producer
          6. Step 6: Run the producer
          7. Step 7: Run a consumer
        3. Producers with custom partitioning
          1. Step 1: Import classes
          2. Step 2: Define properties
          3. Step 3: Implement the partitioner class
          4. Step 4: Build and send the message
          5. Step 5: Create the topic
          6. Step 6: Compile the programs
          7. Step 7: Run the producer
          8. Step 8: Run a consumer
        4. Producer properties
      6. Consumers
        1. Consumer API
        2. Simple Scala consumers
          1. Step 1: Import classes
          2. Step 2: Define properties
          3. Step 3: Code the SimpleConsumer
          4. Step 4: Create the topic
          5. Step 5: Compile the program
          6. Step 6: Run the producer
          7. Step 7: Run the consumer
        3. Multithread Scala consumers
          1. Step 1: Import classes
          2. Step 2: Define properties
          3. Step 3: Code the MultiThreadConsumer
          4. Step 4: Create the topic
          5. Step 5: Compile the program
          6. Step 6: Run the producer
          7. Step 7: Run the consumer
            1. Consumer properties
      7. Integration
        1. Integration with Apache Spark
      8. Administration
        1. Cluster tools
        2. Adding servers
        3. Kafka topic tools
        4. Cluster mirroring
      9. Summary
    12. 6. The Manager - Apache Mesos
      1. The Apache Mesos architecture
        1. Frameworks
        2. Existing Mesos frameworks
          1. Frameworks for long running applications
          2. Frameworks for scheduling
          3. Frameworks for storage
        3. Attributes and resources
          1. Attributes
          2. Resources
        4. The Apache Mesos API
          1. Messages
          2. The Executor API
          3. Executor Driver API
          4. The Scheduler API
          5. The Scheduler Driver API
      2. Resource allocation
        1. The DRF algorithm
        2. Weighted DRF algorithm
        3. Resource configuration
        4. Resource reservation
          1. Static reservation
          2. Defining roles
          3. Assigning frameworks to roles
          4. Setting policies
          5. Dynamic reservation
          6. The reserve operation
          7. The unreserve operation
          8. HTTP reserve
          9. HTTP unreserve
      3. Running a Mesos cluster on AWS
        1. AWS instance types
          1. AWS instances launching
        2. Installing Mesos on AWS
        3. Downloading Mesos
        4. Building Mesos
          1. Launching several instances
      4. Running a Mesos cluster on a private data center
        1. Mesos installation
          1. Setting up the environment
          2. Start the master
          3. Start the slaves
          4. Process automation
        2. Common Mesos issues
          1. Missing library dependencies
          2. Directory permissions
          3. Missing library
          4. Debugging
          5. Directory structure
          6. Slaves not connecting with masters
          7. Multiple slaves on the same machine
      5. Scheduling and management frameworks
        1. Marathon
          1. Marathon installation
          2. Installing Apache Zookeeper
          3. Running Marathon in local mode
          4. Multi-node Marathon installation
          5. Running a test application from the web UI
          6. Application scaling
          7. Terminating the application
        2. Chronos
          1. Chronos installation
          2. Job scheduling
        3. Chronos and Marathon
          1. Chronos REST API
            1. Listing running jobs
            2. Starting a job manually
            3. Adding a job
            4. Deleting a job
            5. Deleting all the job tasks
          2. Marathon REST API
            1. Listing the running applications
            2. Adding an application
            3. Changing the application configuration
            4. Deleting the application
      6. Apache Aurora
        1. Installing Aurora
      7. Singularity
        1. Singularity installation
          1. The Singularity configuration file
      8. Apache Spark on Apache Mesos
        1. Submitting jobs in client mode
        2. Submitting jobs in cluster mode
        3. Advanced configuration
      9. Apache Cassandra on Apache Mesos
        1. Advanced configuration
      10. Apache Kafka on Apache Mesos
        1. Kafka log management
      11. Summary
    13. 7. Study Case 1 - Spark and Cassandra
      1. Spark Cassandra connector
        1. Requisites
        2. Preparing Cassandra
        3. SparkContext setup
        4. Cassandra and Spark Streaming
        5. Spark Streaming setup
        6. Cassandra setup
        7. Streaming context creation
        8. Stream creation
          1. Kafka Streams
          2. Akka Streams
          3. Enabling Cassandra
          4. Write the Stream to Cassandra
          5. Read the Stream from Cassandra
        9. Saving datasets to Cassandra
          1. Saving a collection of tuples to Cassandra
          2. Saving collections to Cassandra
          3. Modifying collections
        10. Saving objects of Cassandra (user defined types)
        11. Scala options to Cassandra options conversion
        12. Saving RDDs as new tables
        13. Cluster deployment
        14. Spark Cassandra use cases
      2. Study case: The Calliope project
        1. Installing Calliope
        2. CQL3
          1. Read from Cassandra with CQL3
          2. Write to Cassandra with CQL3
        3. Thrift
          1. Read from Cassandra with Thrift
          2. Write to Cassandra with Thrift
        4. Calliope SQL context creation
        5. Calliope SQL Configuration
          1. Loading Cassandra tables programmatically
      3. Summary
    14. 8. Study Case 2 - Connectors
      1. Akka and Cassandra
        1. Writing to Cassandra
        2. Reading from Cassandra
        3. Connecting to Cassandra
        4. Scanning tweets
        5. Testing the scanner
      2. Akka and Spark
      3. Kafka and Akka
      4. Kafka and Cassandra
      5. Summary
    15. 9. Study Case 3 - Mesos and Docker
      1. Mesos frameworks API
        1. Authentication, authorization, and access control
        2. Framework authentication
        3. Authentication configuration
        4. Framework authorization
        5. Access control lists
      2. Spark Mesos run modes
        1. Coarse-grained
        2. Fine-grained
      3. Apache Mesos API
        1. Scheduler HTTP API
          1. Requests
            1. SUBSCRIBE
            2. TEARDOWN
            3. ACCEPT
            4. DECLINE
            5. REVIVE
            6. KILL
            7. SHUTDOWN
            8. ACKNOWLEDGE
            9. RECONCILE
            10. MESSAGE
            11. REQUEST
          2. Responses
            1. SUBSCRIBED
            2. OFFERS
            3. RESCIND
            4. UPDATE
            5. MESSAGE
            6. FAILURE
            7. ERROR
            8. HEARTBEAT
      4. Mesos containerizers
        1. Containers
      5. Docker containerizers
        1. Containers and containerizers
        2. Types of containerizers
        3. Creating containerizers
        4. Mesos containerizer
          1. Launching Mesos containerizer
          2. Architecture of Mesos containerizer
            1. Shared filesystem
            2. PID namespace
            3. Posix disk
        5. Docker containerizers
          1. Docker containerizer setup
          2. Launching the Docker containerizers
        6. Composing containerizers
      6. Summary