Building Real-Time Data Pipelines

Book description

Traditional data processing infrastructures—especially those that support applications—weren’t designed for our mobile, streaming, and online world. This O’Reilly report examines how today’s distributed, in-memory database management systems (IMDBMS) enable you to make quick decisions based on real-time data.

In this report, executives from MemSQL Inc. provide options for using in-memory architectures to build real-time data pipelines. If you want to instantly track user behavior on websites or mobile apps, generate reports on a changing dataset, or detect anomalous activity in your system as it occurs, you’ll learn valuable lessons from some of the largest and most successful tech companies focused on in-memory databases.

  • Explore the architectural principles of modern in-memory databases
  • Understand what’s involved in moving from data silos to real-time data pipelines
  • Run transactions and analytics in a single database, without ETL
  • Minimize complexity by architecting a multipurpose data infrastructure
  • Learn guiding principles for developing an optimally architected operational system
  • Provide persistence and high availability mechanisms for real-time data
  • Choose an in-memory architecture flexible enough to scale across a variety of deployment options

Conor Doherty, Data Engineer at MemSQL, is responsible for creating content around database innovation, analytics, and distributed systems.

Gary Orenstein, Chief Marketing Officer at MemSQL, leads marketing strategy, product management, communications, and customer engagement.

Kevin White is the Director of Operations and a content contributor at MemSQL.

Steven Camiña is a Principal Product Manager at MemSQL. His experience spans B2B enterprise solutions, including databases and middleware platforms.

Table of contents

  1. Introduction
  2. 1. When to Use In-Memory Database Management Systems (IMDBMS)
    1. Improving Traditional Workloads with In-Memory Databases
      1. Online Transaction Processing (OLTP)
      2. Online Analytical Processing (OLAP)
      3. HTAP: Bringing OLTP and OLAP Together
    2. Modern Workloads
    3. The Need for HTAP-Capable Systems
      1. In-Memory Enables HTAP
    4. Common Application Use Cases
      1. Real-Time Analytics
      2. Risk Management
      3. Personalization
      4. Portfolio Tracking
      5. Monitoring and Detection
      6. Conclusion
  3. 2. First Principles of Modern In-Memory Databases
    1. The Need for a New Approach
    2. Architectural Principles of Modern In-Memory Databases
      1. In-Memory
      2. Distributed Systems
      3. Relational with Multimodel
      4. Mixed Media
    3. Conclusion
  4. 3. Moving from Data Silos to Real-Time Data Pipelines
    1. The Enterprise Architecture Gap
    2. Real-Time Pipelines and Converged Processing
    3. Stream Processing, with Context
    4. Conclusion
  5. 4. Processing Transactions and Analytics in a Single Database
    1. Requirements for Converged Processing
      1. In-Memory Storage
      2. Access to Real-Time and Historical Data
      3. Compiled Query Execution Plans
      4. Granular Concurrency Control 
      5. Fault Tolerance and ACID Compliance
    2. Benefits of Converged Processing
      1. Enabling New Sources of Revenue
      2. Reducing Administrative and Development Overhead
      3. Simplifying Infrastructure
    3. Conclusion
  6. 5. Spark
    1. Background
    2. Characteristics of Spark 
    3. Understanding Databases and Spark 
    4. Other Use Cases
    5. Conclusion
  7. 6. Architecting Multipurpose Infrastructure
    1. Multimodal Systems
    2. Multimodel Systems
    3. Tiered Storage
    4. The Real-Time Trinity: Apache Kafka, Spark, and an Operational Database 
    5. Conclusion
  8. 7. Getting to Operational Systems
    1. Have Fewer Systems Doing More 
    2. Modern Technologies Enable Real-Time Programmatic Decision Making
    3. Modern Technologies Enable Ad-Hoc Reporting on Live Data
    4. Conclusion
  9. 8. Data Persistence and Availability
    1. Data Durability
    2. Data Availability
    3. Data Backups
    4. Conclusion
  10. 9. Choosing the Best Deployment Option
    1. Considerations for Bare Metal
    2. Virtual Machine (VM) and Container Considerations
      1. Orchestration Frameworks
    3. Considerations for Cloud or On-Premises Deployments
      1. Benefits of Cloud: Expansion and Flexibility
      2. Benefits of On-Premises: Control, Security, Performance Optimization, and Predictability
    4. Choosing the Right Storage Medium
      1. RAM
      2. SSD and Disk
    5. Deployment Conclusions
  11. 10. Conclusion
    1. Recommended Next Steps

Product information

  • Title: Building Real-Time Data Pipelines
  • Author(s): Gary Orenstein, Conor Doherty, Kevin White, Steven Camiña
  • Release date: November 2015
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491935491