Rebuilding Reliable Data Pipelines Through Modern Tools

Book description

When data-driven applications fail, identifying the cause is both challenging and time-consuming, especially as data pipelines become increasingly complex. Hunting for the root cause of an application failure in messy, raw, and distributed logs is difficult for performance experts and a nightmare for data operations teams. This report examines DataOps processes and tools that enable you to manage modern data pipelines efficiently.

Author Ted Malaska describes a data operations framework and shows why testing and monitoring are essential to planning, rebuilding, automating, and then managing robust data pipelines, whether they run in the cloud, on premises, or in a hybrid configuration. You'll also learn how to apply performance monitoring software and AI to your data pipelines to keep your applications running reliably.

You’ll learn:

  • How performance management software can reduce the risk of running modern data applications
  • Methods for applying AI to deliver insights, recommendations, and automation that help operationalize big data systems and data applications
  • How to plan, migrate, and operate big data workloads and data pipelines in the cloud and in hybrid deployment models

Table of contents

  1. Introduction
    1. Who Should Read This Book?
      1. Data Architects
      2. Data Engineers
      3. Data Analysts
      4. Data Scientists
      5. Product Managers
      6. Data Operations Engineers
    2. Outline and Goals of This Book
      1. Chapter 2: How We Got Here
      2. Chapter 3: The Data Ecosystem Landscape
      3. Chapter 4: Data Processing at Its Core
      4. Chapter 5: Identifying Job Issues
      5. Chapter 6: Identifying Workflow and Pipeline Issues
      6. Chapter 7: Watching and Learning from Your Jobs
      7. Chapter 8: Closing Thoughts
  2. How We Got Here
    1. Excel Spreadsheets
    2. Databases
    3. Appliances
    4. Extract, Transform, and Load Platforms
      1. The Processing Pipeline
    5. Kafka, Spark, Hadoop, SQL, and NoSQL Platforms
    6. Cloud, On-Premises, and Hybrid Environments
    7. Machine Learning, Artificial Intelligence, Advanced Business Intelligence, Internet of Things
    8. Producers and Considerations
    9. Consumers and Considerations
    10. Summary
  3. The Data Ecosystem Landscape
    1. The Chef, the Refrigerator, and the Oven
    2. The Chef: Design Time and Metadata Management
    3. The Refrigerator: Publishing and Persistence
    4. The Oven: Access and Processing
      1. Getting Our Data
      2. How Do You Process Data?
      3. Where Do You Process Data?
    5. Ecosystem and Data Pipelines
      1. The Chef and the Pipeline
      2. The Refrigerator and the Pipeline
      3. The Oven and the Pipeline
    6. Summary
  4. Data Processing at Its Core
    1. What Is a DAG?
    2. Single-Job DAGs
      1. DAGs as Recipes
      2. DAG Operations/Transformations/Actions
    3. Pipeline DAGs
      1. No Direct Lines
      2. Start and End with Storage
      3. Storage Reuse
    4. Summary
  5. Identifying Job Issues
    1. Bottlenecks
      1. Round Tripping
      2. Inputs and Outputs
      3. Over the Wire
      4. Parallelism
      5. Driver Operations
      6. Skew
      7. Nonlinear Operations
    2. Failures
      1. Input Failures
      2. Environment Errors
      3. Resource Failures
    3. Summary
  6. Identifying Workflow and Pipeline Issues
    1. Considerations of Budgets and Isolation
      1. Node Isolation
    2. Container Isolation
      1. Scheduling Containers
      2. Scaling Considerations for Containers
      3. Limits to Isolation in Containers
    3. Process Isolation
    4. Considerations of Dependent Jobs
      1. Dependency Management
    5. Summary
  7. Watching and Learning from Your Jobs
    1. Cultural Considerations of Collecting Data Processing Metrics
      1. Make It Piece by Piece
      2. Make It Easy
      3. Make It a Requirement
      4. When Things Go South: Asking for Data
    2. What Metrics to Collect
      1. Job Execution Events
      2. Job Execution Information
      3. Job Meta Information
      4. Data About the Data Going In and Out of Jobs
      5. Job Optimization Information
      6. Resource Cost
      7. Operational Cost
      8. Labeling Operational Data
      9. Techniques for Capturing Labels
  8. Closing Thoughts

Product information

  • Title: Rebuilding Reliable Data Pipelines Through Modern Tools
  • Author(s): Ted Malaska
  • Release date: July 2019
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492058168