Cost-Effective Data Pipelines

Book description

The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?

With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring.

By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products. This book helps you:

  • Reduce cloud spend with lower-cost cloud service offerings and smart design strategies
  • Minimize waste without sacrificing performance by rightsizing compute resources
  • Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
  • Set up development and test environments that minimize cloud service dependencies
  • Create data pipeline code bases that are testable and extensible, fostering rapid development and evolution
  • Improve data quality and pipeline operation through validation and testing

Table of contents

  1. Preface
    1. Who This Book Is For
    2. What You Will Learn
    3. What This Book Is Not
    4. Running Example
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Online Learning
    8. How to Contact Us
    9. Acknowledgments
  2. 1. Designing Compute for Data Pipelines
    1. Understanding Availability of Cloud Compute
      1. Outages
      2. Capacity Limits
      3. Account Limits
      4. Infrastructure
    2. Leveraging Different Purchasing Options in Pipeline Design
      1. On Demand
      2. Spot/Interruptible
      3. Contractual Discounts
      4. Contractual Discounts in the Real World: A Cautionary Tale
    3. Requirements Gathering for Compute Design
      1. Business Requirements
      2. Architectural Requirements
      3. Requirements-Gathering Example: HoD Batch Ingest
    4. Benchmarking
      1. Instance Family Identification
      2. Cluster Sizing
      3. Monitoring
    5. Benchmarking Example
      1. Undersized
      2. Oversized
      3. Right-Sized
    6. Summary
    7. Recommended Readings
  3. 2. Responding to Changes in Demand by Scaling Compute
    1. Identifying Scaling Opportunities
      1. Variation in Data Pipelines
      2. Scaling Metrics
      3. Pipeline Scaling Example
    2. Designing for Scaling
    3. Implementing Scaling Plans
      1. Scaling Mechanics
      2. Common Autoscaling Pitfalls
    4. Autoscaling Example
    5. Summary
    6. Recommended Readings
  4. 3. Data Organization in the Cloud
    1. Cloud Storage Costs
      1. Storage at Rest
      2. Egress
      3. Data Access
    2. Cloud Storage Organization
      1. Storage Bucket Strategies
      2. Lifecycle Configurations
    3. File Structure Design
      1. File Formats
      2. Partitioning
      3. Compaction
    4. Summary
    5. Recommended Readings
  5. 4. Economical Pipeline Fundamentals
    1. Idempotency
      1. Preventing Data Duplication
      2. Tolerating Data Duplication
    2. Checkpointing
    3. Automatic Retries
      1. Retry Considerations
      2. Retry Levels in Data Pipelines
    4. Data Validation
      1. Validating Data Characteristics
      2. Schemas
    5. Summary
  6. 5. Setting Up Effective Development Environments
    1. Environments
      1. Software Environments
      2. Data Environments
      3. Data Pipeline Environments
      4. Environment Planning
    2. Local Development
      1. Containers
      2. Resource Dependency Reduction
      3. Resource Cleanup
    3. Summary
  7. 6. Software Development Strategies
    1. Managing Different Coding Environments
      1. Example: A Multimodal Pipeline
    2. Example: How Code Becomes Difficult to Change
    3. Modular Design
      1. Single Responsibility
      2. Dependency Inversion
      3. Modular Design with DataFrames
    4. Configurable Design
    5. Summary
    6. Recommended Readings
  8. 7. Unit Testing
    1. The Role of Unit Testing in Data Pipelines
      1. Unit Testing Overview
      2. Example: Identifying Unit Testing Needs
    2. Pipeline Areas to Unit-Test
      1. Data Logic
      2. Connections
      3. Observability
      4. Data Modification Processes
      5. Cloud Components
    3. Working with Dependencies
      1. Interfaces
      2. Data
    4. Example: Unit Testing Plan
      1. Identifying Components to Test
      2. Identifying Dependencies
    5. Summary
  9. 8. Mocks
    1. Considerations for Replacing Dependencies
      1. Placement
      2. Dependency Stability
      3. Complexity Versus Criticality
    2. Mocking Generic Interfaces
      1. Responses
      2. Requests
      3. Connectivity
    3. Mocking Cloud Services
      1. Building Your Own Mocks
      2. Mocking with Moto
    4. Testing with Databases
      1. Test Database Example
      2. Working with Test Databases
    5. Summary
    6. Further Exploration
      1. More Moto Mocks
      2. Mock Placement
  10. 9. Data for Testing
    1. Working with Live Data
      1. Benefits
      2. Challenges
    2. Working with Synthetic Data
      1. Benefits
      2. Challenges
      3. Is Synthetic Data the Right Approach?
    3. Manual Data Generation
    4. Automated Data Generation
      1. Synthetic Data Libraries
      2. Schema-Driven Generation
    5. Property-Based Testing
    6. Summary
  11. 10. Logging
    1. Logging Costs
      1. Impact of Scale
      2. Impact of Cloud Storage Elasticity
    2. Reducing Logging Costs
    3. Effective Logging
    4. Summary
  12. 11. Finding Your Way with Monitoring
    1. Costs of Inadequate Monitoring
      1. Getting Lost in the Woods
      2. Navigation to the Rescue
    2. System Monitoring
      1. Data Volume
      2. Throughput
      3. Consumer Lag
      4. Worker Utilization
    3. Resource Monitoring
      1. Understanding the Bounds
      2. Understanding Reliability Impacts
    4. Pipeline Performance
      1. Pipeline Stage Duration
      2. Profiling
      3. Errors to Watch Out For
    5. Query Monitoring
    6. Minimizing Monitoring Costs
    7. Summary
    8. Recommended Readings
  13. 12. Essential Takeaways
    1. An Ounce of Prevention Is Worth a Pound of Cure
      1. Rein In Compute Spend
      2. Organize Your Resources
      3. Design for Interruption
      4. Build In Data Quality
    2. Change Is the Only Constant
      1. Design for Change
      2. Monitor for Change
    3. Parting Thoughts
  14. Appendix. Preparing a Cloud Budget
    1. It’s All About the Details
      1. Historical Data
      2. Estimating for New Projects
      3. Changes That Impact Costs
    2. Creating a Budget
      1. Budget Summary
      2. Changes Between Previous and Next Budget Periods
      3. Cost Breakdown
    3. Communicating the Budget
    4. Summary
  15. Index
  16. About the Author

Product information

  • Title: Cost-Effective Data Pipelines
  • Author(s): Sev Leonard
  • Release date: July 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492098645