Cost-Effective Data Pipelines

Author: Sev Leonard
ISBN: 9781492098645

July 2023

Intermediate to advanced

286 pages

7h 52m

English

O'Reilly Media, Inc.

Preface
Who This Book Is For; What You Will Learn; What This Book Is Not; Running Example; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Designing Compute for Data Pipelines
Understanding Availability of Cloud Compute; Outages; Capacity Limits; Account Limits; Infrastructure; Leveraging Different Purchasing Options in Pipeline Design; On Demand; Spot/Interruptible; Contractual Discounts; Contractual Discounts in the Real World: A Cautionary Tale; Requirements Gathering for Compute Design; Business Requirements; Architectural Requirements; Requirements-Gathering Example: HoD Batch Ingest; Benchmarking; Instance Family Identification; Cluster Sizing; Monitoring; Benchmarking Example; Undersized; Oversized; Right-Sized; Summary; Recommended Readings
2. Responding to Changes in Demand by Scaling Compute
Identifying Scaling Opportunities; Variation in Data Pipelines; Scaling Metrics; Pipeline Scaling Example; Designing for Scaling; Implementing Scaling Plans; Scaling Mechanics; Common Autoscaling Pitfalls; Autoscaling Example; Summary; Recommended Readings
3. Data Organization in the Cloud
Cloud Storage Costs; Storage at Rest; Egress; Data Access; Cloud Storage Organization; Storage Bucket Strategies; Lifecycle Configurations; File Structure Design; File Formats; Partitioning; Compaction; Summary; Recommended Readings
4. Economical Pipeline Fundamentals
Idempotency; Preventing Data Duplication; Tolerating Data Duplication; Checkpointing; Automatic Retries; Retry Considerations; Retry Levels in Data Pipelines; Data Validation; Validating Data Characteristics; Schemas; Summary
5. Setting Up Effective Development Environments
Environments; Software Environments; Data Environments; Data Pipeline Environments; Environment Planning; Local Development; Containers; Resource Dependency Reduction; Resource Cleanup; Summary
6. Software Development Strategies
Managing Different Coding Environments; Example: A Multimodal Pipeline; Example: How Code Becomes Difficult to Change; Modular Design; Single Responsibility; Dependency Inversion; Modular Design with DataFrames; Configurable Design; Summary; Recommended Readings
7. Unit Testing
The Role of Unit Testing in Data Pipelines; Unit Testing Overview; Example: Identifying Unit Testing Needs; Pipeline Areas to Unit-Test; Data Logic; Connections; Observability; Data Modification Processes; Cloud Components; Working with Dependencies; Interfaces; Data; Example: Unit Testing Plan; Identifying Components to Test; Identifying Dependencies; Summary
8. Mocks
Considerations for Replacing Dependencies; Placement; Dependency Stability; Complexity Versus Criticality; Mocking Generic Interfaces; Responses; Requests; Connectivity; Mocking Cloud Services; Building Your Own Mocks; Mocking with Moto; Testing with Databases; Test Database Example; Working with Test Databases; Summary; Further Exploration; More Moto Mocks; Mock Placement
9. Data for Testing
Working with Live Data; Benefits; Challenges; Working with Synthetic Data; Benefits; Challenges; Is Synthetic Data the Right Approach?; Manual Data Generation; Automated Data Generation; Synthetic Data Libraries; Schema-Driven Generation; Property-Based Testing; Summary
10. Logging
Logging Costs; Impact of Scale; Impact of Cloud Storage Elasticity; Reducing Logging Costs; Effective Logging; Summary
11. Finding Your Way with Monitoring
Costs of Inadequate Monitoring; Getting Lost in the Woods; Navigation to the Rescue; System Monitoring; Data Volume; Throughput; Consumer Lag; Worker Utilization; Resource Monitoring; Understanding the Bounds; Understanding Reliability Impacts; Pipeline Performance; Pipeline Stage Duration; Profiling; Errors to Watch Out For; Query Monitoring; Minimizing Monitoring Costs; Summary; Recommended Readings
12. Essential Takeaways
An Ounce of Prevention Is Worth a Pound of Cure; Reign In Compute Spend; Organize Your Resources; Design for Interruption; Build In Data Quality; Change Is the Only Constant; Design for Change; Monitor for Change; Parting Thoughts
Appendix. Preparing a Cloud Budget
It’s All About the Details; Historical Data; Estimating for New Projects; Changes That Impact Costs; Creating a Budget; Budget Summary; Changes Between Previous and Next Budget Periods; Cost Breakdown; Communicating the Budget; Summary
Index
About the Author

Overview

The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?

With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring.

By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products. This book helps you:

Reduce cloud spend with lower-cost cloud service offerings and smart design strategies
Minimize waste without sacrificing performance by rightsizing compute resources
Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
Set up development and test environments that minimize cloud service dependencies
Create data pipeline code bases that are testable and extensible, fostering rapid development and evolution
Improve data quality and pipeline operation through validation and testing

Publisher Resources

ISBN: 9781492098638
