Cost-Effective Data Pipelines

Name: Cost-Effective Data Pipelines
Author: Sev Leonard
ISBN: 9781492098645

July 2023

Intermediate to advanced

286 pages

7h 52m

English

O'Reilly Media, Inc.

Preface
Who This Book Is For; What You Will Learn; What This Book Is Not; Running Example; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Designing Compute for Data Pipelines
Understanding Availability of Cloud Compute; Outages; Capacity Limits; Account Limits; Infrastructure; Leveraging Different Purchasing Options in Pipeline Design; On Demand; Spot/Interruptible; Contractual Discounts; Contractual Discounts in the Real World: A Cautionary Tale; Requirements Gathering for Compute Design; Business Requirements; Architectural Requirements; Requirements-Gathering Example: HoD Batch Ingest; Benchmarking; Instance Family Identification; Cluster Sizing; Monitoring; Benchmarking Example; Undersized; Oversized; Right-Sized; Summary; Recommended Readings
2. Responding to Changes in Demand by Scaling Compute
Identifying Scaling Opportunities; Variation in Data Pipelines; Scaling Metrics; Pipeline Scaling Example; Designing for Scaling; Implementing Scaling Plans; Scaling Mechanics; Common Autoscaling Pitfalls; Autoscaling Example; Summary; Recommended Readings
3. Data Organization in the Cloud
Cloud Storage Costs; Storage at Rest; Egress; Data Access; Cloud Storage Organization; Storage Bucket Strategies; Lifecycle Configurations; File Structure Design; File Formats; Partitioning; Compaction; Summary; Recommended Readings
4. Economical Pipeline Fundamentals
Idempotency; Preventing Data Duplication; Tolerating Data Duplication; Checkpointing; Automatic Retries; Retry Considerations; Retry Levels in Data Pipelines; Data Validation; Validating Data Characteristics; Schemas; Summary
5. Setting Up Effective Development Environments
Environments; Software Environments; Data Environments; Data Pipeline Environments; Environment Planning; Local Development; Containers; Resource Dependency Reduction; Resource Cleanup; Summary
6. Software Development Strategies
Managing Different Coding Environments; Example: A Multimodal Pipeline; Example: How Code Becomes Difficult to Change; Modular Design; Single Responsibility; Dependency Inversion; Modular Design with DataFrames; Configurable Design; Summary; Recommended Readings
7. Unit Testing
The Role of Unit Testing in Data Pipelines; Unit Testing Overview; Example: Identifying Unit Testing Needs; Pipeline Areas to Unit-Test; Data Logic; Connections; Observability; Data Modification Processes; Cloud Components; Working with Dependencies; Interfaces; Data; Example: Unit Testing Plan; Identifying Components to Test; Identifying Dependencies; Summary
8. Mocks
Considerations for Replacing Dependencies; Placement; Dependency Stability; Complexity Versus Criticality; Mocking Generic Interfaces; Responses; Requests; Connectivity; Mocking Cloud Services; Building Your Own Mocks; Mocking with Moto; Testing with Databases; Test Database Example; Working with Test Databases; Summary; Further Exploration; More Moto Mocks; Mock Placement
9. Data for Testing
Working with Live Data; Benefits; Challenges; Working with Synthetic Data; Benefits; Challenges; Is Synthetic Data the Right Approach?; Manual Data Generation; Automated Data Generation; Synthetic Data Libraries; Schema-Driven Generation; Property-Based Testing; Summary
10. Logging
Logging Costs; Impact of Scale; Impact of Cloud Storage Elasticity; Reducing Logging Costs; Effective Logging; Summary
11. Finding Your Way with Monitoring
Costs of Inadequate Monitoring; Getting Lost in the Woods; Navigation to the Rescue; System Monitoring; Data Volume; Throughput; Consumer Lag; Worker Utilization; Resource Monitoring; Understanding the Bounds; Understanding Reliability Impacts; Pipeline Performance; Pipeline Stage Duration; Profiling; Errors to Watch Out For; Query Monitoring; Minimizing Monitoring Costs; Summary; Recommended Readings
12. Essential Takeaways
An Ounce of Prevention Is Worth a Pound of Cure; Reign In Compute Spend; Organize Your Resources; Design for Interruption; Build In Data Quality; Change Is the Only Constant; Design for Change; Monitor for Change; Parting Thoughts
Appendix. Preparing a Cloud Budget
It’s All About the Details; Historical Data; Estimating for New Projects; Changes That Impact Costs; Creating a Budget; Budget Summary; Changes Between Previous and Next Budget Periods; Cost Breakdown; Communicating the Budget; Summary
Index
About the Author

Content preview from Cost-Effective Data Pipelines

Chapter 12. Essential Takeaways

As you might imagine, my initial research for this book involved reading a lot of material about the cost of cloud services. From shell-shocked graduate students grappling with unexpected bills to large companies feeling trapped with substantial, expensive cloud deployments, it was clear that developing data pipelines in the cloud can be daunting.

It reminds me of learning to ride waves on a bodyboard when I was a kid. Similar to surfing, riding waves on a bodyboard requires that you develop a sense of when to start paddling to catch a wave at the right time. If you don’t time it right, you can miss the wave or get dunked when the wave crashes on top of you.

I got dunked a lot in the beginning, ending up with a nose full of saltwater, but gradually I got better. I developed a sense of how the strength of the undertow related to the incoming wave. I figured out how to angle the board to get a better ride. Sometimes I still got dunked.

This was how I felt when I started working in the cloud, a few years after I began working on data pipelines. The steep learning curve was no joke. A big motivation for writing this book was wishing I had something like it at the time. Data pipelines and cloud development are two big topics on their own, let alone together. Add to it the desire to cut costs and you’ve got quite a lot to digest.

In reflecting on the last 240ish pages, I want to wrap things up by distilling this volume down to what I consider to be the ...

Publisher Resources

ISBN: 9781492098638