The Cloud Data Lake

Name: The Cloud Data Lake
Author: Rukmani Gopalan
ISBN: 9781098116583

by Rukmani Gopalan

December 2022

Beginner to intermediate

244 pages

English

O'Reilly Media, Inc.

Preface
Why I Wrote This Book; Who Should Read This Book?; Introducing Klodars Corporation; Navigating the Book; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Big Data—Beyond the Buzz
What Is Big Data?; Elastic Data Infrastructure—The Challenge; Cloud Computing Fundamentals; Cloud Computing Terminology; Value Proposition of the Cloud; Cloud Data Lake Architecture; Limitations of On-Premises Data Warehouse Solutions; What Is a Cloud Data Lake Architecture?; Benefits of a Cloud Data Lake Architecture; Defining Your Cloud Data Lake Journey; Summary
2. Big Data Architectures on the Cloud
Why Klodars Corporation Moves to the Cloud; Fundamentals of Cloud Data Lake Architectures; A Word on Variety of Data; Cloud Data Lake Storage; Big Data Analytics Engines; Cloud Data Warehouses; Modern Data Warehouse Architecture; Reference Architecture; Sample Use Case for a Modern Data Warehouse Architecture; Benefits and Challenges of Modern Data Warehouse Architecture; Data Lakehouse Architecture; Reference Architecture for the Data Lakehouse; Sample Use Case for Data Lakehouse Architecture; Benefits and Challenges of the Data Lakehouse Architecture; Data Warehouses and Unstructured Data; Data Mesh; Reference Architecture; Sample Use Case for a Data Mesh Architecture; Challenges and Benefits of a Data Mesh Architecture; What Is the Right Architecture for Me?; Know Your Customers; Know Your Business Drivers; Consider Your Growth and Future Scenarios; Design Considerations; Hybrid Approaches; Summary
3. Design Considerations for Your Data Lake
Setting Up the Cloud Data Lake Infrastructure; Identify Your Goals; Plan Your Architecture and Deliverables; Implement the Cloud Data Lake; Release and Operationalize; Organizing Data in Your Data Lake; A Day in the Life of Data; Data Lake Zones; Organization Mechanisms; Introduction to Data Governance; Actors Involved in Data Governance; Data Classification; Metadata Management, Data Catalog, and Data Sharing; Data Access Management; Data Quality and Observability; Data Governance at Klodars Corporation; Data Governance Wrap-Up; Manage Data Lake Costs; Demystifying Data Lake Costs on the Cloud; Data Lake Cost Strategy; Summary
4. Scalable Data Lakes
A Sneak Peek into Scalability; What Is Scalability?; Scale in Our Day-to-Day Life; Scalability in Data Lake Architectures; Internals of Data Lake Processing Systems; Data Copy Internals; ELT/ETL Processing Internals; A Note on Other Interactive Queries; Considerations for Scalable Data Lake Solutions; Pick the Right Cloud Offerings; Plan for Peak Capacity; Data Formats and Job Profile; Summary
5. Optimizing Cloud Data Lake Architectures for Performance
Basics of Measuring Performance; Goals and Metrics for Performance; Measuring Performance; Optimizing for Faster Performance; Cloud Data Lake Performance; SLAs, SLOs, and SLIs; Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs; Drivers of Performance; Performance Drivers for a Copy Job; Performance Drivers for a Spark Job; Optimization Principles and Techniques for Performance Tuning; Data Formats; Data Organization and Partitioning; Choosing the Right Configurations on Apache Spark; Minimize Overheads with Data Transfer; Premium Offerings and Performance; The Case of Bigger Virtual Machines; The Case of Flash Storage; Summary
6. Deep Dive on Data Formats
Why Do We Need These Open Data Formats?; Why Do We Need to Store Tabular Data?; Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?; Delta Lake; Why Was Delta Lake Founded?; How Does Delta Lake Work?; When Do You Use Delta Lake?; Apache Iceberg; Why Was Apache Iceberg Founded?; How Does Apache Iceberg Work?; When Do You Use Apache Iceberg?; Apache Hudi; Why Was Apache Hudi Founded?; How Does Apache Hudi Work?; When Do You Use Apache Hudi?; Summary
7. Decision Framework for Your Architecture
Cloud Data Lake Assessment; Cloud Data Lake Assessment Questionnaire; Analysis for Your Cloud Data Lake Assessment; Starting from Scratch; Migrating an Existing Data Lake or Data Warehouse to the Cloud; Improving an Existing Cloud Data Lake; Phase 1 of Decision Framework: Assess; Understand Customer Requirements; Understand Opportunities for Improvement; Know Your Business Drivers; Complete the Assess Phase by Prioritizing the Requirements; Phase 2 of Decision Framework: Define; Finalize the Design Choices for the Cloud Data Lake; Plan Your Cloud Data Lake Project Deliverables; Phase 3 of Decision Framework: Implement; Phase 4 of Decision Framework: Operationalize; Summary
8. Six Lessons for a Data Informed Future
Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes; Lesson 2: With Great Power Comes Great Responsibility—Data Is No Exception; Lesson 3: Customers Lead Technology, Not the Other Way Around; Lesson 4: Change Is Inevitable, so Be Prepared; Lesson 5: Build Empathy and Prioritize Ruthlessly; Lesson 6: Big Impact Does Not Happen Overnight; Summary
A. Cloud Data Lake Decision Framework Template
Phase 1: Assess Framework; Phase 2: Define Framework; Planning the Cloud Data Lake Deliverables; Phase 3: Implement Framework

Index
About the Author

Content preview from The Cloud Data Lake

Chapter 2. Big Data Architectures on the Cloud

Big data may mean more information, but it also means more false information.

Nassim Taleb

As we learned in Chapter 1, there are two key takeaways about cloud data lakes that set the foundation for this chapter:

A data lake approach starts with the ability to store and process any type of data regardless of its source, size, or structure, thereby allowing an organization to extract high-value insights from many disparate sources of data with variable value density (i.e., signal-to-noise ratio).
Building your data lake on the cloud involves a disaggregated architecture where you assemble different components of IaaS, PaaS, and SaaS solutions together.

Keep in mind that building your cloud data lake solution also gives you many architectural options, each with its own strengths. This article on Future.com provides a comprehensive overview of the various components of a modern data architecture. In this chapter, we will dive deep into some of the more common architectural patterns, covering what they are and what strengths each offers, as applied to a fictitious organization called Klodars Corporation.

Why Klodars Corporation Moves to the Cloud

Klodars Corporation is a thriving company that sells rain gear and other supplies in the Pacific Northwest region. The rapid growth in its business is driving its move to the cloud for the following reasons:

Publisher Resources

ISBN: 9781098116576
