Architecting Data and Machine Learning Platforms

by Marco Tranquillin, Valliappa Lakshmanan, Firat Tekiner

October 2023

Intermediate to advanced

359 pages

10h 21m

English

O'Reilly Media, Inc.

Preface
Why Do You Need a Cloud Data Platform? · Who Is This Book For? · Organization of This Book · Conventions Used in This Book · Using Code Examples · O’Reilly Online Learning · How to Contact Us · Acknowledgments
1. Modernizing Your Data Platform: An Introductory Overview
The Data Lifecycle · The Journey to Wisdom · Water Pipes Analogy · Collect · Store · Process/Transform · Analyze/Visualize · Activate · Limitations of Traditional Approaches · Antipattern: Breaking Down Silos Through ETL · Antipattern: Centralization of Control · Antipattern: Data Marts and Hadoop · Creating a Unified Analytics Platform · Cloud Instead of On-Premises · Drawbacks of Data Marts and Data Lakes · Convergence of DWHs and Data Lakes · Hybrid Cloud · Reasons Why Hybrid Is Necessary · Challenges of Hybrid Cloud · Why Hybrid Can Work · Edge Computing · Applying AI · Machine Learning · Uses of ML · Why Cloud for AI? · Cloud Infrastructure · Democratization · Real Time · MLOps · Core Principles · Summary
2. Strategic Steps to Innovate with Data
Step 1: Strategy and Planning · Strategic Goals · Identify Stakeholders · Change Management · Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach · Why Cloud Costs Less · How Much Are the Savings? · When Does Cloud Help? · Step 3: Break Down Silos · Unifying Data Access · Choosing Storage · Semantic Layer · Step 4: Make Decisions in Context Faster · Batch to Stream · Contextual Information · Cost Management · Step 5: Leapfrog with Packaged AI Solutions · Predictive Analytics · Understanding and Generating Unstructured Data · Personalization · Packaged Solutions · Step 6: Operationalize AI-Driven Workflows · Identifying the Right Balance of Automation and Assistance · Building a Data Culture · Populating Your Data Science Team · Step 7: Product Management for Data · Applying Product Management Principles to Data · 1. Understand and Maintain a Map of Data Flows in the Enterprise · 2. Identify Key Metrics · 3. Agreed Criteria, Committed Roadmap, and Visionary Backlog · 4. Build for the Customers You Have · 5. Don’t Shift the Burden of Change Management · 6. Interview Customers to Discover Their Data Needs · 7. Whiteboard and Prototype Extensively · 8. Build Only What Will Be Used Immediately · 9. Standardize Common Entities and KPIs · 10. Provide Self-Service Capabilities in Your Data Platform · Summary
3. Designing for Your Data Team
Classifying Data Processing Organizations · Data Analysis–Driven Organization · The Vision · The Personas · The Technological Framework · Data Engineering–Driven Organization · The Vision · The Personas · The Technological Framework · Data Science–Driven Organization · The Vision · The Personas · The Technological Framework · Summary
4. A Migration Framework
Modernize Data Workflows · Holistic View · Modernize Workflows · Transform the Workflow Itself · A Four-Step Migration Framework · Prepare and Discover · Assess and Plan · Execute · Optimize · Estimating the Overall Cost of the Solution · Audit of the Existing Infrastructure · Request for Information/Proposal and Quotation · Proof of Concept/Minimum Viable Product · Setting Up Security and Data Governance · Framework · Artifacts · Governance over the Life of the Data · Schema, Pipeline, and Data Migration · Schema Migration · Pipeline Migration · Data Migration · Migration Stages · Summary
5. Architecting a Data Lake
Data Lake and the Cloud—A Perfect Marriage · Challenges with On-Premises Data Lakes · Benefits of Cloud Data Lakes · Design and Implementation · Batch and Stream · Data Catalog · Hadoop Landscape · Cloud Data Lake Reference Architecture · Integrating the Data Lake: The Real Superpower · APIs to Extend the Lake · The Evolution of Data Lake with Apache Iceberg, Apache Hudi, and Delta Lake · Interactive Analytics with Notebooks · Democratizing Data Processing and Reporting · Build Trust in the Data · Data Ingestion Is Still an IT Matter · ML in the Data Lake · Training on Raw Data · Predicting in the Data Lake · Summary
6. Innovating with an Enterprise Data Warehouse
A Modern Data Platform · Organizational Goals · Technological Challenges · Technology Trends and Tools · Hub-and-Spoke Architecture · Data Ingest · Business Intelligence · Transformations · Organizational Structure · DWH to Enable Data Scientists · Query Interface · Storage API · ML Without Moving Your Data · Summary
7. Converging to a Lakehouse
The Need for a Unique Architecture · User Personas · Antipattern: Disconnected Systems · Antipattern: Duplicated Data · Converged Architecture · Two Forms · Lakehouse on Cloud Storage · SQL-First Lakehouse · The Benefits of Convergence · Summary
8. Architectures for Streaming
The Value of Streaming · Industry Use Cases · Streaming Use Cases · Streaming Ingest · Streaming ETL · Streaming ELT · Streaming Insert · Streaming from Edge Devices (IoT) · Streaming Sinks · Real-Time Dashboards · Live Querying · Materialize Some Views · Stream Analytics · Time-Series Analytics · Clickstream Analytics · Anomaly Detection · Resilient Streaming · Continuous Intelligence Through ML · Training Model on Streaming Data · Streaming ML Inference · Automated Actions · Summary
9. Extending a Data Platform Using Hybrid and Edge
Why Multicloud? · A Single Cloud Is Simpler and Cost-Effective · Multicloud Is Inevitable · Multicloud Could Be Strategic · Multicloud Architectural Patterns · Single Pane of Glass · Write Once, Run Anywhere · Bursting from On Premises to Cloud · Pass-Through from On Premises to Cloud · Data Integration Through Streaming · Adopting Multicloud · Framework · Time Scale · Define a Target Multicloud Architecture · Why Edge Computing? · Bandwidth, Latency, and Patchy Connectivity · Use Cases · Benefits · Challenges · Edge Computing Architectural Patterns · Smart Devices · Smart Gateways · ML Activation · Adopting Edge Computing · The Initial Context · The Project · The Final Outcomes and Next Steps · Summary
10. AI Application Architecture
Is This an AI/ML Problem? · Subfields of AI · Generative AI · Problems Fit for ML · Buy, Adapt, or Build? · Data Considerations · When to Buy · What Can You Buy? · How Adapting Works · AI Architectures · Understanding Unstructured Data · Generating Unstructured Data · Predicting Outcomes · Forecasting Values · Anomaly Detection · Personalization · Automation · Responsible AI · AI Principles · ML Fairness · Explainability · Summary
11. Architecting an ML Platform
ML Activities · Developing ML Models · Labeling Environment · Development Environment · User Environment · Preparing Data · Training ML Models · Deploying ML Models · Deploying to an Endpoint · Evaluate Model · Hybrid and Multicloud · Training-Serving Skew · Automation · Automate Training and Deployment · Orchestration with Pipelines · Continuous Evaluation and Training · Choosing the ML Framework · Team Skills · Task Considerations · User-Centric · Summary
12. Data Platform Modernization: A Model Case
New Technology for a New Era · The Need for Change · It Is Not Only a Matter of Technology · The Beginning of the Journey · The Current Environment · The Target Environment · The PoC Use Case · The RFP Responses Proposed by Cloud Vendors · The Target Environment · The Approach on Migration · The RFP Evaluation Process · The Scope of the PoC · The Execution of the PoC · The Final Decision · Peroration · Summary
Index
About the Authors

Content preview from Architecting Data and Machine Learning Platforms

Chapter 5. Architecting a Data Lake

A data lake is the part of the data platform that captures raw, ungoverned data from across an organization and supports compute tools from the Apache ecosystem. In this chapter, we will go into more detail about this concept, which is important when designing modern data platforms. The cloud can provide a boost to the different use cases that can be implemented on top of it, as you will read throughout the chapter.

We will start with a recap of why you might want to store raw, ungoverned data that only supports basic compute. Then, we discuss architecture design and implementation details in the cloud. Even though data lakes were originally intended only for basic data processing, it is now possible to democratize data access and reporting using just a data lake: integrations with other solutions through APIs and connectors make the data within a data lake much more fit for purpose. Finally, we will take a bird’s-eye view of a very common way to speed up analysis and experimentation with data within an organization: data science notebooks.

Data Lake and the Cloud—A Perfect Marriage

Data helps organizations make better decisions, faster. It’s the center of everything from applications to security, and more data means more need for processing power, which cloud solutions can provide.

Challenges with On-Premises Data Lakes

Organizations need a place to store all types of data, including unstructured data ...

Publisher Resources

ISBN: 9781098151607
