Architecting Data and Machine Learning Platforms

by Marco Tranquillin, Valliappa Lakshmanan, Firat Tekiner

October 2023

Intermediate to advanced

359 pages

10h 21m

English

O'Reilly Media, Inc.

Preface
Why Do You Need a Cloud Data Platform?; Who Is This Book For?; Organization of This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Modernizing Your Data Platform: An Introductory Overview
The Data Lifecycle; The Journey to Wisdom; Water Pipes Analogy; Collect; Store; Process/Transform; Analyze/Visualize; Activate; Limitations of Traditional Approaches; Antipattern: Breaking Down Silos Through ETL; Antipattern: Centralization of Control; Antipattern: Data Marts and Hadoop; Creating a Unified Analytics Platform; Cloud Instead of On-Premises; Drawbacks of Data Marts and Data Lakes; Convergence of DWHs and Data Lakes; Hybrid Cloud; Reasons Why Hybrid Is Necessary; Challenges of Hybrid Cloud; Why Hybrid Can Work; Edge Computing; Applying AI; Machine Learning; Uses of ML; Why Cloud for AI?; Cloud Infrastructure; Democratization; Real Time; MLOps; Core Principles; Summary
2. Strategic Steps to Innovate with Data
Step 1: Strategy and Planning; Strategic Goals; Identify Stakeholders; Change Management; Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach; Why Cloud Costs Less; How Much Are the Savings?; When Does Cloud Help?; Step 3: Break Down Silos; Unifying Data Access; Choosing Storage; Semantic Layer; Step 4: Make Decisions in Context Faster; Batch to Stream; Contextual Information; Cost Management; Step 5: Leapfrog with Packaged AI Solutions; Predictive Analytics; Understanding and Generating Unstructured Data; Personalization; Packaged Solutions; Step 6: Operationalize AI-Driven Workflows; Identifying the Right Balance of Automation and Assistance; Building a Data Culture; Populating Your Data Science Team; Step 7: Product Management for Data; Applying Product Management Principles to Data; 1. Understand and Maintain a Map of Data Flows in the Enterprise; 2. Identify Key Metrics; 3. Agreed Criteria, Committed Roadmap, and Visionary Backlog; 4. Build for the Customers You Have; 5. Don’t Shift the Burden of Change Management; 6. Interview Customers to Discover Their Data Needs; 7. Whiteboard and Prototype Extensively; 8. Build Only What Will Be Used Immediately; 9. Standardize Common Entities and KPIs; 10. Provide Self-Service Capabilities in Your Data Platform; Summary
3. Designing for Your Data Team
Classifying Data Processing Organizations; Data Analysis–Driven Organization; The Vision; The Personas; The Technological Framework; Data Engineering–Driven Organization; The Vision; The Personas; The Technological Framework; Data Science–Driven Organization; The Vision; The Personas; The Technological Framework; Summary
4. A Migration Framework
Modernize Data Workflows; Holistic View; Modernize Workflows; Transform the Workflow Itself; A Four-Step Migration Framework; Prepare and Discover; Assess and Plan; Execute; Optimize; Estimating the Overall Cost of the Solution; Audit of the Existing Infrastructure; Request for Information/Proposal and Quotation; Proof of Concept/Minimum Viable Product; Setting Up Security and Data Governance; Framework; Artifacts; Governance over the Life of the Data; Schema, Pipeline, and Data Migration; Schema Migration; Pipeline Migration; Data Migration; Migration Stages; Summary
5. Architecting a Data Lake
Data Lake and the Cloud—A Perfect Marriage; Challenges with On-Premises Data Lakes; Benefits of Cloud Data Lakes; Design and Implementation; Batch and Stream; Data Catalog; Hadoop Landscape; Cloud Data Lake Reference Architecture; Integrating the Data Lake: The Real Superpower; APIs to Extend the Lake; The Evolution of Data Lake with Apache Iceberg, Apache Hudi, and Delta Lake; Interactive Analytics with Notebooks; Democratizing Data Processing and Reporting; Build Trust in the Data; Data Ingestion Is Still an IT Matter; ML in the Data Lake; Training on Raw Data; Predicting in the Data Lake; Summary
6. Innovating with an Enterprise Data Warehouse
A Modern Data Platform; Organizational Goals; Technological Challenges; Technology Trends and Tools; Hub-and-Spoke Architecture; Data Ingest; Business Intelligence; Transformations; Organizational Structure; DWH to Enable Data Scientists; Query Interface; Storage API; ML Without Moving Your Data; Summary
7. Converging to a Lakehouse
The Need for a Unique Architecture; User Personas; Antipattern: Disconnected Systems; Antipattern: Duplicated Data; Converged Architecture; Two Forms; Lakehouse on Cloud Storage; SQL-First Lakehouse; The Benefits of Convergence; Summary
8. Architectures for Streaming
The Value of Streaming; Industry Use Cases; Streaming Use Cases; Streaming Ingest; Streaming ETL; Streaming ELT; Streaming Insert; Streaming from Edge Devices (IoT); Streaming Sinks; Real-Time Dashboards; Live Querying; Materialize Some Views; Stream Analytics; Time-Series Analytics; Clickstream Analytics; Anomaly Detection; Resilient Streaming; Continuous Intelligence Through ML; Training Model on Streaming Data; Streaming ML Inference; Automated Actions; Summary
9. Extending a Data Platform Using Hybrid and Edge
Why Multicloud?; A Single Cloud Is Simpler and Cost-Effective; Multicloud Is Inevitable; Multicloud Could Be Strategic; Multicloud Architectural Patterns; Single Pane of Glass; Write Once, Run Anywhere; Bursting from On Premises to Cloud; Pass-Through from On Premises to Cloud; Data Integration Through Streaming; Adopting Multicloud; Framework; Time Scale; Define a Target Multicloud Architecture; Why Edge Computing?; Bandwidth, Latency, and Patchy Connectivity; Use Cases; Benefits; Challenges; Edge Computing Architectural Patterns; Smart Devices; Smart Gateways; ML Activation; Adopting Edge Computing; The Initial Context; The Project; The Final Outcomes and Next Steps; Summary

10. AI Application Architecture
Is This an AI/ML Problem?; Subfields of AI; Generative AI; Problems Fit for ML; Buy, Adapt, or Build?; Data Considerations; When to Buy; What Can You Buy?; How Adapting Works; AI Architectures; Understanding Unstructured Data; Generating Unstructured Data; Predicting Outcomes; Forecasting Values; Anomaly Detection; Personalization; Automation; Responsible AI; AI Principles; ML Fairness; Explainability; Summary
11. Architecting an ML Platform
ML Activities; Developing ML Models; Labeling Environment; Development Environment; User Environment; Preparing Data; Training ML Models; Deploying ML Models; Deploying to an Endpoint; Evaluate Model; Hybrid and Multicloud; Training-Serving Skew; Automation; Automate Training and Deployment; Orchestration with Pipelines; Continuous Evaluation and Training; Choosing the ML Framework; Team Skills; Task Considerations; User-Centric; Summary
12. Data Platform Modernization: A Model Case
New Technology for a New Era; The Need for Change; It Is Not Only a Matter of Technology; The Beginning of the Journey; The Current Environment; The Target Environment; The PoC Use Case; The RFP Responses Proposed by Cloud Vendors; The Target Environment; The Approach on Migration; The RFP Evaluation Process; The Scope of the PoC; The Execution of the PoC; The Final Decision; Peroration; Summary
Index
About the Authors

Content preview from Architecting Data and Machine Learning Platforms

Chapter 9. Extending a Data Platform Using Hybrid and Edge

So far in this book, we have discussed how to plan, design, and implement a data platform using the capabilities of a public cloud. However, there are many situations where a single public cloud is not enough because the use case inherently requires data to originate, be processed, or be stored somewhere else: on premises, in multiple hyperscalers, or in connected intelligent devices such as smartphones or sensors. Situations like these raise a new challenge: how do you provide a holistic view of the platform so that users can effectively mix and join data spread across different places? In this chapter, you will learn the approaches, techniques, and architectural patterns that your organization can adopt when dealing with such distributed architectures.

There are also situations where your data must remain usable in a partially connected or disconnected environment. In this chapter, you will learn how to handle such situations with an approach called edge computing, which brings a portion of storage and compute resources out of the cloud and closer to whatever is generating or using the data.

Why Multicloud?

Your organization wants you, as a data leader, to continuously look for ways to boost business outcomes while minimizing the technology costs you incur. When it comes to data platforms, you are expected ...

Publisher Resources

ISBN: 9781098151607
