Architecting Data and Machine Learning Platforms

by Marco Tranquillin, Valliappa Lakshmanan, Firat Tekiner

October 2023

Intermediate to advanced

359 pages

10h 21m

English

O'Reilly Media, Inc.

Why Do You Need a Cloud Data Platform? | Who Is This Book For? | Organization of This Book | Conventions Used in This Book | Using Code Examples | O’Reilly Online Learning | How to Contact Us | Acknowledgments
The Data Lifecycle | The Journey to Wisdom | Water Pipes Analogy | Collect | Store | Process/Transform | Analyze/Visualize | Activate | Limitations of Traditional Approaches | Antipattern: Breaking Down Silos Through ETL | Antipattern: Centralization of Control | Antipattern: Data Marts and Hadoop | Creating a Unified Analytics Platform | Cloud Instead of On-Premises | Drawbacks of Data Marts and Data Lakes | Convergence of DWHs and Data Lakes | Hybrid Cloud | Reasons Why Hybrid Is Necessary | Challenges of Hybrid Cloud | Why Hybrid Can Work | Edge Computing | Applying AI | Machine Learning | Uses of ML | Why Cloud for AI? | Cloud Infrastructure | Democratization | Real Time | MLOps | Core Principles | Summary
Step 1: Strategy and Planning | Strategic Goals | Identify Stakeholders | Change Management | Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach | Why Cloud Costs Less | How Much Are the Savings? | When Does Cloud Help? | Step 3: Break Down Silos | Unifying Data Access | Choosing Storage | Semantic Layer | Step 4: Make Decisions in Context Faster | Batch to Stream | Contextual Information | Cost Management | Step 5: Leapfrog with Packaged AI Solutions | Predictive Analytics | Understanding and Generating Unstructured Data | Personalization | Packaged Solutions | Step 6: Operationalize AI-Driven Workflows | Identifying the Right Balance of Automation and Assistance | Building a Data Culture | Populating Your Data Science Team | Step 7: Product Management for Data | Applying Product Management Principles to Data | 1. Understand and Maintain a Map of Data Flows in the Enterprise | 2. Identify Key Metrics | 3. Agreed Criteria, Committed Roadmap, and Visionary Backlog | 4. Build for the Customers You Have | 5. Don’t Shift the Burden of Change Management | 6. Interview Customers to Discover Their Data Needs | 7. Whiteboard and Prototype Extensively | 8. Build Only What Will Be Used Immediately | 9. Standardize Common Entities and KPIs | 10. Provide Self-Service Capabilities in Your Data Platform | Summary
Classifying Data Processing Organizations | Data Analysis–Driven Organization | The Vision | The Personas | The Technological Framework | Data Engineering–Driven Organization | The Vision | The Personas | The Technological Framework | Data Science–Driven Organization | The Vision | The Personas | The Technological Framework | Summary
Modernize Data Workflows | Holistic View | Modernize Workflows | Transform the Workflow Itself | A Four-Step Migration Framework | Prepare and Discover | Assess and Plan | Execute | Optimize | Estimating the Overall Cost of the Solution | Audit of the Existing Infrastructure | Request for Information/Proposal and Quotation | Proof of Concept/Minimum Viable Product | Setting Up Security and Data Governance | Framework | Artifacts | Governance over the Life of the Data | Schema, Pipeline, and Data Migration | Schema Migration | Pipeline Migration | Data Migration | Migration Stages | Summary
Data Lake and the Cloud—A Perfect Marriage | Challenges with On-Premises Data Lakes | Benefits of Cloud Data Lakes | Design and Implementation | Batch and Stream | Data Catalog | Hadoop Landscape | Cloud Data Lake Reference Architecture | Integrating the Data Lake: The Real Superpower | APIs to Extend the Lake | The Evolution of Data Lake with Apache Iceberg, Apache Hudi, and Delta Lake | Interactive Analytics with Notebooks | Democratizing Data Processing and Reporting | Build Trust in the Data | Data Ingestion Is Still an IT Matter | ML in the Data Lake | Training on Raw Data | Predicting in the Data Lake | Summary
A Modern Data Platform | Organizational Goals | Technological Challenges | Technology Trends and Tools | Hub-and-Spoke Architecture | Data Ingest | Business Intelligence | Transformations | Organizational Structure | DWH to Enable Data Scientists | Query Interface | Storage API | ML Without Moving Your Data | Summary
The Need for a Unique Architecture | User Personas | Antipattern: Disconnected Systems | Antipattern: Duplicated Data | Converged Architecture | Two Forms | Lakehouse on Cloud Storage | SQL-First Lakehouse | The Benefits of Convergence | Summary
The Value of Streaming | Industry Use Cases | Streaming Use Cases | Streaming Ingest | Streaming ETL | Streaming ELT | Streaming Insert | Streaming from Edge Devices (IoT) | Streaming Sinks | Real-Time Dashboards | Live Querying | Materialize Some Views | Stream Analytics | Time-Series Analytics | Clickstream Analytics | Anomaly Detection | Resilient Streaming | Continuous Intelligence Through ML | Training Model on Streaming Data | Streaming ML Inference | Automated Actions | Summary
Why Multicloud? | A Single Cloud Is Simpler and Cost-Effective | Multicloud Is Inevitable | Multicloud Could Be Strategic | Multicloud Architectural Patterns | Single Pane of Glass | Write Once, Run Anywhere | Bursting from On Premises to Cloud | Pass-Through from On Premises to Cloud | Data Integration Through Streaming | Adopting Multicloud | Framework | Time Scale | Define a Target Multicloud Architecture | Why Edge Computing? | Bandwidth, Latency, and Patchy Connectivity | Use Cases | Benefits | Challenges | Edge Computing Architectural Patterns | Smart Devices | Smart Gateways | ML Activation | Adopting Edge Computing | The Initial Context | The Project | The Final Outcomes and Next Steps | Summary

Is This an AI/ML Problem? | Subfields of AI | Generative AI | Problems Fit for ML | Buy, Adapt, or Build? | Data Considerations | When to Buy | What Can You Buy? | How Adapting Works | AI Architectures | Understanding Unstructured Data | Generating Unstructured Data | Predicting Outcomes | Forecasting Values | Anomaly Detection | Personalization | Automation | Responsible AI | AI Principles | ML Fairness | Explainability | Summary
ML Activities | Developing ML Models | Labeling Environment | Development Environment | User Environment | Preparing Data | Training ML Models | Deploying ML Models | Deploying to an Endpoint | Evaluate Model | Hybrid and Multicloud | Training-Serving Skew | Automation | Automate Training and Deployment | Orchestration with Pipelines | Continuous Evaluation and Training | Choosing the ML Framework | Team Skills | Task Considerations | User-Centric | Summary
New Technology for a New Era | The Need for Change | It Is Not Only a Matter of Technology | The Beginning of the Journey | The Current Environment | The Target Environment | The PoC Use Case | The RFP Responses Proposed by Cloud Vendors | The Target Environment | The Approach on Migration | The RFP Evaluation Process | The Scope of the PoC | The Execution of the PoC | The Final Decision | Peroration | Summary

Content preview from Architecting Data and Machine Learning Platforms

Chapter 11. Architecting an ML Platform

In the previous chapter, we discussed the overall architecture of ML applications and noted that in many cases you will use prebuilt ML models. In some cases, however, your team will have to develop the ML model that is at the core of the ML application.

In this chapter, you will delve into the development and deployment of such custom ML models. You will look at the stages in the development of ML models and the frameworks that support such development. Once a model has been created, you will need to automate its training, so you will look at tools and products that can help you make this transition. Finally, you will need to monitor the trained models that you have deployed to endpoints to see whether they are drifting when making inferences.
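
To make the idea of drift concrete, here is a minimal sketch of one common way to quantify it: the population stability index (PSI), which compares the distribution of a feature or prediction score at training time with the distribution seen at the endpoint. This preview does not say which metric the chapter uses, so treat the choice of PSI, the function name `population_stability_index`, the bin count, and the thresholds mentioned afterward as assumptions made purely for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Hypothetical helper: PSI between training-time and serving-time values."""
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")

    # Bin edges come from the baseline so both samples share the same buckets.
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped to the edge bins.
            idx = max(0, min(bins - 1, int((v - lo) / width)))
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected = fractions(baseline)
    actual = fractions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Toy data (not from the book): scores seen at training time vs. at the endpoint.
training_scores = [i / 100 for i in range(100)]
shifted_scores = [0.5 + i / 200 for i in range(100)]

print(population_stability_index(training_scores, training_scores))
print(population_stability_index(training_scores, shifted_scores))
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and a PSI above 0.25 as a significant shift, but these cutoffs are conventions rather than guarantees, and a managed monitoring service may compute drift differently.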

In earlier chapters, we discussed ML capabilities that are enabled by various parts of the data platform. Specifically, the data storage for your ML platform can be in the data lake (Chapter 5) or DWH (Chapter 6), the training is carried out on compute that is efficient for that storage, and the inference can be invoked from a streaming pipeline (Chapter 8) or deployed to the edge (Chapter 9). In this chapter, we will pull all of these discussions together and consider what goes into these ML capabilities.

ML Activities

If you are building an ML platform to support custom ML model development, what activities do you need to support? Too often, we see architects jump straight to the ML framework ...