Chapter 1. Modernizing Your Data Platform: An Introductory Overview
Data is a valuable asset that can help your company make better decisions, identify new opportunities, and improve operations. In 2013, Google undertook a strategic project to increase employee retention by improving manager quality. Even something as loosey-goosey as manager skill could be studied in a data-driven manner. Google was able to improve management favorability from 83% to 88% by analyzing 10K performance reviews, identifying common behaviors of high-performing managers, and creating training programs. Another example of a strategic data project was carried out at Amazon. The ecommerce giant implemented a recommendation system based on customer behavior that drove 35% of purchases in 2017. The Warriors, a San Francisco basketball team, are yet another example; they enacted an analytics program that helped catapult them to the top of their league. All these—employee retention, product recommendations, improving win rates—are examples of business goals that were achieved by modern data analytics.
To become a data-driven company, you need to build an ecosystem for data analytics, processing, and insights. This is because there are many different types of applications (websites, dashboards, mobile apps, ML models, distributed devices, etc.) that create and consume data. There are also many different departments within your company (finance, sales, marketing, operations, logistics, etc.) that need data-driven insights. Because the entire company is your customer base, building a data platform is more than just an IT project.
This chapter introduces data platforms, their requirements, and why traditional data architectures prove insufficient. It also discusses technology trends in data analytics and AI, and how to build data platforms for the future using the public cloud. This chapter is a general overview of the core topics covered in more detail in the rest of the book.
The Data Lifecycle
The purpose of a data platform is to support the steps that organizations need to carry out to move from raw data to insightful information. It is helpful to understand the steps of the data lifecycle (collect, store, process, visualize, activate) because they can be mapped almost as-is to a data architecture to create a unified analytics platform.
The Journey to Wisdom
Data helps companies to develop smarter products, reach more customers, and increase their return on investment (ROI). Data can also be leveraged to measure customer satisfaction, profitability, and cost. But the data by itself is not enough. Data is raw material that needs to pass through a series of stages before it can be used to generate insights and knowledge. This sequence of stages is what we call a data lifecycle. There are many definitions available in the literature, but from a general point of view, we can identify five main stages in modern data platform architecture:
1. Collect: Data has to be acquired and injected into the target systems (e.g., manual data entry, batch loading, streaming ingestion, etc.).
2. Store: Data needs to be persisted in a durable fashion with the ability to easily access it in the future (e.g., file storage system, database).
3. Process/transform: Data has to be manipulated to make it useful for subsequent steps (e.g., cleansing, wrangling, transforming).
4. Analyze/visualize: Data needs to be studied to derive business insights via manual elaboration (e.g., queries, slice and dice) or automatic processing (e.g., enrichment using ML application programming interfaces—APIs).
5. Activate: Surfacing the data insights in a form and place where decisions can be made (e.g., notifications that act as a trigger for specific manual actions, automatic job executions when specific conditions are met, ML models that send feedback to devices).
Each of these stages feeds into the next, similar to the flow of water through a set of pipes.
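To make the flow concrete, here is a minimal sketch of the five stages chained together as plain Python functions. Everything in it (the in-memory CSV feed, the file name, the revenue threshold) is hypothetical; a real platform would substitute managed services for each step.

```python
# A minimal sketch of the five lifecycle stages as composable steps.
# Function names, the CSV feed, and the threshold are illustrative only.
import csv

def collect() -> str:
    # Ingest raw data from a source system (an in-memory CSV stands in for a feed).
    return "order_id,amount\n1,120.50\n2,89.99\n3,430.00\n"

def store(raw: str) -> str:
    # Persist the raw data unchanged so it can be reprocessed later.
    with open("raw_orders.csv", "w") as f:
        f.write(raw)
    return "raw_orders.csv"

def process(path: str) -> list[dict]:
    # Transform raw records into a cleaned, typed structure.
    with open(path) as f:
        return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])}
                for r in csv.DictReader(f)]

def analyze(rows: list[dict]) -> float:
    # Derive an insight: total revenue in this batch.
    return sum(r["amount"] for r in rows)

def activate(total: float) -> None:
    # Act on the insight, e.g., trigger an alert when a threshold is crossed.
    if total > 500:
        print(f"ALERT: batch revenue {total:.2f} exceeded threshold")

activate(analyze(process(store(collect()))))
```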
Water Pipes Analogy
To understand the data lifecycle better, think of it as a simplified water pipe system. The water starts at an aqueduct and is then transferred and transformed through a series of pipes until it reaches a group of houses. The data lifecycle is similar, with data being collected, stored, processed/transformed, and analyzed before it is used to make decisions (see Figure 1-1).
You can see some similarities between the plumbing world and the data world. Plumbing engineers are like data engineers, who design and build the systems that make data usable. People who analyze water samples are like data analysts and data scientists, who analyze data to find insights. Of course, this is just a simplification. There are many other roles in a company that use data, like executives, developers, business users, and security administrators. But this analogy can help you remember the main concepts.
In the canonical data lifecycle, shown in Figure 1-2, data engineers collect and store data in an analytics store. The stored data is then processed using a variety of tools. If the tools involve programming, the processing is typically done by data engineers. If the tools are declarative, the processing is typically done by data analysts. The processed data is then analyzed by business users and data scientists. Business users use the insights to make decisions, such as launching marketing campaigns or issuing refunds. Data scientists use the data to train ML models, which can be used to automate tasks or make predictions.
The real world may differ from the preceding idealized description of how a modern data platform architecture and roles should work. The stages may be combined (e.g., storage and processing) or reordered (e.g., processing before storage, as in ETL [extract-transform-load], rather than storage before processing, as in ELT [extract-load-transform]). However, there are trade-offs to such variations. For example, combining storage and processing into a single stage leads to coupling that results in wasted resources (if data sizes grow, you’ll need to scale both storage and compute) and scalability issues (if your infrastructure can’t handle the extra load, you’ll be stuck).
Now that we have defined the data lifecycle and summarized the various stages of the data journey from raw data collection to activation, let us go through each of the five stages of the data lifecycle in turn.
Collect
The first step in the design process is ingestion. Ingestion is the process of transferring data from a source, which could be anywhere (on premises, on devices, in another cloud, etc.), to a target system where it can be stored for further analysis. This is the first opportunity to consider the 3Vs of big data:
- Volume: What is the size of the data? When dealing with big data, this usually means terabytes (TB) or petabytes (PB) of data.
- Velocity: What is the speed of the data coming in? Generally this is megabytes per second (MB/s) or terabytes per day (TB/day). This is often termed the throughput.
- Variety: What is the format of the data? Tables, flat files, images, sound, text, etc.
Identify the data type (structured, semistructured, unstructured), format, and generation frequency (continuously or at specific intervals) of the data to be collected. Based on the velocity of the data and the capability of the data platform to handle the resulting volume and variety, choose between batch ingestion, streaming ingestion, or a hybrid of the two.
As different parts of the organization may be interested in different data sources, design this stage to be as flexible as possible. There are several commercial and open source solutions that can be used, each specialized for a specific data type/approach mentioned earlier. Your data platform will need to be comprehensive and support the full range of volume, velocity, and variety required for all the data that needs to be ingested into the platform. You could have simple tools that transfer files between File Transfer Protocol (FTP) servers on regular intervals, or you could have complex systems, even geographically distributed, that collect data from IoT devices in real time.
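As a rough illustration of the batch-versus-streaming choice, the following sketch contrasts the two ingestion styles using only the standard library. The landing directory and the event generator are hypothetical stand-ins for an FTP drop zone and a message queue.

```python
# A minimal sketch contrasting batch and streaming ingestion.
# The landing directory and the event generator are hypothetical stand-ins
# for an FTP drop zone and a message queue, respectively.
import json
import pathlib
import time
from typing import Iterator

LANDING_DIR = pathlib.Path("landing_zone")

def batch_ingest() -> list[dict]:
    """Run on a schedule (e.g., nightly): pick up whole files at rest."""
    records = []
    for path in sorted(LANDING_DIR.glob("*.jsonl")):
        with path.open() as f:
            records.extend(json.loads(line) for line in f)
    return records

def event_source() -> Iterator[dict]:
    """Stand-in for a streaming source such as a message topic or IoT gateway."""
    for i in range(5):
        yield {"device_id": i, "temperature": 20.0 + i}
        time.sleep(0.1)  # events arrive continuously, not as a daily dump

def streaming_ingest() -> None:
    """Process each event as it arrives, keeping end-to-end latency low."""
    for event in event_source():
        print("ingested event:", event)

if __name__ == "__main__":
    streaming_ingest()
```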
Store
In this step, store the raw data you collected in the previous step. You don’t change the data at all; you just store it. This is important because you might want to reprocess the data in a different way later, and you need to have the original data to do that.
Data comes in many different forms and sizes. The way you store it will depend on your technical and commercial needs. Some common options include object storage systems, relational database management systems (RDBMSs), data warehouses (DWHs), and data lakes. Your choice will be driven to some extent by whether the underlying hardware, software, and artifacts are able to cope with the scalability, cost, availability, durability, and openness requirements imposed by your desired use cases.
Scalability
Scalability is the ability to grow and manage increased demands in a capable manner. There are two main ways to achieve scalability:
- Vertical scalability: This involves adding extra expansion units to the same node to increase the storage system’s capacity.
- Horizontal scalability: This involves adding one or more additional nodes instead of adding new expansion units to a single node. This type of distributed storage is more complex to manage, but it can achieve improved performance and efficiency.
It is extremely important that the underlying system can cope with the volume and velocity required by modern solutions, which operate in an environment where data is exploding and its nature is transitioning from batch to real time. We live in a world where most people continuously generate and request information through their smart devices, and organizations need to provide their users (both internal and external) with solutions that respond in real time.
Performance versus cost
Identify the different types of data you need to manage, and create a hierarchy based on the business importance of the data, how often it will be accessed, and what kind of latency the users of the data will expect.
Store the most important and most frequently accessed data (hot data) in a high-performance storage system such as a data warehouse’s native storage. Store less important data (cold data) in a less expensive storage system such as cloud storage (which itself has several tiers). If you need even higher performance, such as for interactive use cases, you can use caching techniques to load a meaningful portion of your hot data into a volatile storage tier.
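As one possible way to automate this tiering, the sketch below sets an object lifecycle rule with boto3 so that objects under a prefix move to a colder storage class after 90 days. The bucket name, prefix, and thresholds are made up; equivalent lifecycle features exist on the other clouds.

```python
# A sketch of tiering colder data to cheaper storage with an S3 lifecycle rule.
# Bucket name, prefix, and day thresholds are illustrative; adjust them to your
# own access patterns and your cloud provider's storage classes.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-raw-data",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-cold-events",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                # After 90 days the data is rarely queried: move it to a colder tier.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # After ~7 years, delete it entirely (e.g., per retention policy).
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```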
High availability
High availability means having the ability to be operational and deliver access to the data when requested. It is usually achieved via hardware redundancy to cope with possible physical failures or outages. In the cloud, this means storing the data in at least three availability zones. Zones may not be physically separated (i.e., they may be on the same “campus”) but will tend to have different power sources, etc. Availability is usually quantified as system uptime, and modern systems usually come with four 9s (99.99%) or more.
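To put the “nines” in perspective, this small calculation shows the yearly downtime budget implied by an availability target:

```python
# Downtime budget implied by an availability target ("number of nines").
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(nines: int) -> float:
    availability = 1 - 10 ** (-nines)   # e.g., 4 nines -> 0.9999
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines: about {downtime_minutes_per_year(n):.1f} minutes of downtime per year")
# 3 nines: ~526 minutes; 4 nines: ~52.6 minutes; 5 nines: ~5.3 minutes
```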
Durability
Durability is the ability to store data for a long-term period without suffering data degradation, corruption, or outright loss. This is usually achieved through storing multiple copies of the data in physically separate locations. Such data redundancy is implemented in the cloud by storing the data in at least two regions (e.g., in both London and Frankfurt). This is extremely important when dealing with data restore operations in the face of natural disasters: if the underlying storage system has a high durability (modern systems usually come with 11 9s), then all of the data can be restored with no issues unless a cataclysmic event takes down even the physically separated data centers.
Openness
As far as possible, use formats that are not proprietary and that do not generate lock-in. Ideally, it should be possible to query data with a choice of processing engines without generating copies of the data or having to move it from one system to another. That said, it is acceptable to use systems that use a proprietary or native storage format as long as they provide an easy export capability.
As with most technology decisions, openness is a trade-off, and the ROI of a proprietary technology may be high enough that you are willing to pay the price of lock-in. After all, one of the reasons to go to the cloud is to reduce operational costs—these cost advantages tend to be higher in fully managed/serverless systems than on managed open source systems. For example, if your data use case requires transactions, Databricks (which uses a quasi-open storage format based on Parquet called Delta Lake) might involve lower operating costs than Amazon EMR or Google Dataproc (which will store data in standard Parquet on S3 or Google Cloud Storage [GCS] respectively)—the ACID (Atomicity, Consistency, Isolation, Durability) transactions that Databricks provides in Delta Lake will be expensive to implement and maintain on EMR or Dataproc. If you ever need to migrate away from Databricks, export the data into standard Parquet. Openness, per se, is not a reason to reject technology that is a better fit.
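For instance, an export along the lines just described might look like the following sketch, assuming a Spark session that is already configured with the Delta Lake connector; the table paths and the partition column are hypothetical.

```python
# A sketch of exporting a Delta table to standard Parquet to avoid lock-in.
# Assumes a SparkSession already configured with the Delta Lake connector;
# paths and the partition column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-export").getOrCreate()

df = spark.read.format("delta").load("s3://lakehouse/sales/transactions")
(df.write
   .mode("overwrite")
   .partitionBy("transaction_date")     # keep a useful physical layout
   .parquet("s3://open-format-export/sales/transactions"))
```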
Process/Transform
Here’s where the magic happens: raw data is transformed into useful information for further analysis. This is the stage where data engineers build data pipelines to make data accessible to a wider audience of nontechnical users in a meaningful way. This stage consists of activities that prepare data for analysis and use. Data integration involves combining data from multiple sources into a single view. Data cleansing may be needed to remove duplicates and errors from data. More generally, data wrangling, munging, and transformation are carried out to organize the data into a standard format.
There are several frameworks that can be used, each with its own capabilities that depend on the storage method you selected in the previous step. In general, engines that allow you to query and transform your data using pure SQL commands (e.g., AWS Athena, Google BigQuery, Azure DWH, and Snowflake) are the most efficient, cost effective,1 and easy to use. However, the capabilities they offer are limited in comparison to engines based on modern programming languages, usually Java, Scala, or Python (e.g., Apache Spark, Apache Flink, or Apache Beam running on Amazon EMR, Google Cloud Dataproc/Dataflow, Azure HDInsight, and Databricks). Code-based data processing engines allow you not only to implement more complex transformations and ML in batch and in real time but also to leverage other important features such as proper unit and integration tests.
Another consideration in choosing an appropriate engine is that SQL skills are typically much more prevalent in an organization than programming skills. The more of a data culture you want to build within your organization, the more you should lean toward SQL for data processing. This is particularly important if the processing steps (such as data cleansing or transformation) require domain knowledge.
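To illustrate the trade-off, the sketch below expresses the same simple cleansing step first as SQL and then as equivalent DataFrame code. The table and column names are invented; the point is only that the SQL version is more widely readable, while the code version is easier to unit test and compose.

```python
# The same cleansing/transformation step expressed two ways.
# Table and column names are made up for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleanse").getOrCreate()

# Declarative SQL: accessible to anyone who knows SQL.
deduped_sql = spark.sql("""
    SELECT customer_id,
           LOWER(TRIM(email)) AS email,
           MAX(last_seen)     AS last_seen
    FROM raw.customers
    WHERE email IS NOT NULL
    GROUP BY customer_id, LOWER(TRIM(email))
""")

# Equivalent DataFrame code: more verbose, but unit-testable and composable.
raw = spark.table("raw.customers")
deduped_df = (raw
    .where(F.col("email").isNotNull())
    .withColumn("email", F.lower(F.trim("email")))
    .groupBy("customer_id", "email")
    .agg(F.max("last_seen").alias("last_seen")))
```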
This stage may also employ data virtualization solutions that abstract multiple data sources, and related logic to manage them, to make information directly available to the final users for analysis. We will not discuss virtualization further in this book, as it tends to be a stopgap solution en route to building a fully flexible platform. For more information about data virtualization, we suggest Chapter 10 of the book The Self-Service Data Roadmap by Sandeep Uttamchandani (O’Reilly).
Analyze/Visualize
Once you arrive at this stage, the data starts finally to have value in and of itself—you can consider it information. Users can leverage a multitude of tools to dive into the content of the data to extract useful insights, identify current trends, and predict new outcomes. At this stage, visualization tools and techniques that allow users to represent information and data in a graphical way (e.g., charts, graphs, maps, heat maps, etc.) play an important role because they provide an easy way to discover and evaluate trends, outliers, patterns, and behavior.
Visualization and analysis of data can be performed by several types of users. On one hand are people who are interested in understanding business data and want to leverage graphical tools to perform common analyses like slice-and-dice, roll-ups, and what-if analysis. On the other hand, there could be more advanced users (“power users”) who want to leverage the power of a query language like SQL to execute more fine-grained and tailored analysis. In addition, there might be data scientists who can leverage ML techniques to implement new ways to extract meaningful insights from the data, discover patterns and correlations, improve customer understanding and targeting, and ultimately increase a business’s revenue, growth, and market position.
Activate
This is the step where end users are able to make decisions based on data analysis and ML predictions, thus enabling data-driven decision making. Based on the insights extracted or predicted from the available information, it is time to take action.
The actions that can be carried out fall into three categories:
- Automatic actions: Automated systems can use the results of a recommendation system to provide customized recommendations to customers. This can help the business’s top line by increasing sales.
- SaaS integrations: Actions can be performed by integrating with third-party services. For instance, a company might implement a marketing campaign to try to reduce customer churn. They could analyze data and implement a propensity model to identify customers who are likely to respond positively to a new commercial offer. The list of customer email addresses can then be sent automatically to a marketing tool to activate the campaign.
- Alerting: You can create applications that monitor data in real time and send out personalized messages when certain conditions are met. For instance, the pricing team may receive proactive notifications when the traffic to an item listing page exceeds a certain threshold, allowing them to check whether the item is priced correctly.
The technology stack for these three scenarios is different. For automatic actions, the “training” of the ML model is carried out periodically, usually by scheduling an end-to-end ML pipeline (this will be covered in Chapter 11). The predictions themselves are achieved by invoking the ML model deployed as a web service using tools like AWS SageMaker, Google Cloud Vertex AI, or Azure Machine Learning. SaaS integrations are often carried out in the context of function-specific workflow tools that allow a human to control what information is retrieved, how it is transformed, and the way it is activated. In addition, using large language models (LLMs) and their generative capabilities (we will dig more into those concepts in Chapter 10) can help automate repetitive tasks by closely integrating with core systems. Alerts are implemented through orchestration tools such as Apache Airflow, event systems such as Google Eventarc, or serverless functions such as AWS Lambda.
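As a rough sketch of the alerting path, the following function could run as a serverless function or a scheduled job: it checks a traffic metric against a threshold and posts a message to a chat webhook. The threshold, item ID, and webhook URL are hypothetical.

```python
# A sketch of the alerting path: a small function (deployable as a serverless
# function or scheduled task) that checks a metric and notifies the pricing team.
# The threshold, item ID, and webhook URL are hypothetical.
import json
import urllib.request

TRAFFIC_THRESHOLD = 10_000
WEBHOOK_URL = "https://chat.example.com/hooks/pricing-team"

def check_listing_traffic(item_id: str, page_views_last_hour: int) -> None:
    if page_views_last_hour <= TRAFFIC_THRESHOLD:
        return
    message = {
        "text": f"Item {item_id} had {page_views_last_hour} views in the last hour. "
                f"Please verify that it is priced correctly."
    }
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # POST because a body is provided

check_listing_traffic("SKU-42", page_views_last_hour=18_500)
```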
In this section, we have seen the activities that a modern data platform needs to support. Next, let’s examine traditional approaches in implementing analytics and AI platforms to have a better understanding of how technology evolved and why the cloud approach can make a big difference.
Limitations of Traditional Approaches
Traditionally, organizations’ data ecosystems consist of independent solutions that are used to provide different data services. Unfortunately, such task-specific data stores, which can grow quite large, often lead to the creation of silos within an organization. The resulting siloed systems operate as independent solutions that do not work together efficiently. Siloed data is silenced data—it’s data from which insights are difficult to derive. To broaden and unify enterprise intelligence, securely sharing data across business units is critical.
If the majority of solutions are custom built, it becomes difficult to handle scalability, business continuity, and disaster recovery (DR). If each part of the organization chooses a different environment to build their solution in, the complexity becomes overwhelming. In such a scenario, it is difficult to ensure privacy or to audit changes to data.
One solution is to develop a unified data platform and, more precisely, a cloud data platform (please note that unified does not necessarily imply centralized, as will be discussed shortly). The purpose of the data platform is to allow analytics and ML to be carried out over all of an organization’s data in a consistent, scalable, and reliable way. When doing that, you should leverage, to the maximum extent possible, managed services so that the organization can focus on business needs instead of operating infrastructure. Infrastructure operations and maintenance should be delegated totally to the underlying cloud platform. In this book, we will cover the core decisions that you need to make when developing a unified platform to consolidate data across business units in a scalable and reliable environment.
Antipattern: Breaking Down Silos Through ETL
It is challenging for organizations to have a unified view of their data because they tend to have a multitude of solutions for managing it. Organizations typically solve this problem by using data movement tools. ETL applications allow data to be transformed and transferred between different systems to create a single source of truth. However, relying on ETL is problematic, and there are better solutions available in modern platforms.
Often, an ETL tool is created to extract the most recent transactions from a transactional database on a regular basis and store them in an analytics store for access by dashboards. This approach is then standardized: ETL tools are created for every database table that is required for analytics, so that analytics can be performed without having to go to the source system each time (see Figure 1-3).
The central analytics store that captures all the data across the organization is referred to as either a DWH or a data lake depending on the technology being used. A high-level distinction between the two approaches is based on the way the data is stored within the system: if the analytics store supports SQL and contains governed, quality-controlled data, it is referred to as a DWH. If instead it supports tools from the Apache ecosystem (such as Apache Spark) and contains raw data, it is referred to as a data lake. Terminology for referring to in-between analytics stores (such as governed raw data or ungoverned quality-controlled data) varies from organization to organization—some organizations call them data lakes and others call them DWH. As you will see later in the book, this confusing vocabulary is not a problem because data lake (Chapter 5) and DWH (Chapter 6) approaches are converging into what is known as data lakehouse (Chapter 7).
There are a few drawbacks to relying on data movement tools to try building a consistent view of the data:
- Data quality: ETL tools are often written by consumers of the data, who tend to not understand it as well as the owners of the data. This means that, very often, the data that is extracted is not the right data.
- Latency: ETL tools introduce latency. For example, if the ETL tool to extract recent transactions runs once an hour and takes 15 minutes to run, the data in the analytics store could be stale by up to 75 minutes. This problem can be addressed by streaming ETL, where events are processed as they happen.
- Bottleneck: ETL tools typically involve programming skills. Therefore, organizations set up bespoke data engineering teams to write the code for ETL. As the diversity of data within an organization increases, an ever-increasing number of ETL tools need to be written. The data engineering team becomes a bottleneck on the ability of the organization to take advantage of data.
- Maintenance: ETL tools need to be routinely run and troubleshot by system administrators. The underlying infrastructure system needs to be continually updated to cope with increased compute and storage capacity and to guarantee reliability.
- Change management: Changes in the schema of the input table require the extract code of the ETL tool to be changed. This either makes changes hard to do or results in the ETL tool being broken by upstream changes.
- Data gaps: It is very likely that many errors have to be escalated to the owners of the data, the creators of the ETL tool, or the users of the data. This adds to maintenance overhead, and very often to tool downtime. There are quite frequently large gaps in the data record because of this.
- Governance: As ETL processes proliferate, it becomes increasingly likely that the same processing is carried out by different processes, leading to multiple sources of the same information. It’s common for the processes to diverge over time to meet different needs, leading to inconsistent data being used for different decisions.
- Efficiency and environmental impact: The underlying infrastructure that supports these types of transformations is a concern, as it typically operates 24/7, incurring significant costs and increasing carbon footprint impact.
The first point in the preceding list (data quality) is often overlooked, but it tends to be the most important over time. Often you need to preprocess the data before it can be “trusted” to be made available in production. Data coming from upstream systems is generally considered to be raw, and it may contain noise or even bad information if it is not properly cleaned and transformed. For example, ecommerce web logs may need to be transformed before use, such as by extracting product codes from URLs or filtering out false transactions made by bots. Data processing tools must be built specifically for the task at hand. There is no global data quality solution or common framework for dealing with quality issues.
While this situation is reasonable when considering one data source at a time, the total collection (see Figure 1-4) leads to chaos.
The proliferation of storage systems, together with tailor-made data management solutions developed to satisfy the desires of different downstream applications, brings about a situation where analytics leaders and chief information officers (CIOs) face the following challenges:
- Their DWH/data lake is unable to keep up with the ever-growing business needs.
- Increasingly, digital initiatives (and competition with digital natives) have transformed the business into one where massive data volumes are flooding the system.
- The need to create separate data lakes, DWHs, and special storage for different data science tasks ends up creating multiple data silos.
- Data access needs to be limited or restricted due to performance, security, and governance challenges.
- Renewing licenses and paying for expensive support resources become challenging.
It is evident that this approach cannot be scaled to meet the new business requirements, not only because of the technological complexity but also because of the security and governance requirements that this model entails.
Antipattern: Centralization of Control
To try to address the problem of having siloed, spread, and distributed data managed via task-specific data processing solutions, some organizations have tried to centralize everything in a single, monolithic platform under the control of the IT department. As shown in Figure 1-5, the underlying technology solution doesn’t change—instead, the problems are made more tractable by assigning them to a single organization to solve.
Such centralized control by a unique department comes with its own challenges and trade-offs. All business units (BUs)—IT itself, data analytics, and business users—struggle when IT controls all data systems:
- IT: The challenge that IT departments face is the diverse set of technologies involved in these data silos. IT departments rarely have all the skills necessary to manage all of these systems. The data sits across multiple storage systems on premises and across clouds, making it costly to manage DWHs, data lakes, and data marts. It is also not always clear how to define security, governance, auditing, etc., across different sources. Moreover, centralization introduces a scalability problem in granting access to the data: the amount of work IT needs to carry out increases linearly with the number of source and target systems involved, because each new system increases the number of data access requests from stakeholders and business users.
- Analytics: One of the main problems hindering effective analytics processes is not having access to the right data. When multiple systems exist, moving data to and from one monolithic data system becomes costly, resulting in unnecessary ETL tasks. In addition, the preprepared and readily available data might not come from the most recent sources, or there might be other versions of the data that provide more depth and broader information, such as more columns or more granular records. It is impossible to give your analytics team free rein whereby everyone can access all data, due to data governance and operational issues. Organizations often end up limiting data access at the expense of analytic agility.
- Business: Getting access to data and analytics that your business can trust is difficult. There are issues around limiting the data you give the business so you can ensure the highest quality. The alternative approach is to open up access to all the data the business users need, even if that means sacrificing quality. The challenge then becomes a balancing act between the quality of the data and the amount of trusted data made available. It is often the case that IT does not have enough qualified business representatives to drive priorities and requirements. This can quickly become a bottleneck slowing down the innovation process within the organization.
Despite these challenges, many organizations have adopted this approach over the years, in some cases creating frustration and tension for business users who were delayed in getting access to the data they needed to fulfill their tasks. Frustrated business units often cope through another antipattern—that is, shadow IT—where entire departments develop and deploy useful solutions to work around such limitations but end up making the problem of siloed data worse.
A technical approach called data fabric is sometimes employed. This still relies on centralization, but instead of physically moving data, the data fabric is a virtual layer to provide unified data access. The problem is that such standardization can be a heavy burden and introduce delays for organization-wide access to data. The data fabric is, however, a viable approach for SaaS products trying to access customers’ proprietary data—integration specialists provide the necessary translation from customers’ schema to the schema expected by the SaaS tool.
Antipattern: Data Marts and Hadoop
The challenges around a siloed centrally managed system created huge tension and overhead for IT. To resolve this, some businesses adopted two other antipatterns: data marts and ungoverned data lakes.
In the first approach, data was extracted to on-premises relational and analytical databases. However, despite being called data warehouses, these products were, in practice, data marts (a subset of enterprise data suited to specific workloads) due to scalability constraints. Data marts allow business users to design and deploy their own business data into structured data models (e.g., in retail, healthcare, banking, insurance, etc.). This enables them to easily get information about the current and the historical business (e.g., the amount of revenue of the last quarter, the number of users who played your last published game in the last week, the correlation between the time spent on the help center of your website and the number of tickets received in the last six months, etc.). For many decades, organizations have been developing data mart solutions using a variety of technologies (e.g., Oracle, Teradata, Vertica) and implementing multiple applications on top of them. However, these on-premises technologies are severely limited in terms of capacity. IT teams and data stakeholders face the challenges of scaling infrastructure (vertically), finding critical talent, reducing costs, and ultimately meeting the growing expectation of delivering valuable insights. Moreover, these solutions tended to be costly because as data sizes grew, you needed to get a system with more compute to process it.
Due to scalability and cost issues, big data solutions based on the Apache Hadoop ecosystem were created. Hadoop introduced distributed data processing (horizontal scaling) using low-cost commodity servers, enabling use cases that were previously only possible with high-end (and very costly) specialized hardware. Every application running on top of Hadoop was designed to tolerate node failures, making it a cost-effective alternative to some traditional DWH workloads. This led to the development of a new concept called data lake, which quickly became a core pillar of data management alongside the DWH.
The idea was that while core operational technology divisions carried on with their routine tasks, all data was exported for analytics into a centralized data lake. The intent was for the data lake to serve as the central repository for analytics workloads and for business users. Data lakes have evolved from being mere storage facilities for raw data to platforms that enable advanced analytics and data science on large volumes of data. This enabled self-service analytics across the organization, but it required an extensive working knowledge of advanced Hadoop and engineering processes to access the data. The Hadoop Open Source Software (Hadoop OSS) ecosystem grew in terms of data systems and processing frameworks (HBase, Hive, Spark, Pig, Presto, SparkML, and more) in parallel to the exponential growth in organizations’ data, but this led to additional complexity and cost of maintenance. Moreover, data lakes became an ungoverned mess of data that few potential users of the data could understand. The combination of a skills gap and data quality issues meant that enterprises struggled to get good ROI out of data lakes on premises.
Now that you have seen several antipatterns, let’s focus on how you could design a data platform that provides a unified view of the data across its entire lifecycle.
Creating a Unified Analytics Platform
Data mart and data lake technologies enabled IT to build the first iteration of a data platform to break down data silos and to enable the organization to derive insights from all their data assets. The data platform enabled data analysts, data engineers, data scientists, business users, architects, and security engineers to derive better real-time insights and predict how their business will evolve over time.
Cloud Instead of On-Premises
DWH and data lakes are at the core of modern data platforms. DWHs support structured data and SQL, whereas data lakes support raw data and programming frameworks in the Apache ecosystem.
However, running DWH and data lakes in an on-premises environment has some inherent challenges, such as scaling and operational costs. This has led organizations to reconsider their approach and to start considering the cloud (especially the public version of it) as the preferred environment for such a platform. Why? Because it allowed them to:
- Reduce cost by taking advantage of new pricing models (pay-per-use model)
- Speed up innovation by taking advantage of best-of-breed technologies
- Scale on-premises resources using a “bursting” approach
- Plan for business continuity and disaster recovery by storing data in multiple zones and regions
- Manage disaster recovery automatically using fully managed services
When users are no longer constrained by the capacity of their infrastructure, organizations are able to democratize data across their organization and unlock insights. The cloud supports organizations in their modernization efforts, as it minimizes the toil and friction by offloading the administrative, low-value tasks. A cloud data platform promises an environment where you no longer have to compromise and can build a comprehensive data ecosystem that covers the end-to-end data management and data processing stages from data collection to serving. And you can use your cloud data platform to store vast amounts of data in varying formats without compromising on latency.
Cloud data platforms promise:
- Centralized governance and access management
- Increased productivity and reduced operational costs
- Greater data sharing across the organization
- Extended access by different personas
- Reduced latency of accessing data
In the public cloud environment, the lines between DWH and data lake technologies are blurring because cloud infrastructure (specifically, the separation of compute and storage) enables a convergence that was impossible in the on-premises environment. Today it is possible to apply SQL to data held in a data lake, and it’s possible to run what is traditionally a Hadoop technology (e.g., Spark) against data stored in a DWH. In this section we will give you an introduction to how this convergence works and how it can be the basis for brand-new approaches that can revolutionize the way organizations are looking at the data; you’ll get more details in Chapters 5 through 7.
Drawbacks of Data Marts and Data Lakes
Over the past 40 years, IT departments built domain-specific DWHs, called data marts, to support data analysts. They have come to realize that such data marts are difficult to manage and can become very costly. Legacy systems that worked well in the past (such as on-premises Teradata and Netezza appliances) have proven to be difficult to scale, to be very expensive, and to pose a number of challenges related to data freshness. Additionally, they cannot easily provide modern capabilities such as access to AI/ML or real-time features without adding that functionality after the fact.
Data mart users are frequently analysts who are embedded in a specific business unit. They may have ideas about additional datasets, analysis, data processing, and business intelligence functionality that would be very beneficial to their work. However, in a traditional company, they frequently do not have direct access to data owners, nor can they easily influence the technical decision makers who decide on datasets and tools. Additionally, because they do not have access to raw data, they are unable to test hypotheses or gain a deeper understanding of the underlying data.
Data lakes are not as simple or cost-effective as they may seem. While they can be scaled easily in theory, organizations often face challenges in planning and provisioning sufficient storage, especially if they produce highly variable amounts of data. Additionally, provisioning computational capacity for peak periods can be expensive, leading to competition for scarce resources between different business units.
On-premises data lakes can be fragile and require time-consuming maintenance. Engineers who could be developing new features are often relegated to maintaining data clusters and scheduling jobs for business units. The total cost of ownership is often higher than expected. In short, many businesses find that their data lakes do not create value and that the ROI is negative.
With data lakes, governance is not easily solved, especially when different parts of the organization use different security models. Then, the data lakes become siloed and segmented, making it difficult to share data and models across teams.
Data lake users typically are closer to the raw data sources and need programming skills to use data lake tools and capabilities, even if it is just to explore the data. In traditional organizations, these users tend to focus on the data itself and are frequently held at arm’s length from the rest of the business. On the other hand, business users do not have the programming skills to derive insights from data in a data lake. This disconnect means that business units miss out on the opportunity to gain insights that would drive their business objectives forward to higher revenues, lower costs, lower risk, and new opportunities.
Convergence of DWHs and Data Lakes
Given these trade-offs, many companies end up with a mixed approach, where a data lake is set up to graduate some data into a DWH or a DWH has a side data lake for additional testing and analysis. However, with multiple teams fabricating their own data architectures to suit their individual needs, data sharing and fidelity gets even more complicated for a central IT team.
Instead of having separate teams with separate goals—where one explores the business and another knows the business—you can unite these functions and their data systems to create a virtuous cycle where a deeper understanding of the business drives exploration and that exploration drives a greater understanding of the business.
Starting from this principle, the data industry has begun shifting toward two new approaches, lakehouse and data mesh, which work well together because they help solve two separate challenges within an organization:
- Lakehouse allows users with different skill sets (data analysts and data engineers) to access the data using different technologies.
- Data mesh allows an enterprise to create a unified data platform without centralizing all the data in IT—this way, different business units can own their own data but allow other business units to access it in an efficient, scalable way.
As an added benefit, this architecture combination also brings in more rigorous data governance, something that data lakes typically lack. Data mesh empowers people to avoid being bottlenecked by one team and thus enables the entire data stack. It breaks silos into smaller organizational units in an architecture that provides access to data in a federated way.
Lakehouse
Data lakehouse architecture is a combination of the key benefits of data lakes and data warehouses (see Figure 1-6). It offers a low-cost storage format that is accessible by various processing engines, such as the SQL engines of data warehouses, while also providing powerful management and optimization features.
Databricks is a proponent of the lakehouse architecture because it was founded on Spark and needs to support business users who are not programmers. As a result, data in Databricks is stored in a data lake, but business users can use SQL to access it. However, the lakehouse architecture is not limited to Databricks.
DWHs running in cloud solutions like Google Cloud BigQuery, Snowflake, or Azure Synapse allow you to create a lakehouse architecture based around columnar storage that is optimized for SQL analytics: you can treat the DWH like a data lake by allowing Spark jobs running on parallel Hadoop environments to leverage the data stored in the underlying storage system, rather than requiring a separate ETL process or storage layer.
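The following sketch illustrates the converged pattern from the data lake side: open-format Parquet files in object storage are registered as a view and queried with plain SQL, with no copy into a separate warehouse store. Paths and column names are illustrative.

```python
# A sketch of the lakehouse idea: the same open-format files in object storage
# are queried with SQL, with no copy into a proprietary warehouse store.
# Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sql").getOrCreate()

# Data engineers land Parquet files in the lake...
orders = spark.read.parquet("gs://company-lake/curated/orders/")
orders.createOrReplaceTempView("orders")

# ...and analysts query them with plain SQL.
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()
```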
The lakehouse pattern offers several advantages over the traditional approaches:
- Decoupling of storage and compute that enables:
  - Inexpensive, virtually unlimited, and seamlessly scalable storage
  - Stateless, resilient compute
  - ACID-compliant storage operations
  - A logical database storage model, rather than physical
- Data governance (e.g., data access restriction and schema evolution)
- Support for data analysis via the native integration with business intelligence tools
- Native support of the typical multiversion approach of a data lake (i.e., bronze, silver, and gold)
- Data storage and management via open formats like Apache Parquet and Iceberg
- Support for different data types in structured or unstructured format
- Streaming capabilities with the ability to handle real-time analysis of the data
- Enablement of a diverse set of applications varying from business intelligence to ML
A lakehouse, however, is inevitably a technological compromise. The use of standard formats in cloud storage limits the storage optimizations and query concurrency that DWHs have spent years perfecting. Therefore, the SQL supported by lakehouse technologies is not as efficient as that of a native DWH (i.e., it will take more resources and cost more). Also, the SQL support tends to be limited, with features such as geospatial queries, ML, and data manipulation not available or incredibly inefficient. Similarly, the Spark support provided by DWHs is limited and tends to be not as performant as the native Spark support provided by a data lake vendor.
The lakehouse approach enables organizations to implement the core pillars of an incredibly varied data platform that can support any kind of workload. But what about the organizations on top of it? How can users leverage the best of the platform to execute their tasks? In this scenario there is a new operating model that is taking shape, and it is data mesh.
Data mesh
Data mesh is a decentralized operating model of tech, people, and process to solve the most common challenge in analytics—the desire for centralization of control in an environment where ownership of data is necessarily distributed, as shown in Figure 1-7. Another way of looking at data mesh is that it introduces a way of seeing data as a self-contained product rather than a product of ETL pipelines.
Distributed teams in this approach own the data production and serve internal/external consumers through well-defined data schema. As a whole, data mesh is built on a long history of innovation from across DWHs and data lakes, combined with the scalability, pay-for-consumption models, self-service APIs, and close integration associated with DWH technologies in the public cloud.
With this approach, you can effectively create an on-demand data solution. A data mesh decentralizes data ownership among domain data owners, each of whom are held accountable for providing their data as a product in a standard way (see Figure 1-8). A data mesh also enables communication between various parts of the organization to distribute datasets across different locations.
In a data mesh, the responsibility for generating value from data is federated to the people who understand it best; in other words, the people who created the data or brought it into the organization must also be responsible for creating consumable data assets as products from the data they create. In many organizations, establishing a “single source of truth” or “authoritative data source” is tricky due to the repeated extraction and transformation of data across the organization without clear ownership responsibilities over the newly created data. In the data mesh, the authoritative data source is the data product published by the source domain, with a clearly assigned data owner and steward who is responsible for that data.
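As a concrete (and entirely hypothetical) illustration, a domain team might describe its data product with an explicit, versioned contract such as the following; in practice this often lives in a data catalog or as a YAML data contract in source control.

```python
# A sketch of a data product descriptor a domain team might publish in a data
# mesh. The fields and values are hypothetical; real implementations often use
# a data catalog or a "data contract" file checked into source control.
customer_orders_product = {
    "name": "customer_orders",
    "domain": "ecommerce",                      # owning business unit
    "owner": "orders-team@example.com",         # accountable data owner/steward
    "description": "Cleaned, deduplicated customer orders, updated hourly.",
    "schema": [
        {"name": "order_id",    "type": "STRING",    "required": True},
        {"name": "customer_id", "type": "STRING",    "required": True},
        {"name": "amount",      "type": "NUMERIC",   "required": True},
        {"name": "order_ts",    "type": "TIMESTAMP", "required": True},
    ],
    "sla": {"freshness": "1h", "availability": "99.9%"},
    "access": {"read": ["marketing", "finance"], "pii": False},
    "version": "2.3.0",
}
```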
A data mesh is an organizational principle about ownership and accountability for data. Most commonly, a data mesh is implemented using a lakehouse, with each business unit having a separate cloud account. Having access to this unified view from a technology perspective (lakehouse) and from an organizational perspective (data mesh) means that people and systems get data delivered to them in a way that makes the most sense for their needs. In some cases this kind of architecture has to span multiple environments, resulting in very complex architectures. Let’s see how companies can manage this challenge.
Note
For more information about data mesh, we recommend you read Zhamak Dehghani’s book Data Mesh: Delivering Data-Driven Value at Scale (O’Reilly).
Hybrid Cloud
When designing a cloud data platform, it might be that one single environment isn’t enough to manage a workload end to end. This could be because of regulatory constraints (i.e., you cannot move your data into an environment outside the organization’s boundaries), because of cost (e.g., the organization made infrastructure investments that have not yet reached end of life), or because you need a specific technology that is not available in the cloud. In this case a possible approach is adopting a hybrid pattern, in which applications run in a combination of various environments. The most common example of the hybrid pattern is combining a private computing environment, like an on-premises data center, with a public cloud computing environment. In this section we will explain how this approach can work in an enterprise.
Reasons Why Hybrid Is Necessary
Hybrid cloud approaches are widespread because almost no large enterprise today relies entirely on the public cloud. Many organizations have invested millions of dollars and thousands of hours into on-premises infrastructure over the past few decades. Almost all organizations are running a few traditional architectures and business-critical applications that they may not be able to move over to public cloud. They may also have sensitive data they can’t store in a public cloud due to regulatory or organizational constraints.
Allowing workloads to transition between public and private cloud environments provides a higher level of flexibility and additional options for data deployment. There are several reasons that drive hybrid (i.e., architecture spanning across on-premises, public cloud, and edge) and multicloud (i.e., architecture spanning across multiple public cloud vendors like AWS, Microsoft Azure, and Google Cloud Platform [GCP], for example) adoption.
Here are some key business reasons for choosing hybrid and/or multicloud:
- Data residency regulations: Some organizations may never fully migrate to the public cloud, perhaps because they are in finance or healthcare and need to follow strict industry regulations on where data is stored. This is also the case with workloads in countries that have a data residency requirement but no public cloud presence.
- Legacy investments: Some customers want to protect legacy workloads like SAP, Oracle, or Informatica on premises but want to take advantage of public cloud innovations such as Databricks and Snowflake.
- Transition: Large enterprises often require a multiyear journey to modernize into cloud native applications and architectures. They will have to embrace hybrid architectures as an intermediate state for years.
- Burst to cloud: There are customers who are primarily on premises and have no desire to migrate to the public cloud. However, they have challenges meeting business service-level agreements (SLAs) due to ad hoc large batch jobs, spiky traffic during busy periods, or large-scale ML training jobs. They want to take advantage of scalable capacity or custom hardware in public clouds and avoid the cost of scaling up on-premises infrastructure. Solutions like MotherDuck, which adopt a “local-first” computing approach, are becoming popular.
- Best of breed: Some organizations choose different public cloud providers for different tasks in an intentional strategy to choose the technologies that best serve their needs. For example, Uber uses AWS to serve their web applications, but it uses Cloud Spanner on Google Cloud for its fulfillment platform. Twitter runs its news feed on AWS, but it runs its data platform on Google Cloud.
Now that you understand the reasons why you might choose a hybrid solution, let’s have a look at the main challenges you will face when using this pattern; these challenges are why hybrid ought to be treated as an exception, and the goal should be to be cloud native.
Challenges of Hybrid Cloud
There are several challenges that enterprises face when implementing hybrid or multicloud architectures:
- Governance: It is difficult to apply consistent governance policies across multiple environments. For example, compliance and security policies are usually dealt with differently on premises and in the public cloud. Often, parts of the data are duplicated across on premises and cloud. Imagine your organization is running a financial report—how would you guarantee that the data used is the most recently updated copy if multiple copies exist across platforms?
- Access control: User access controls and policies differ between on-premises and public cloud environments. Cloud providers have their own user access controls (called identity and access management, or IAM) for the services provided, whereas on-premises environments use technologies such as Lightweight Directory Access Protocol (LDAP) or Kerberos. How do you keep them synchronized or have a single control plane across distinct environments?
- Workload interoperability: When workloads span multiple systems, inconsistent runtime environments are inevitable and need to be managed.
- Data movement: If both on-premises and cloud applications require access to some data, the two datasets must be in sync. It is costly to move data between multiple systems—there is a human cost to create and manage the pipeline, there may be licensing costs due to the software used, and last but not least, it consumes system resources such as computation, network, and storage. How can your organization deal with the costs from multiple environments? How do you join heterogeneous data that is siloed across various environments? Where do you end up copying the data as a result of the join process?
- Skill sets: Having two clouds (or on premises and cloud) means teams have to know and build expertise in two environments. Since the public cloud is a fast-moving environment, there is a significant overhead associated with upskilling and maintaining the skills of employees in one cloud, let alone two. Skill sets can also be a challenge for hiring systems integrators (SIs)—even though most large SIs have practices for each of the major clouds, very few have teams that know two or more clouds. As time goes on, we anticipate that it will become increasingly difficult to hire people willing to learn bespoke on-premises technologies.
- Economics: The fact that the data is split between two environments can bring unforeseen costs: you might have data in one cloud that you want to make available in another, incurring egress costs.
Despite these challenges, a hybrid setup can work. We’ll look at how in the next subsection.
Why Hybrid Can Work
Cloud providers are aware of these needs and these challenges. Therefore, they provide some support for hybrid environments. These fall into three areas:
- Choice: Cloud providers often make large contributions to open source technologies. For example, although Kubernetes and TensorFlow were developed at Google, they are open source, so managed execution environments for them exist in all the major clouds and they can be leveraged even in on-premises environments.
- Flexibility: Frameworks such as Databricks and Snowflake allow you to run the same software on any of the major public cloud platforms. Thus, teams can learn one set of skills that will work everywhere. Note that the flexibility offered by tools that work on multiple clouds does not mean that you have escaped lock-in. You will have to choose between (1) lock-in at the framework level and flexibility at the cloud level (offered by technologies such as Databricks or Snowflake) and (2) lock-in at the cloud level and flexibility at the framework level (offered by the cloud native tools).
- Openness: Even when the tool is proprietary, code for it is written in a portable manner because of the embrace of open standards and import/export mechanisms. For example, even though Redshift runs nowhere but on AWS, its queries are written in standard SQL and there are multiple import and export mechanisms. Together, these capabilities make Redshift, BigQuery, and Synapse open platforms. This openness allows for use cases like Teads, where data is collected using Kafka on AWS, aggregated using Dataflow and BigQuery on Google Cloud, and written back to AWS Redshift (see Figure 1-9).
Cloud providers are making a commitment to choice, flexibility, and openness by making heavy investments in open source projects that help customers use multiple clouds. Therefore, multicloud DWHs or hybrid data processing frameworks are becoming reality. So you can build out hybrid and multicloud deployments with better cloud software production, release, and management—the way you want, not how a vendor dictates.
Edge Computing
Another incarnation of the hybrid pattern arises when you need computational power outside the usual data platform perimeter, perhaps to interact directly with connected devices. This is edge computing: it brings computation and data storage closer to the place where data is generated and needs to be processed, with the aim of improving response times and saving bandwidth. Edge computing can unlock many use cases and accelerate digital transformation, with application areas such as security, robotics, predictive maintenance, and smart vehicles.
As edge computing is adopted and goes mainstream, there are many potential advantages for a wide range of industries:
- Faster response time
-
In edge computing, storage and computation are distributed and made available at the point where a decision needs to be made. Avoiding a round trip to the cloud reduces latency and enables faster responses. In predictive maintenance, this helps stop critical machines from breaking down or hazardous incidents from occurring. In interactive gaming, edge computing can provide the millisecond response times that are required. In fraud prevention and security scenarios, it can protect against privacy breaches and denial-of-service attacks.
- Intermittent connectivity
-
Unreliable internet connectivity at remote assets such as oil wells, farm pumps, solar farms, or windmills can make monitoring those assets difficult. Edge devices’ ability to locally store and process data ensures no data loss or operational failure in the event of limited internet connectivity.
- Security and compliance
-
Edge computing can eliminate a lot of data transfer between devices and the cloud. It is possible to filter sensitive information locally and transmit only the data needed for model building to the cloud. For example, with smart devices, wake-word detection such as listening for “OK Google” or “Alexa” can happen on the device itself, so potentially private data never needs to be collected or sent to the cloud. This allows users to build an appropriate security and compliance framework, which is essential for enterprise security and audits.
- Cost-effective solutions
-
One of the practical concerns around IoT adoption is the up-front cost of network bandwidth, data storage, and computational power. By performing much of the computation locally, edge computing lets businesses decide which services to run at the edge and which to send to the cloud, reducing the final cost of an overall IoT solution. This is where low-memory binary deployment of embedded models in a format like Open Neural Network Exchange (ONNX), built with a modern compiled language like Rust or Go, can excel (see the sketch after this list).
- Interoperability
-
Edge devices can act as a communication liaison between legacy and modern machines. This allows legacy industrial machines to connect to modern machines or IoT solutions and provides immediate benefits of capturing insights from legacy or modern machines.
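As a simple illustration of the embedded-model idea mentioned under cost-effective solutions, the sketch below loads an exported ONNX model and scores sensor data locally with the onnxruntime library. The model file, input layout, and threshold are hypothetical; in production this logic might instead be compiled into a small Rust or Go binary, as noted above, so that only predictions and alerts leave the device.

```python
# Minimal sketch: run a pre-exported ONNX model on an edge device so that
# only predictions (not raw sensor data) are sent upstream.
# The model path, input shape, and threshold are hypothetical.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("vibration_anomaly.onnx")
input_name = session.get_inputs()[0].name

def score_locally(sensor_window: np.ndarray) -> float:
    """Return an anomaly score for one window of sensor readings."""
    batch = sensor_window.astype(np.float32).reshape(1, -1)
    outputs = session.run(None, {input_name: batch})
    return float(outputs[0].ravel()[0])

# Score a 64-sample vibration window; only notify the cloud if the local
# score crosses a threshold.
window = np.random.rand(64)
if score_locally(window) > 0.9:
    print("anomaly detected - send event to cloud")
```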
All these concepts give architects a great deal of flexibility in defining their data platform. In Chapter 9 we will dive deeper into these concepts and see how this pattern is becoming a standard.
Applying AI
Many organizations are thrust into designing a cloud data platform because they need to adopt AI technologies, so it is important to ensure that the platform is future proof and capable of supporting AI use cases. Considering the impact AI is having on society and its growing diffusion within enterprises, let’s take a quick look at how it can be implemented in an enterprise environment. You will find a deeper discussion in Chapters 10 and 11.
Machine Learning
These days, a branch of AI called supervised machine learning has become tremendously successful to the point where the term AI is more often used as an umbrella term for this branch. Supervised ML works by showing the computer program lots of examples where the correct answers (called labels) are known. The ML model is a standard algorithm (i.e., the exact same code) that has tunable parameters that “learn” how to go from the provided input to the label. Such a learned model is then deployed to make decisions on inputs for which the correct answers are not known.
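As a minimal illustration of this train-on-labeled-examples, predict-on-new-inputs loop, here is a sketch using scikit-learn; the feature values and labels are made up for illustration, not drawn from any real dataset.

```python
# Minimal supervised learning sketch: a standard algorithm (logistic
# regression) tunes its parameters from labeled examples, then predicts
# labels for inputs whose answers are unknown. All data is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Labeled examples: [amount, known_device] per transaction, label 1 = fraud.
X_train = np.array([[120.0, 0], [15.5, 1], [980.0, 0], [22.0, 1]])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)      # "learning" = tuning the parameters

# The deployed model makes decisions on inputs with no known answer.
X_new = np.array([[870.0, 0]])
print(model.predict(X_new))      # e.g., array([1])
```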
Unlike expert systems, there is no need to explicitly program the AI model with the rules to make decisions. Because many real-world domains involve human judgment where experts struggle to articulate their logic, having the experts simply label input examples is much more feasible than capturing their logic.
Modern-day chess-playing algorithms and medical diagnostic tools use ML. The chess-playing algorithms learn from records of games that humans have played in the past,2 whereas medical diagnostic systems learn from having expert physicians label diagnostic data.
Generative AI, a branch of AI/ML that has recently become extremely capable, can not only understand images and text but also generate realistic images and text. Besides creating new content in applications such as marketing, generative AI streamlines the interaction between machines and users: users can ask questions in natural language and automate many operations using English, or other languages, instead of having to know programming languages.
In order for these ML methods to operate, they require tremendous amounts of training data and readily available custom hardware. Because of this, organizations adopting AI start out by building a cloud data/ML platform.
Uses of ML
There are a few key reasons for the spectacular adoption of ML in industry:
- Data is easier.
-
It is easier to collect labeled data than to capture logic. Every piece of human reasoning has exceptions that would have to be discovered and coded up over time. It is easier to get a team of ophthalmologists to label a thousand images than it is to get them to describe how they identify that a blood vessel is hemorrhaging.
- Retraining is easier.
-
When ML is used for systems such as recommending items to users or running marketing campaigns, user behavior changes quickly, so it is important to retrain models continually. Continual retraining is practical with ML, but much harder with hand-coded logic.
- Better user interface.
-
A class of ML called deep learning has proven capable of being trained even on unstructured data such as images, video, and natural language text. These types of inputs are notoriously difficult to program against. This enables you to use real-world data as inputs—consider how much better the user interface of depositing checks becomes when you can simply take a photograph of a check instead of having to type all the information into a web form.
- Automation.
-
The ability of ML models to understand unstructured data makes it possible to automate many business processes. Forms can be easily digitized, instrument dials can be more easily read, and factory floors can be more easily monitored because of the ability to automatically process natural language text, images, or video.
- Cost-effectiveness.
-
ML APIs that give machines the ability to understand and create text, images, music, and video cost a fraction of a cent per invocation, whereas paying a human to do so would cost several orders of magnitude more. This enables the use of technology in situations such as recommendations, where a personal shopping assistant would be prohibitively expensive.
- Assistance.
-
Generative AI can empower developers, marketers, and other white-collar workers to be more productive. Coding assistants and workflow copilots are able to simplify parts of many corporate functions, such as sending out customized sales emails.
Given these advantages, it is not surprising that a Harvard Business Review article found that AI generally supports three main business requirements:
-
Automating business processes—typically automating back-office administrative and financial tasks
-
Gaining insight through data analysis
-
Engaging with customers and employees
ML makes it possible to solve these problems at scale from data examples, without needing to write custom code for everything. Techniques such as deep learning extend this to cases where the data consists of unstructured information such as images, speech, video, and natural language text.
Why Cloud for AI?
A key impetus behind designing a cloud data platform might be that the organization is rapidly adopting AI technologies such as deep learning. In order for these methods to operate, they require tremendous amounts of training data. Therefore, an organization that plans to build ML models will need to build a data platform to organize and make the data available to their data science teams. The ML models themselves are very complex, and training the models requires copious amounts of specialized hardware called graphics processing units (GPUs). Further, AI technologies such as speech transcription, machine translation, and video intelligence tend to be available as SaaS software on the cloud. In addition, cloud platforms provide key capabilities such as democratization, easier operationalization, and the ability to keep up with the state of the art.
Cloud Infrastructure
The bottom line is that high-quality AI requires a lot of data—a famous paper titled “Deep Learning Scaling Is Predictable, Empirically” found that to get a 5% improvement in a natural language model, it was necessary to train on twice as much data as was used to get the first result. The best ML models are not the most advanced ones—they are the ones trained on more data of high-enough quality, because increasingly sophisticated models require more data, whereas even simple models improve in performance when trained on a sufficiently large dataset.
To give you an idea of the quantity of data required to complete the training of modern ML models, image classification models are routinely trained on one million images and leading language models are trained on multiple terabytes of data.
As shown in Figure 1-10, this quantity of data requires a lot of efficient, bespoke computation—provided by accelerators such as GPUs and custom application-specific integrated circuits (ASICs) called tensor processing units (TPUs)—to harness it and make sense of it.
Many recent AI advances can be attributed to increases in data size and compute power. The synergy between the large datasets in the cloud and the numerous computers that power it has enabled tremendous breakthroughs in ML. Breakthroughs include reducing word error rates in speech recognition by 30% over traditional approaches, the biggest gain in 20 years.
Democratization
Architecting ML models, especially in complex domains such as time-series processing or natural language processing (NLP), requires knowledge of ML theory. Writing code for training ML models using frameworks such as PyTorch, Keras, or TensorFlow requires knowledge of Python programming and linear algebra. In addition, data preparation for ML often requires data engineering expertise, and evaluating ML models requires knowledge of advanced statistics. Deploying ML models and monitoring them requires knowledge of DevOps and software engineering (often termed MLOps). Needless to say, it is rare that all these skills are present in every organization. Given this, leveraging ML for business problems can be difficult for a traditional enterprise.
Cloud technologies offer several options to democratize the use of ML:
- ML APIs
-
Cloud providers offer prebuilt ML models that can be invoked via APIs. At that point, a developer can consume the ML model like any other web service; all they require is the ability to program against representational state transfer (REST) web services. Examples of such ML APIs include Google Translate, Azure Text Analytics, and Amazon Lex—these APIs can be used without any knowledge of NLP. Cloud providers also offer generative models for text and image generation as APIs where the input is just a text prompt.
- Customizable ML models
-
Some public clouds offer “AutoML”: end-to-end ML pipelines that can be trained and deployed with the click of a mouse. AutoML models carry out “neural architecture search,” essentially automating the design of ML models through a search mechanism. While training takes longer than if a human expert chooses an effective model for the problem, an AutoML system can suffice for lines of business that don’t have the capability to architect their own models. Note that not all AutoML is the same—sometimes what’s called AutoML is just parameter tuning. Make sure you are getting a custom-built architecture rather than simply a choice among prebuilt models, and double-check which steps are actually automated (e.g., feature engineering, feature extraction, feature selection, model selection, parameter tuning, problem checking, etc.).
- Simpler ML
-
Some DWHs (BigQuery and Redshift at the time of writing) provide the ability to train ML models on structured data using just SQL, and they support more complex models by delegating to Vertex AI and SageMaker, respectively. Tools like DataRobot and Dataiku offer point-and-click interfaces to train ML models, and cloud platforms make fine-tuning of generative models much easier than doing it yourself. A sketch of the SQL-only approach appears after this list.
- ML solutions
-
Some applications are so common that end-to-end ML solutions are available to purchase and deploy. Product Discovery on Google Cloud offers an end-to-end search and ranking experience for retailers. Amazon Connect offers a ready-to-deploy contact center powered by ML. Azure Knowledge Mining provides a way to mine a variety of content types. In addition, companies such as Quantum Metric and C3 AI offer cloud-based solutions for problems common in several industries.
- ML building blocks
-
Even if no solution exists for the entire ML workflow, parts of it can take advantage of prebuilt building blocks. For example, recommender systems require the ability to match items and products; a general-purpose matching algorithm called two-tower encoders is available from Google Cloud. While there is no end-to-end back-office automation ML model, you could take advantage of form parsers to implement that workflow more quickly.
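To make the “Simpler ML” option above concrete, here is a hedged sketch of training and scoring a model with nothing but SQL, using BigQuery ML syntax submitted through the Python client. The dataset, table, and column names are hypothetical placeholders; Redshift ML offers a broadly similar SQL-based workflow.

```python
# Sketch: training an ML model with SQL alone (BigQuery ML).
# Dataset, table, and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model directly in the warehouse.
client.query("""
    CREATE OR REPLACE MODEL mydataset.churn_model
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM mydataset.customers
""").result()

# Score new rows with the trained model, still in SQL.
rows = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(MODEL mydataset.churn_model,
                    (SELECT * FROM mydataset.new_customers))
""").result()

for row in rows:
    print(row.customer_id, row.predicted_churned)
```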
These capabilities allow enterprises to adopt AI even if they don’t have deep expertise in it, thereby making AI more widely available.
Even if the enterprise does have expertise in AI, these capabilities prove very useful because you still have to decide whether to buy or build an ML system. There are usually more ML opportunities than there are people to solve them, so there is an advantage to carrying out noncore functionality with prebuilt tools and solutions. These out-of-the-box solutions can deliver a lot of value immediately without requiring custom applications. For example, natural language text can be passed to a prebuilt model via an API call to translate it from one language to another. This not only reduces the effort to build applications but also enables non-ML experts to use AI. On the other end of the spectrum, the problem may require a custom solution. For example, retailers often build ML models to forecast demand so they know how much product to stock. These models learn buying patterns from the company’s historical sales data, combined with in-house expert intuition.
Another common pattern is to use prebuilt, out-of-the-box models for quick experimentation, and once the ML solution has proven its value, a data science team can build it in a bespoke way to get greater accuracy and hopefully more differentiation against the competition.
Real Time
The ML infrastructure needs to be integrated with a modern data platform because real-time, personalized ML is where the value is. Speed of analytics therefore becomes critical: the data platform must be able to ingest, process, and serve data in real time, or opportunities are lost. This is complemented by speed of action. ML drives personalized services based on the customer’s context, but it has to provide inference before that context switches—for most commercial transactions there is a closing window within which the ML model needs to give the customer an option to act. To achieve this, the results of ML models must arrive at the point of action in real time.
Being able to supply ML models with data in real time and get the ML prediction in real time is the difference between preventing fraud and discovering fraud. To prevent fraud, it is necessary to ingest all payment and customer information in real time, run the ML prediction, and provide the result of the ML model back to the payment site in real time so that the payment can be rejected if fraud is suspected.
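As an illustration of that request/response loop, the sketch below shows a minimal synchronous scoring function that could sit behind a payment endpoint. The feature names, scoring logic, and threshold are hypothetical stand-ins for a trained model; in a real deployment this would be a low-latency service fed by streaming ingestion of payment and customer events.

```python
# Minimal sketch of inline (real-time) fraud scoring at payment time.
# The features, weights, and threshold are hypothetical stand-ins for a
# trained model served behind a low-latency endpoint.
from dataclasses import dataclass

@dataclass
class Payment:
    amount: float
    seconds_since_last_payment: float
    country_mismatch: bool

def fraud_score(p: Payment) -> float:
    """Stand-in for a trained model's probability estimate."""
    score = 0.0
    score += 0.4 if p.amount > 1000 else 0.0
    score += 0.3 if p.seconds_since_last_payment < 10 else 0.0
    score += 0.3 if p.country_mismatch else 0.0
    return score

def authorize(p: Payment, threshold: float = 0.7) -> bool:
    """Reject the payment synchronously if the score crosses the threshold."""
    return fraud_score(p) < threshold

print(authorize(Payment(amount=2500.0,
                        seconds_since_last_payment=4.2,
                        country_mismatch=True)))   # False -> reject
```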
Other situations where real-time processing saves money are customer service and cart abandonment. Catching customer frustration in a call center and immediately escalating the situation is important to make the service effective—it costs far more to reacquire a customer once lost than to give them good service in the moment. Similarly, if a cart is at risk of being abandoned, offering an enticement such as 5% off or free shipping may cost less than the much larger promotions required to get the customer back on the website.
In other situations, batch processing is simply not an effective option. Real-time traffic data and real-time navigation models are required for Google Maps to allow drivers to avoid traffic.
As you will see in Chapter 8, the resilience and autoscaling capability of cloud services is hard to achieve on premises. Thus, real-time ML is best done in the cloud.
MLOps
Another reason that ML is better in the public cloud is that operationalizing ML is hard. Effective and successful ML projects require operationalizing both data and code. Observing, orchestrating, and acting on the ML lifecycle is termed MLOps.
Building, deploying, and running ML applications in production entails several stages, as shown in Figure 1-11. For the incoming data, you have to perform data preprocessing and validation to make sure there are no data quality issues, followed by feature engineering, then model training, and finally hyperparameter tuning. All these steps need to be orchestrated and monitored; if, for example, data drift is detected, the models may need to be retrained automatically. Models have to be retrained on a constant basis and redeployed, after verifying that they are safe to deploy.
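The skeleton below sketches how those stages might be wired together, with retraining triggered by a naive drift check. The stage bodies are toy stand-ins chosen only for illustration; a real deployment would run each stage as a managed, monitored task in one of the pipeline services mentioned shortly.

```python
# Skeleton of an orchestrated ML workflow: validation -> feature
# engineering -> training/tuning -> deployment, with retraining triggered
# when the serving data drifts from the training data. Stage bodies are
# toy stand-ins for illustration only.
import numpy as np

def validate_data(raw):
    # Data quality gate: drop rows containing NaNs.
    return raw[~np.isnan(raw).any(axis=1)]

def engineer_features(data):
    # Toy feature engineering: standardize each column.
    return (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-9)

def train_and_tune(features):
    # Stand-in for model training plus hyperparameter tuning.
    return features.mean(axis=0)

def drift_detected(live, reference, tol=0.5):
    # Naive drift check: mean shift of any raw feature beyond a tolerance.
    return bool(np.any(np.abs(live.mean(axis=0) - reference.mean(axis=0)) > tol))

def run_pipeline(raw):
    data = validate_data(raw)
    model = train_and_tune(engineer_features(data))
    # A deploy_if_safe(model) step with canary checks would go here.
    return model, data   # keep training data as the drift reference

# Continuous operation: retrain when live data no longer looks like
# the data the current model was trained on.
model, reference = run_pipeline(np.random.rand(100, 3))
live_batch = np.random.rand(50, 3) + 1.0   # simulated drifted data
if drift_detected(live_batch, reference):
    model, reference = run_pipeline(live_batch)
```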
In addition to the data-specific aspects of monitoring discussed, you also have the monitoring and operationalization that is necessary for any running service. A production application is often running continuously 24/7/365, with new data coming in regularly. Thus, you need tooling that makes it easy to orchestrate and manage these multiphase ML workflows and to run them reliably and repeatedly.
Cloud AI platforms such as Google’s Vertex AI, Microsoft’s Azure Machine Learning, and Amazon’s SageMaker provide managed services for the entire ML workflow. Doing this on premises requires you to cobble together the underlying technologies and manage the integrations yourself.
At the time of writing this book, MLOps capabilities are being added at a breakneck pace to the various cloud platforms. This brings up an ancillary point, that with the rapid pace of change in ML, you are better off delegating the task of building and maintaining ML infrastructure and tooling to a third party and focusing on data and insights that are relevant to your core business.
In summary, a cloud-based data and AI platform can help resolve traditional challenges with data silos, governance, and capacity while enabling the organization to prepare for a future where AI capabilities become more important.
Core Principles
When designing a data platform, it can help to set down key design principles to adhere to and the weight that you wish to assign to each of these principles. It is likely that you will need to make trade-offs between these principles, and having a predetermined scorecard that all stakeholders have agreed to can help you make decisions without having to go back to first principles or getting swayed by the squeakiest wheel.
Here are the five key design principles for a data analytics stack that we suggest, although the relative weighting will vary from organization to organization:
- Deliver serverless analytics, not infrastructure.
-
Design analytics solutions for fully managed environments and avoid a lift-and-shift approach as much as possible. Focus on a modern serverless architecture so that your data scientists (we use this term broadly to include data engineers, data analysts, and ML engineers) can keep their focus purely on analytics and move away from infrastructure considerations. For example, use automated data transfer to extract data from your systems and provide an environment for shared data with federated querying across any service (see the sketch after this list). This eliminates the need to maintain custom frameworks and data pipelines.
- Embed end-to-end ML.
-
Allow your organization to operationalize ML end to end. It is impossible to build every ML model that your organization needs, so make sure you are building a platform within which it is possible to embed democratized ML options such as prebuilt ML models, ML building blocks, and easier-to-use frameworks. Ensure that when custom training is needed, there is access to powerful accelerators and customizable models. Ensure that MLOps is supported so that deployed ML models don’t drift and become no longer fit for purpose. Make the ML lifecycle simpler on the entire stack so that the organization can derive value from its ML initiatives faster.
- Empower analytics across the entire data lifecycle.
-
The data analytics platform should offer a comprehensive set of core data analytics workloads. Ensure that your data platform provides data storage, data warehousing, streaming data analytics, data preparation, big data processing, data sharing and monetization, business intelligence (BI), and ML. Avoid buying one-off solutions that you will have to integrate and manage yourself. Looking at the analytics stack holistically will, in return, allow you to break down data silos, power applications with real-time data, add read-only datasets, and make query results accessible to anyone.
- Enable open source software (OSS) technologies.
-
Wherever possible, ensure that open source is at the core of your platform. You want to ensure that any code that you write uses OSS standards such as standard SQL, Apache Spark, TensorFlow, etc. By enabling the best open source technologies, you will be able to provide flexibility and choice in data analytics projects.
- Build for growth.
-
Ensure that the data platform that you build will be able to scale to the data size, throughput, and number of concurrent users that your organization is expected to face. Sometimes, this will involve picking different technologies (e.g., SQL for some use cases and NoSQL for other use cases). If you do so, ensure that the two technologies that you pick interoperate with each other. Leverage solutions and frameworks that have been proven and used by the world’s most innovative companies to run their mission-critical analytics apps.
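As an example of the serverless, federated access advocated in the first principle above, the sketch below joins warehouse data with rows queried in place from an operational database, using BigQuery's EXTERNAL_QUERY function. The connection ID, dataset, and table names are hypothetical, and comparable federation features exist on other platforms.

```python
# Sketch of a federated query: join warehouse data with rows queried in
# place from an operational Cloud SQL database, with no pipeline to build
# or maintain. Connection ID, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT o.customer_id,
           c.segment,
           SUM(o.amount) AS total_spend
    FROM mydataset.orders AS o
    JOIN EXTERNAL_QUERY(
        'my-project.us.crm-connection',
        'SELECT customer_id, segment FROM customers'
    ) AS c
    USING (customer_id)
    GROUP BY o.customer_id, c.segment
"""

for row in client.query(sql).result():
    print(row.customer_id, row.segment, row.total_spend)
```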
Overall, these factors are listed in the order that we typically recommend them. Since the two primary motivations of enterprises in choosing to do a cloud migration are cost and innovation, we recommend that you prioritize serverless (for cost savings and freeing employees from routine work) and end-to-end ML (for the wide variety of innovation that it enables).
In some situations, you might want to prioritize some factors over others. For startups, we typically recommend that the most important factors are serverless, growth, and end-to-end ML. Comprehensiveness and openness can be sacrificed for speed. Highly regulated enterprises might favor comprehensiveness, openness, and growth over serverless and ML (i.e., on premises might be necessitated by regulators). For digital natives, we recommend, in order, end-to-end ML, serverless, growth, openness, and comprehensiveness.
Summary
This was a high-level introduction to data platform modernization. Starting from the definition of the data lifecycle, we looked at the evolution of data processing, the limitations of traditional approaches, and how to create a unified analytics platform on the cloud. We also looked at how to extend the cloud data platform to be a hybrid one and to support AI/ML. The key takeaways from this chapter are as follows:
-
The data lifecycle has five stages: collect, store, process, analyze/visualize, and activate. These need to be supported by a data and ML platform.
-
Traditionally, organizations’ data ecosystems consist of independent solutions that lead to the creation of silos within the organization.
-
Data movement tools can break data silos, but they impose a few drawbacks: latency, data engineering resource bottlenecks, maintenance overhead, change management, and data gaps.
-
Centralizing control of data within IT leads to organizational challenges: IT departments don’t have the necessary skills, analytics teams get poor data, and business teams do not trust the results.
-
Organizations need to build a cloud data platform to obtain best-of-breed architectures, handle consolidation across business units, scale on-prem resources, and plan for business continuity.
-
A cloud data platform leverages modern approaches and aims to enable data-led innovation through replatforming data, breaking down silos, democratizing data, enforcing data governance, enabling decision making in real time and using location information, and moving seamlessly from descriptive analytics to predictive and prescriptive analytics.
-
All data can be exported from operational systems to a centralized data lake for analytics. The data lake serves as the central repository for analytics workloads and for business users. The drawback, however, is that business users do not have the skills to program against a data lake.
-
DWHs are centralized analytics stores that support SQL, something that business users are familiar with.
-
The data lakehouse is based on the idea that all users, regardless of their technical skills, can and should be able to use data. By providing a centralized and underlying framework for making data accessible, different tools can be used on top of the lakehouse to meet the needs of each user.
-
Data mesh introduces a way of seeing data as a self-contained product. Distributed teams in this approach own the data production and serve internal/external consumers through well-defined data schemas.
-
A hybrid cloud environment is a pragmatic approach to meet the realities of the enterprise world such as acquisitions, local laws, and latency requirements.
-
The ability of the public cloud to provide ways to manage large datasets and provision GPUs on demand makes it indispensable for all forms of ML, but deep learning and generative AI in particular. In addition, cloud platforms provide key capabilities such as democratization, easier operationalization, and the ability to keep up with the state of the art.
-
The five core principles of a cloud data platform are to prioritize serverless analytics, end-to-end ML, comprehensiveness, openness, and growth. The relative weights will vary from organization to organization.
Now that you know where you want to land, in the next chapter, we’ll look at a strategy to get there.
1 Not just the cost of the technology or license fees—the cost here includes people costs, and SQL skills tend to cost less to an organization than Java or Python skills.
2 Recent ML systems such as AlphaGo learn by looking at games played between machines themselves: this is an advanced type of ML called reinforcement learning, but most industrial uses of ML are of the simpler supervised kind.