Streaming Data Mesh

by Hubert Dulay, Stephen Mooney

May 2023

Intermediate to advanced

223 pages

5h 58m

English

O'Reilly Media, Inc.

Who Should Read This Book; Why We Wrote This Book; Navigating This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments; Hubert; Stephen
Data Divide; Data Mesh Pillars; Data Ownership; Data as a Product; Federated Computational Data Governance; Self-Service Data Platform; Data Mesh Diagram; Other Similar Architectural Patterns; Data Fabric; Data Gateways and Data Services; Data Democratization; Data Virtualization; Focusing on Implementation; Apache Kafka; AsyncAPI
The Streaming Advantage; Streaming Enables Real-Time Use Cases; Streaming Enables Data Optimization Advantages; Reverse ETL; The Kappa Architecture; Lambda Architecture Introduction; Kappa Architecture Introduction; Summary
Identifying Domains; Discernible Domains; Geographic Regions; Hybrid Architecture; Multicloud; Avoiding Ambiguous Domains; Domain-Driven Design; Domain Model; Domain Logic; Bounded Context; The Ubiquitous Language; Data Mesh Domain Roles; Data Product Engineer; Data Product Owner or Data Steward; Streaming Data Mesh Tools and Platforms to Consider; Domain Charge-Backs; Summary
Defining Data Product Requirements; Identifying Data Product Derivatives; Derivatives from Other Domains; Ingesting Data Product Derivatives with Kafka Connect; Consumability; Synchronous Data Sources; Asynchronous Data Sources and Change Data Capture; Debezium Connectors; Transforming Data Derivatives to Data Products; Data Standardization; Protecting Sensitive Information; SQL; Extract, Transform, and Load; Publishing Data Products with AsyncAPI; Registering the Streaming Data Product; Building an AsyncAPI YAML Document; Assigning Data Tags; Versioning; Monitoring; Summary
Data Governance in a Streaming Data Mesh; Data Lineage Graph; Streaming Data Catalog to Organize Data Products; Metadata; Schemas; Lineage; Security; Scalability; Generating the Data Product Page from AsyncAPI; Apicurio Registry; Access Workflow; Centralized Versus Decentralized; Centralized Engineers; Decentralized (Domain) Engineers; Summary
Streaming Data Mesh CLI; Resource-Related Commands; Cluster-Related Commands; Topic-Related Commands; The domain Commands; The connect Commands; The streaming Commands; Publishing a Streaming Data Product; Data Governance-Related Services; Security Services; Standards Services; Lineage Services; SaaS Services and APIs; Summary
Infrastructure; Two Architecture Solutions; Dedicated Infrastructure; Multitenant Infrastructure; Streaming Data Mesh Central Architecture; The Domain Agent (aka Sidecar); Data Plane; Control Plane; Summary
The Traditional Data Warehouse Structure; Introducing the Decentralized Team Structure; Empowering People; Working Processes; Fostering Collaboration; Data-Driven Automation; New Roles in Data Domains; New Roles in the Data Plane; New Roles in Data Science and Business Intelligence
Separating Data Engineering from Data Science; Online and Offline Data Stores; Apache Feast Introduction; Summary

Streaming Data Mesh Example; Deploying an On-Premises Streaming Data Mesh; Installing a Connector; Deploying Clickstream Connector and Auto-Creating Tables; Deploying the Debezium Postgres CDC Connector; Enrichment of Streaming Data; Publishing the Data Product; Consuming Streaming Data Products; Fully Managed SaaS Services; Summary and Considerations

Content preview from Streaming Data Mesh

Chapter 1. Data Mesh Introduction

Youngsters think that at some point data architectures were easy, and then data volume, velocity, variety grew and we needed new architectures which are hard. In reality, data problems were always organization problems and therefore were never solved.

Gwen (Chen) Shapira, Kafka: The Definitive Guide (O’Reilly)

If you’re working at a growing company, you’ll realize that a positive correlation exists between company growth and the scale of ingress data. This growth could come from increased usage of existing applications or from newly added applications and features. It’s up to the data engineer to organize, optimize, process, govern, and serve this growing data to consumers while maintaining service-level agreements (SLAs). Most likely, these SLAs were guaranteed to the consumers without the data engineer’s input. The first thing you learn when working with such a large amount of data is that when data processing times start to encroach on the guarantees made by these SLAs, more focus is put on staying within the SLAs, and concerns like data governance are marginalized. This in turn generates distrust in the data being served and ultimately distrust in the analytics, the same analytics that could be used to improve operational applications to generate more revenue or prevent revenue loss.

If you replicate this problem across all lines of business in the enterprise, you start to get very unhappy data engineers trying to speed up data pipelines within ...