Data Mesh in Practice
How to set the foundations for federated data ownership
The data lake paradigm is often considered the scalable successor of the more curated data warehouse approach when it comes to democratization of data. However, many who set out to build a centralized data lake came back with a data swamp of unclear responsibilities, a lack of data ownership, and subpar data availability.
Accessibility and availability can only be guaranteed at scale by moving more responsibility to the people closest to the data—the data owners, who have the relevant domain knowledge—while keeping only data governance and metadata management central. Such a decentralized, domain-focused approach has recently been coined a data mesh.
Join experts Max Schultze and Arif Wider for a concise, comprehensive overview of the data mesh. You’ll learn how to tackle the challenges of decentralized data ownership and how to provide the right platform tooling that enables data owners to take on responsibility in a scalable and sustainable fashion. You’ll also discover how to provide data in such a way that others can create value from it, and explore the concept of a data product, which goes beyond sharing files toward guaranteeing quality and acknowledging data ownership.
What you’ll learn and how you can apply it
By the end of this live online course, you’ll understand:
- The consequences of unclear data ownership
- What a scalable structure of domain-driven, federated responsibilities looks like
- How a shared data infrastructure platform can contribute
And you’ll be able to:
- Facilitate steps toward federated data ownership in your company
- Provide data in such a way that others can create value from it
- Support data ownership by providing domain-agnostic infrastructure tooling
This live event is for you because…
- You’re a software or data engineer.
- You work with data production, infrastructure, or consumption.
- You want to become a data product owner.
Prerequisites
- Familiarity with distributed data processing
- A basic understanding of Python
Recommended preparation
- Read “The Trouble with Distributed Systems” and “Batch Processing” (chapters 8 and 10 in Designing Data-Intensive Applications)
- Read “DataLake” (article)
- Read “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh” (article)
The timeframes are only estimates and may vary according to how the class is progressing.
Introduction to data mesh (25 minutes)
- Presentation: What’s the data mesh paradigm?; Why was it invented?
- Exercise: Jupyter Notebook setup
The data consumer perspective (45 minutes)
- Exercise: Calculate a set of business KPIs from a prepared, fairly undocumented dataset
- Presentation: Overview of data mesh—product thinking for data, domain-driven design applied to distributed data, and platform thinking for data infrastructure; issues on the consumer side
Break (5 minutes)
The data producer perspective (45 minutes)
- Presentation: What to do on the data producer side; how to create a data product; how to think about domain boundaries
- Exercise: Rewrite the introduced dataset with a proper column description; create a schema and dataset description
- Presentation: Why building a good data product is hard
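The schema exercise above can be sketched in a few lines. This is a hypothetical illustration, not the course’s actual notebook code: the `DataProduct` and `Column` names, and the `checkout.orders` dataset, are invented for the example. The point is that a data product carries column-level documentation and a named owner, so consumers can understand the data without asking the producing team.

```python
# Hypothetical sketch: attach column-level documentation and an
# accountable owner to a dataset, so it becomes a discoverable
# data product rather than an undocumented file dump.
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    dtype: str
    description: str


@dataclass
class DataProduct:
    name: str
    owner: str  # the domain team accountable for this data
    columns: list = field(default_factory=list)

    def describe(self) -> str:
        """Render a human-readable dataset description."""
        header = f"{self.name} (owner: {self.owner})"
        lines = [f"  {c.name} [{c.dtype}]: {c.description}" for c in self.columns]
        return "\n".join([header] + lines)


# Example dataset and names are invented for illustration.
orders = DataProduct(
    name="checkout.orders",
    owner="checkout-team",
    columns=[
        Column("order_id", "string", "Unique identifier of the order"),
        Column("amount_eur", "decimal", "Gross order value in euros"),
    ],
)
print(orders.describe())
```

In the exercise, the same idea is applied to the previously undocumented dataset: every column gets a description, and the dataset as a whole gets a schema and an owner.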
Break (5 minutes)
The data infrastructure platform perspective (45 minutes)
- Exercise: Answer an access request by calling prepared functions; respond to many repeated access requests
- Presentation: What makes a good data infrastructure platform?—domain agnostic, self-service, etc.; the trap of taking centralized responsibility for data; platform thinking—multitenancy, how to enable interoperability, and how to stay out of domain responsibility
- Demo: Build a platform capability/self-service tool
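The motivation behind the demo can be sketched as follows. This is an assumed, minimal illustration (the `grant_access` function and the policy format are invented, not the course’s prepared functions): once data owners declare access policies, the platform can resolve requests automatically instead of a central team answering each one by hand—which is exactly the repetitive burden the exercise makes you feel.

```python
# Hypothetical sketch of a self-service access capability: requests
# are resolved against owner-defined policies, so the platform team
# never has to approve individual requests manually.
def grant_access(requests, policies):
    """Decide each (team, dataset) request against the dataset's policy."""
    decisions = {}
    for team, dataset in requests:
        allowed = team in policies.get(dataset, set())
        decisions[(team, dataset)] = "granted" if allowed else "denied"
    return decisions


# Data owners declare who may read their data products (names invented).
policies = {"checkout.orders": {"analytics", "finance"}}
requests = [
    ("analytics", "checkout.orders"),
    ("marketing", "checkout.orders"),
]
print(grant_access(requests, policies))
```

The design point is that the platform stays domain agnostic: it enforces whatever policy the owning team declares, without taking centralized responsibility for the data itself.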
Conclusion and wrap-up (10 minutes)
- Presentation: The goal state; key learnings; what we did not cover; follow-up suggestions
Max Schultze is a lead data engineer working on building a data lake at Zalando, Europe’s biggest online platform for fashion. His focus lies on building data pipelines at petabyte scale and productionizing Spark and Presto as analytical platforms inside the company. He graduated from the Humboldt University of Berlin, where he took part in the university’s initial development of Apache Flink.
Arif Wider is a professor of software engineering at HTW Berlin, Germany, and a lead technology consultant with Thoughtworks. At Thoughtworks, he worked with Zhamak Dehghani, who coined the term data mesh in 2019. Outside of teaching, Arif enjoys building scalable software that makes an impact, as well as building teams that create such software. More specifically, he is fascinated by applications of artificial intelligence and by how effectively building such applications requires data scientists and developers (like himself) to work closely together.