Chapter ThirteenImplementing the Data Warehouse

Remember, staging schemas hold transformations that build on top of the data sources residing in a data lake. In general, different sources track the same events or users in different ways and places. To get the most use of our company data, it's useful to combine these different sources into new tables that make future analyses simple and straightforward to write. After all, there's no need for a team to reference multiple staging tables containing user information in dozens of queries or models when an intermediate staging model could be built on top of the basic transformations we've so far completed to unite the user data.

This is an iterative process of trying out different ways to combine data to make it more useful for end users. The intermediate transformations needed to build combined tables fill up a warehouse fast. Before diving into the individual tables, we want to first touch on a key tool that should guide and inform your warehouse design.

Manage Dependencies

When staging models, select from their source replicas; we say that the staging models “depend” on their respective source replicas. This is the language of mathematical graphs. Using the language of graph theory proves an extremely useful tool when trying to keep track of dozens, if not hundreds, of data models in a mature warehouse. In particular, a large interconnected network of models can be visualized as a Directed Acyclic Graph (Figure 13.1).

Let's ...

Get The Informed Company now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.