Chapter Ten: Staging Schemas

Before building the single source of truth, we recommend first making idealized versions of each of the sources in the data lake. These are called staging schemas (Figure 10.1).

Generally, a staging schema holds models (i.e., tables or views) that handle light cleanup transformations: tables and columns that are unneeded or empty are removed, unclear names are improved, and unwanted data is filtered out. The value of investing time in cleaning and renaming data fields lies in making the data ready for more complex numeric transformations without needing to worry about these sorts of inconsistencies. Some may be concerned about the storage implications of having many staging tables or repeated data. The good news, however, is that the compression strategies of a modern C‐Store warehouse handle exactly this kind of data organization without noticeable performance hits.
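As a rough illustration of the cleanup described above, the sketch below builds one staging model as a view over a raw source table. It is only a minimal example under stated assumptions: SQLite (through Python's standard sqlite3 module) stands in for the warehouse, and the table, view, and column names (raw_crm_usr, stg_crm_users, and so on) are hypothetical, not taken from any real source.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a real warehouse (assumption).
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Hypothetical raw table as it might land in the data lake:
-- cryptic column names, an always-empty column, and test rows mixed in.
CREATE TABLE raw_crm_usr (
    usr_id      INTEGER,
    usr_eml     TEXT,
    crtd_ts     TEXT,
    legacy_flag TEXT,
    is_test     INTEGER
);

INSERT INTO raw_crm_usr VALUES
    (1, 'ana@example.com', '2021-03-01', NULL, 0),
    (2, 'qa@example.com',  '2021-03-02', NULL, 1),
    (3, 'li@example.com',  '2021-03-05', NULL, 0);

-- Hypothetical staging model: rename fields, drop the empty column,
-- and filter out unwanted (test) rows.
CREATE VIEW stg_crm_users AS
SELECT
    usr_id  AS user_id,
    usr_eml AS email,
    crtd_ts AS created_at
FROM raw_crm_usr
WHERE is_test = 0;
""")

# Query the staging view the way a downstream model would.
cursor = conn.execute(
    "SELECT user_id, email, created_at FROM stg_crm_users ORDER BY user_id"
)
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)
print(rows)
```

Defining the staging model as a view keeps the raw table untouched, so the cleanup rules live in one place and can be revised without reloading the source. Whether a view or a materialized table is the better fit depends on the warehouse and the data volume.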

Figure 10.1 The four stages of agile data organization with an intermediary step that illustrates where a “staging schema” is relevant.

Here's a last note on process: as each source is cleaned, it never hurts to give read‐only access to departments across your organization. Showing the value of the modeling and asking for feedback during creation are ways your team can earn trust and confidence from stakeholders.

Orient to the Schemas

Go through the list of tables in the schemas ...
