Chapter 2. ETL Data Structures

The back room area of the data warehouse has frequently been called the staging area. Staging in this context means writing to disk and, at a minimum, we recommend staging data at the four major checkpoints of the ETL data flow. The ETL team will need a number of different data structures to meet all the legitimate staging needs, and thus the purpose of this chapter is to describe all the types of data structures you are likely to need.

This chapter does not describe all the source data types you must extract from. We leave that to Chapter 3!

Note

PROCESS CHECK

Planning & Design:

Requirements/Realities → Architecture → Implementation → Test/Release

Data Flow: Extract → Clean → Conform → Deliver

We also try to step back from the details of the data structures to recommend general design principles for the staging area, including planning and design standards and an introduction to the metadata needed to support staging. Metadata is a big topic, and we gather a number of more specific metadata designs in Chapter 4 in the cleaning and conforming steps. Also, we tie all the metadata topics together toward the end of the book in Chapter 9.

To Stage or Not to Stage

The decision to store data in a physical staging area versus processing it in memory is ultimately the choice of the ETL architect. The ability to develop efficient ETL processes is partly dependent on being able to determine the right balance between physical input and output (I/O) and in-memory processing. ...

Get The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.