Chapter 20. Handling Complex Data Formats

In typical data warehousing scenarios, the ETL process extracts data from operational systems to load it into the data warehouse. In most cases, both the operational system and the data warehouse use an RDBMS to store the data. For the ETL process, this means that the data is at least available in a comprehensible relational format.

As should be clear from Chapters 6 through 9, a relational format does not suddenly mean that the transformations required to get the data from source to target database are trivial or even simple. But at least data is clearly separated from metadata, and there is no question with regard to fundamental relationships between the individual pieces of data: Rows tie attribute values together in a uniform format, and foreign keys establish relationships between rows in a clear and unambiguous manner.

Perhaps a more important feature is that the consistent use of relational data simplifies transformations. No matter how drastically data is transformed, input and output still share the fundamental characteristics of a table, and can still be seen as a collection of rows that are segmented into columns.

So what if the source data is not in a tabular format? In Chapter 21, you will read about XML and JSON, which are non-tabular because they allow arbitrary nesting of data structures. But still, although non-tabular, these data formats have a clear grammar and thus a clear separation of data and metadata.

In this chapter, ...

Get Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.