Chapter 7. Cleansing and Conforming

For many people, the core value of ETL is hidden behind the T, which denotes the Transform capabilities. Many people are still not very comfortable with a single T for all the work that takes place in this phase. Ralph Kimball has referred to ETL as ECCD, short for Extract, Cleanse, Conform, and Deliver, and Matt Casters used the recursive acronym Kettle as a name for the ETL tool where the double T stands for Transportation and Transformation. In any case, this chapter addresses the four of the 34 subsystems that cover cleansing and conforming data, or more specifically:

  • Subsystem 4: Data-Cleansing

  • Subsystem 5: Error Event Schema

  • Subsystem 6: Audit Dimension Assembler

  • Subsystem 7: Deduplication

The examples in this chapter are all based on the Sakila customer and address tables, but here we have messed up the data a bit in order to make the cleansing steps actually do something. We created duplicate records, misspelled some names, and added a few extra rows with our own information. The script needed to make these modifications, sakilamods.sql, can be downloaded from the book's companion site at www.wiley.com/go/kettlesolutions, but you can, of course, make your own modifications as well. The requirements for having the example transformations perform the designated task are described for each example to make it easy to follow along.

A large part of this chapter is devoted to data validation because before you can cleanse data, it must be clear which ...

Get Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.