CHAPTER 8

Information Product Improvement: Data Reengineering and Cleansing

“After the first four years, the dirt doesn't get any worse.”

–QUENTIN CRISP, THE NAKED CIVIL SERVANT

Data reengineering and cleansing is the process of information product improvement: taking existing defective data and correcting its deficiencies to bring it to an acceptable level of quality. This information “scrap and rework” parallels scrap and rework in manufacturing. Just as a defective manufactured product requires rework to correct its defects, missing or incorrect data requires rework to be cleansed. Data that cannot be cleansed (completed or corrected) is “scrapped”; that is, it is thrown out or identified as not correctable, in the same way an unfixable manufactured product is scrapped.
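The rework-or-scrap decision described above can be illustrated with a minimal sketch. It is not the chapter's method. The record layout (`name`, `zip`, `state`), the `ZIP_PREFIX_TO_STATE` lookup, and the function names `cleanse` and `rework` are all hypothetical, chosen only to show one record being completed and others being flagged as not correctable.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical reference data: repairs a missing state from a ZIP code prefix.
ZIP_PREFIX_TO_STATE = {"37": "TN", "10": "NY"}


@dataclass
class ReworkResult:
    cleansed: list = field(default_factory=list)  # records corrected or completed
    scrapped: list = field(default_factory=list)  # records flagged as not correctable


def cleanse(record: dict) -> Optional[dict]:
    """Return a corrected copy of the record, or None if it cannot be fixed."""
    fixed = dict(record)
    name = (fixed.get("name") or "").strip()
    if not name:
        return None  # assumption: a missing name cannot be recovered
    fixed["name"] = name.title()
    if not fixed.get("state"):
        # Try to complete the missing state from the ZIP prefix.
        state = ZIP_PREFIX_TO_STATE.get((fixed.get("zip") or "")[:2])
        if state is None:
            return None
        fixed["state"] = state
    return fixed


def rework(records: list) -> ReworkResult:
    """Cleanse what can be cleansed; set aside the rest as scrap."""
    result = ReworkResult()
    for record in records:
        fixed = cleanse(record)
        if fixed is None:
            result.scrapped.append(record)  # kept, but identified as not correctable
        else:
            result.cleansed.append(fixed)
    return result


records = [
    {"name": "  jane doe ", "zip": "37027", "state": ""},   # correctable
    {"name": "", "zip": "10001", "state": "NY"},            # missing name
    {"name": "John Roe", "zip": "99999", "state": None},    # unknown ZIP prefix
]
result = rework(records)
print(len(result.cleansed), "cleansed;", len(result.scrapped), "scrapped")
```

In this sketch scrapped records are set aside rather than deleted, which matches the option of identifying data as not correctable instead of throwing it out.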

In this chapter we define what is meant by information product improvement. Information product improvement, essentially the correction of defective data, is sometimes called data reengineering, data cleansing, data scrubbing, or data transformation. We define its three areas: cleansing data at the source database, cleansing data for data conversion, and cleansing data for data warehousing.

We next describe the kinds of data problems encountered in a data cleanup project. Information quality problems exist in the data definition and data architecture itself, as well as in the data within databases and files.

We then describe the steps of the data reengineering and ...
