The corrupt data validation and cleansing design pattern

This design pattern treats corrupt data as either noise or an outlier, and discusses in detail the techniques used to identify and cleanse it.


This design pattern explores the use of Pig to validate and cleanse corrupt data in a dataset. It sets the context by considering data corruption across Big Data sources, ranging from sensor data to structured data. It examines data corruption from two perspectives, noise and outliers, as given in the following list:

  • Noise can be defined as a random error in measurement that has caused corrupt data to be ingested along with the ...
