Validating data before it enters the persistence layer of the Data Lake is an important step. In the context of a Data Lake, validation covers two aspects:
- Origin of data: Making sure that the right data from the right source is ingested into the Data Lake. The source from which the data originates should be known, and the incoming data should be authorized by the Data Lake for ingestion.
- Quality of data: Making sure that data ingested into the Data Lake has some initial checks performed on its attributes, so that each incoming value conforms to the format it claims to have. For example, an attribute in a record that is declared as an email could be validated against a proper email format. ...
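The two aspects above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `AUTHORIZED_SOURCES` allow-list, the `source` and `email` field names, and the `validate_record` function are hypothetical. The email pattern is intentionally simple and only performs a first-pass format check.

```python
import re

# Hypothetical allow-list of sources authorized to feed the Data Lake.
AUTHORIZED_SOURCES = {"crm", "billing"}

# Deliberately simple pattern: something@something.tld with no whitespace.
# This is a first-pass format check, not a full RFC 5322 validator.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate_record(record: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the record passes."""
    problems = []

    # Origin of data: the source must be known and authorized.
    source = record.get("source")
    if source is None:
        problems.append("unknown source")
    elif source not in AUTHORIZED_SOURCES:
        problems.append(f"unauthorized source: {source}")

    # Quality of data: an attribute declared as an email must look like one.
    email = record.get("email")
    if email is not None and not EMAIL_PATTERN.fullmatch(str(email)):
        problems.append(f"invalid email format: {email}")

    return problems
```

A record would be passed on to the persistence layer only when `validate_record` returns an empty list. Returning the list of problems, rather than a single boolean, makes it possible to log or quarantine rejected records together with the reason for rejection.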