Chapter 3. Curating the Data Lake

A cost-effective, scale-out platform is exciting, but without controls in place no one will trust it for business-critical applications. It might work for ad hoc use cases, but to scale the lake and realize its value you still need the kind of management and governance layer that organizations are accustomed to having in traditional data warehouse environments.

For example, consider a bank aggregating risk data across different lines of business into a common risk reporting platform to comply with the Basel Committee on Banking Supervision's standard BCBS 239. The data must be of very high quality and have well-documented lineage to ensure that the reports are correct, because the bank depends on those reports to make key decisions about how much capital to carry. Without this lineage, there is no guarantee that the data is accurate.

A data lake makes sense for this kind of data, because it can scale out as you bring together large volumes of different risk datasets across lines of business. But to succeed with business use cases like this one, a data lake needs a management platform that supports metadata as well as quality and governance controls.

These controls include the right tools and the right process. Process can be as simple as assigning stewards to new datasets or forming a data lake enterprise data council to establish data definitions and standards.
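To make the idea of stewardship and lineage controls concrete, here is a minimal sketch in Python. It is not taken from any particular data lake product: the `DatasetEntry` record, the `governance_gaps` function, and the dataset and steward names are all hypothetical, chosen only to illustrate how a catalog might flag a dataset that lacks a steward or traceable sources.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetEntry:
    """Hypothetical catalog record for one dataset in the lake."""
    name: str
    steward: Optional[str] = None  # person accountable for the dataset
    upstream: list[str] = field(default_factory=list)  # lineage: source datasets


def governance_gaps(entry: DatasetEntry, catalog: dict[str, DatasetEntry]) -> list[str]:
    """Return the reasons an entry is not yet fit for business-critical use."""
    gaps = []
    if not entry.steward:
        gaps.append("no steward assigned")
    # Every upstream source must itself be registered, or lineage is broken.
    for source in entry.upstream:
        if source not in catalog:
            gaps.append(f"unknown upstream dataset: {source}")
    return gaps


# Illustrative data only: one registered source and one unregistered source.
catalog = {
    "retail_risk": DatasetEntry("retail_risk", steward="a.chen"),
}
report = DatasetEntry(
    "group_risk_report",
    upstream=["retail_risk", "trading_risk"],
)
print(governance_gaps(report, catalog))
```

In this sketch the aggregated report is flagged twice: it has no steward, and one of its sources is not in the catalog, so its lineage cannot be traced. A real management platform would track far more (schemas, quality metrics, access policies), but the same principle applies: a dataset is not promoted until its ownership and lineage are recorded.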

Questions to ask when considering ...
