Chapter 5. Deriving Value from the Data Lake

The purpose of a data lake is to provide value to the business by serving users. From a user perspective, these are the most important questions to ask about the data:

  • What is in the data lake (the catalog)?

  • What is the quality of the data?

  • What is the profile of the data?

  • What is the metadata of the data?

  • How can users do enrichments, cleanups, enhancements, and aggregations without going to IT (that is, how can they use the data lake in a self-service way)?

  • How can users annotate and tag the data?

Answering these questions requires putting the proper architecture, governance, and security rules in place and adhering to them, so that the right people get access to the right data in a timely manner. The onboarding of data sets must be strictly governed, naming conventions must be established and enforced, and security policies must be in place to ensure role-based access control.
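As a rough illustration of how enforced naming conventions and role-based access control can work together, consider the minimal Python sketch below. Everything in it is an assumption made for the example, not something prescribed here: the `<zone>_<source>_<subject>` naming pattern, the zone names, the roles, and the `ROLE_GRANTS` mapping are all hypothetical.

```python
import re

# Hypothetical convention: <zone>_<source>_<subject>, all lowercase,
# for example "raw_crm_customers". The zone names are assumptions.
NAME_PATTERN = re.compile(r"^(raw|trusted|refined)_[a-z0-9]+_[a-z0-9_]+$")

# Hypothetical mapping from role to the zones that role may read.
ROLE_GRANTS = {
    "data_steward": {"raw", "trusted", "refined"},
    "analyst": {"trusted", "refined"},
}


def is_valid_name(name: str) -> bool:
    """Return True if the data set name follows the assumed convention."""
    return NAME_PATTERN.fullmatch(name) is not None


def can_read(role: str, dataset_name: str) -> bool:
    """Return True if the role may read the data set.

    Names that break the convention are rejected outright, so a data set
    that was onboarded incorrectly is never exposed by accident.
    """
    if not is_valid_name(dataset_name):
        return False
    zone = dataset_name.split("_", 1)[0]
    return zone in ROLE_GRANTS.get(role, set())
```

With these assumed rules, `can_read("analyst", "trusted_crm_customers")` is allowed, while `can_read("analyst", "raw_crm_customers")` is denied. Deriving the zone from the name is one possible design choice; it only works if the naming convention is actually enforced at onboarding.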

Self-Service

For our purposes, self-service means that non-technical business users can access and analyze data without involving IT.

In a self-service model, users should be able to see the metadata and profiles and understand what the attributes of each data set mean. The metadata must provide enough information for users to create new data formats out of existing data formats, using enrichments and analytics.
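To make the idea of a data profile concrete, here is a minimal sketch in plain Python of the kind of per-attribute summary a user might want to see. The function names `profile_column` and `profile_rows` and the chosen statistics (row count, nulls, distinct values, most common value) are assumptions for illustration; real profiling tools typically report much more.

```python
from collections import Counter


def profile_column(values):
    """Summarize one attribute: row count, nulls, distinct values, top value."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def profile_rows(rows):
    """Profile every attribute found in a list of dict records.

    An attribute missing from a record is counted as a null.
    """
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Example with made-up records.
rows = [
    {"country": "US", "age": 34},
    {"country": "US", "age": None},
    {"country": "DE"},
]
profile = profile_rows(rows)
```

For the made-up records above, the profile for `age` reports three rows with two nulls, which is the kind of signal that helps a user judge whether an attribute is usable before building on it.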

Also, in a self-service model, the catalog will be the foundation for users to register all of the different data sets in the data lake. This ...
