Chapter 7. Architecting the Data Lake

There are many ways to organize data in a data lake. In this chapter, we will start with how to organize a data lake into zones. Then we’ll compare and contrast on-premises and cloud data lakes. Finally, we’ll discuss virtual data lakes, which minimize resource usage and the overhead of maintaining a data lake while providing equivalent functionality to physical data lakes.

Organizing the Data Lake

Once a data lake is established, analysts need a way to find and understand the data it contains. This is a formidable task given the wide variety of data in most enterprises: one large retailer I spoke with had over 30,000 data sources feeding its data lake, and said that each source might provide hundreds or even thousands of tables. Even when analysts find the right data set, they need to know whether they can trust the data. Finally, for users to roam the lake freely, sensitive data must be identified and protected so that it is not exposed inadvertently. All these tasks fall under the umbrella of data governance.
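To make the three governance tasks above concrete (finding data, judging whether to trust it, and protecting sensitive fields), here is a minimal Python sketch. It is only an illustration under stated assumptions: the `CatalogEntry` class, the `quality_checked` flag, the `search` and `visible_columns` helpers, and the sample data sets are all hypothetical names invented for this example, not part of any particular data lake product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record describing one data set in the lake."""
    name: str
    description: str
    quality_checked: bool = False  # assumed trust signal, e.g. set after a review
    sensitive_columns: set = field(default_factory=set)


def search(catalog, keyword):
    """Find entries whose name or description mentions the keyword."""
    kw = keyword.lower()
    return [
        entry for entry in catalog
        if kw in entry.name.lower() or kw in entry.description.lower()
    ]


def visible_columns(entry, columns, can_see_sensitive):
    """Return the columns a user may see, hiding sensitive ones by default."""
    if can_see_sensitive:
        return list(columns)
    return [c for c in columns if c not in entry.sensitive_columns]


# Illustrative data only; a real lake would hold far more entries.
catalog = [
    CatalogEntry("store_sales", "Daily sales per store", quality_checked=True),
    CatalogEntry(
        "customer_profiles",
        "Customer contact details",
        sensitive_columns={"email", "phone"},
    ),
]

for entry in search(catalog, "customer"):
    status = "checked" if entry.quality_checked else "unverified"
    print(entry.name, status, visible_columns(entry, ["id", "email", "city"], False))
```

A real catalog would back these lookups with a metadata store and an access-control system; the sketch only shows how findability, trust, and sensitivity can be captured as metadata attached to each data set.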

In the old days of data warehousing, data governance was implemented by a large team of data stewards, data architects, and data engineers. Changes had to be carefully reviewed and approved. Data quality, data access, management of sensitive data, and other aspects of data governance were carefully considered and managed. But in the era of self-service, this approach does not scale. In fact, the exploratory and ...
