Chapter 5. Governing Your Data Lake

Now that you’ve built the “house” for your data lake, you need to consider governance. Do this before you open the data lake to users, because the kind of governance you put in place directly affects data lake security, user productivity, and overall operational costs. As described in Chapter 3, you need to create three governance plans:

  • Data governance

  • Financial governance

  • Security governance

Data Governance

When formulating a data governance policy, you’ll inevitably encounter these questions about the data life cycle:

  • How long is this data good for?

  • How long will it be valuable?

  • Should I keep it forever or eventually throw it away?

  • Do I need to store it because of government regulations?

  • Should I put it into “colder” storage to lower costs?

Many enterprises have data that doesn’t need to be accessed frequently. Data has a natural life cycle, and an important data governance task is to manage it as it moves between storage resources over the course of that life cycle. Storage life-cycle management is therefore an increasingly important part of data storage decisions. The cloud offers a variety of storage options that differ in volume, cost, and performance, and you can choose among them depending on where your data currently sits in its life cycle.
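To make the life-cycle questions above concrete, here is a minimal sketch of a tiering rule. It is not taken from any cloud provider's API. The tier names, the `LifecyclePolicy` fields, the day thresholds, and the `choose_tier` function are all hypothetical, chosen only to illustrate how age, access recency, and regulatory retention might combine into one decision.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical tier names; real providers use their own storage classes.
HOT, COOL, ARCHIVE, DELETE = "hot", "cool", "archive", "delete"


@dataclass
class LifecyclePolicy:
    # All thresholds are illustrative assumptions, not recommendations.
    cool_after_days: int = 30                # idle this long: move to cheaper storage
    archive_after_days: int = 180            # idle this long: move to the coldest tier
    expire_after_days: Optional[int] = 730   # None means "keep forever"
    legal_hold_days: int = 0                 # minimum retention required by regulation


def choose_tier(created: date, last_accessed: date, today: date,
                policy: LifecyclePolicy) -> str:
    """Return the tier a dataset should live in under the given policy."""
    age_days = (today - created).days
    idle_days = (today - last_accessed).days

    # Expire only when the data is past its useful life AND past any
    # regulatory retention period.
    if policy.expire_after_days is not None:
        if age_days >= max(policy.expire_after_days, policy.legal_hold_days):
            return DELETE

    # Otherwise, the longer the data has gone unread, the colder the tier.
    if idle_days >= policy.archive_after_days:
        return ARCHIVE
    if idle_days >= policy.cool_after_days:
        return COOL
    return HOT
```

For example, with the default policy a dataset created three years ago would be marked for deletion, but setting `legal_hold_days=3650` keeps the same dataset in the archive tier until the ten-year retention period has passed.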

Public cloud providers like AWS and Azure offer storage life-cycle management services. These allow you to move data to and from ...
