Chapter 9. Governing Data Access
This chapter describes the challenges of providing analysts access to the data in a data lake and presents several best practices for doing so. Data lakes differ from more traditional data storage in several ways:
- Load: The numbers of data sets, users, and changes are extremely high.
- Frictionless ingestion: Because a data lake stores data for future, yet-to-be-determined analytics, it usually ingests the data with minimal, if any, processing.
- Encryption: There are often government or internal regulations that require sensitive or personal information to be protected, yet that data is needed for analysis.
- Exploratory nature of work: A lot of data science work cannot be anticipated by IT staff. Data scientists often do not know what’s available in the huge and diverse data store. This creates a catch-22 situation for traditional approaches: if analysts cannot find data that they don’t have access to, they can’t ask for access to it.
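The catch-22 in the last item can be made concrete with a small sketch. It assumes a hypothetical in-memory catalog whose search results are filtered by the permissions the caller already holds, which is roughly how an access-gated store behaves. The `Dataset` class, the `search` function, and the sample names are illustrative assumptions and are not taken from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical catalog entry: a name plus the users allowed to read it."""
    name: str
    readers: set[str] = field(default_factory=set)


def search(catalog: list[Dataset], user: str, keyword: str) -> list[str]:
    """Return matching data set names, limited to those the user can already read."""
    return [
        d.name
        for d in catalog
        if keyword in d.name and user in d.readers
    ]


# Illustrative data only.
catalog = [
    Dataset("sales_2023", readers={"alice", "bob"}),
    Dataset("sales_forecast_inputs", readers={"bob"}),
]

# Alice never learns that sales_forecast_inputs exists, so she cannot request it.
visible_to_alice = search(catalog, "alice", "sales")
visible_to_bob = search(catalog, "bob", "sales")
```

In this sketch the restricted data set is invisible to the analyst who lacks access, so no access request can ever be raised for it.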
The easiest access model is to provide all analysts access to all data. Unfortunately, this cannot be done if the data is subject to government regulations (as is the case, for example, with personally identifiable information or credit card information), is copyrighted with restricted access (e.g., if it has been purchased or obtained from external sources for very specific or limited use), or is considered critical and sensitive by the company for competitive or other reasons. Most companies have data they consider ...
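As a rough illustration of why the open-access model breaks down, the following sketch checks data sets against the three kinds of restriction named above: government regulation, restricted licensing, and company sensitivity. The `Restriction` flags, the `can_open_to_all` function, and the sample data sets are assumptions made for illustration only.

```python
from enum import Flag, auto


class Restriction(Flag):
    """Hypothetical restriction flags mirroring the categories in the text."""
    NONE = 0
    REGULATED = auto()   # e.g., personally identifiable or credit card information
    LICENSED = auto()    # purchased or obtained externally for limited use
    SENSITIVE = auto()   # considered critical by the company


def can_open_to_all(restrictions: Restriction) -> bool:
    """Open access is possible only when no restriction applies."""
    return restrictions == Restriction.NONE


# Illustrative data only.
datasets = {
    "public_weather": Restriction.NONE,
    "customer_cards": Restriction.REGULATED | Restriction.SENSITIVE,
    "purchased_market_data": Restriction.LICENSED,
}

open_datasets = sorted(
    name for name, r in datasets.items() if can_open_to_all(r)
)
```

Under these assumptions a single restricted data set is enough to rule out giving every analyst access to everything.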