Chapter 2. Designing Your Data Lake

Determining what technologies to employ when building your data lake stack is a complex undertaking. You must consider storage, processing, data management, and so on. Figure 2-1 shows the relationships among these tasks.

Figure 2-1. The data lake technology stack

Cloud, On-Premises, Multicloud, or Hybrid

In the past, most data lakes resided on-premises. That has shifted dramatically in recent years, with most companies looking to the cloud to replace or augment their implementations.

Whether to use on-premises or cloud storage and processing is a complicated and important decision for any organization. The pros and cons of each could fill a book and depend heavily on the individual implementation. Generally speaking, on-premises storage and processing offer tighter control over data security and data privacy, whereas public cloud systems offer highly scalable and elastic storage and computing resources that meet enterprises’ need for large-scale processing and data storage without the overhead of provisioning and maintaining expensive infrastructure.

With the rapidly changing tools and technologies in the ecosystem, we have also seen many examples of cloud-based data lakes used as incubators for dev/test environments, where new tools and technologies can be evaluated at a rapid pace before picking the right one to ...
