Chapter 3. AWS, Azure, and GCP Architecture

The cloud is the preferred destination for most data lakes these days, and cloud data lakes are driving the momentum in the big data space. Cloud providers are not only benefiting from this trend but also making migrations easier. Today, the market is filled with dozens of companies offering data ingestion and integration options for batch, streaming, and API sources, such as StreamSets, Striim, HVR, and Cloudera DataFlow. Vendors also offer optimized analytical engines, including SQL engines such as Presto, Dremio, and AtScale, as well as ML frameworks such as H2O, RapidMiner, and KNIME. The gauntlet has been thrown: organizations are wresting back control of their data lakes by adding data governance to ensure they don't turn into data swamps. Unfortunately, governance remains the most underdeveloped capability of the cloud platforms thus far, and external products are still required to secure data lakes and meet the ever-growing demands of global data privacy regulations.
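
To make the SQL-engine layer concrete, here is a minimal sketch of querying data lake files from Python through a Presto-compatible engine. It assumes a Trino coordinator (the successor to the PrestoSQL project) is already running; the hostname, catalog, schema, and table names are hypothetical placeholders, not part of any specific vendor's offering.

    import trino  # open source Python client for Trino (Presto-compatible)

    # Connect to a hypothetical coordinator; the engine reads Parquet/ORC
    # files directly from object storage (S3, ADLS, GCS) rather than
    # loading the data into a separate warehouse first.
    conn = trino.dbapi.connect(
        host="trino.example.internal",  # hypothetical endpoint
        port=8080,
        user="analyst",
        catalog="hive",   # catalog backed by the data lake's object store
        schema="sales",   # hypothetical schema mapped to a folder of files
    )

    cur = conn.cursor()
    cur.execute(
        "SELECT region, SUM(amount) AS total "
        "FROM orders "
        "GROUP BY region "
        "ORDER BY total DESC"
    )
    for region, total in cur.fetchall():
        print(region, total)

The same pattern applies to the other engines mentioned above: the files stay in cheap object storage, and the engine of choice is pointed at them through a catalog.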

Finally, building data lakes requires deep knowledge of a wide range of tools and technologies. Because finding skilled talent is a perpetual issue, cloud vendors are starting to add more automation and to package components together to make it easier to build and deploy cloud data lakes.
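
As a rough illustration of what that automation wraps, the following sketch provisions the two most basic building blocks of a cloud data lake on AWS using boto3: an object-store bucket for raw files and a catalog database that query engines and ETL jobs can use to discover tables. The bucket and database names are made up for the example, and real deployments add security, permissions, and ingestion pipelines on top of these primitives.

    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3", region_name="us-east-1")
    glue = boto3.client("glue", region_name="us-east-1")

    # Landing zone for raw files (CSV, JSON, Parquet, and so on).
    s3.create_bucket(Bucket="example-raw-data-lake")

    # Catalog database so query engines and ETL jobs can find the data.
    glue.create_database(
        DatabaseInput={
            "Name": "raw_zone",
            "Description": "Hypothetical raw zone for the example data lake",
        }
    )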

This chapter provides an overview of end-to-end architectures of three of the main public cloud providers: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). The focus ...
