Chapter 16. Solutions in the Public Cloud

Our discussion of public clouds is different from that of private clouds. In contrast to private cloud Hadoop services, there are thousands of examples in which large organizations and enterprises are successfully running Hadoop in the public cloud.

In the coming chapters, we focus our discussion on the three largest public cloud providers in the market:

  • Amazon Web Services (AWS)

  • Microsoft Azure

  • Google Cloud Platform (GCP)

This chapter looks at the portfolios of our three cloud providers through the lens of Hadoop. We cover the key categories for each: instances, storage, and possible life cycle models. Next, we offer advice on how to use the provider portfolios to implement clusters and big data use cases.

Key Things to Know

Part of the value proposition of the cloud is that IT services become a black box that you do not have to worry about. The flip side is that you do not know what is going on inside the black box. For running Hadoop in a public cloud, this is mostly good news, with some bad news mixed in. Here are some key things to keep in mind:

Life cycle models

“Cluster Life Cycle Models” explains that storage choices define the life cycle options of virtual Hadoop clusters. In the public cloud, much attention shifts toward transient life cycle models because they are much easier to implement. You should take care when implementing sticky clusters in the public cloud, since the instances that host ...
