Chapter 18. Security in the Cloud

At this point, the path to securing an on-premises cluster is well-trodden. As covered in Chapter 9, vendor distributions of Hadoop contain a full suite of products and features providing authentication, authorization, auditing, and encryption. In this chapter, we explore how operating in a public cloud should change your approach to security. It is impossible to cover all aspects of cloud security in a single chapter, but we aim to provide you with enough information to feel comfortable architecting Hadoop-based solutions. We begin by briefly outlining the risks and threat model for running in the cloud. Following that, we dive into the specifics of Hadoop security, including identity management, securing object storage, encryption, and network security.

To keep our discussion focused, we mostly deal with unmanaged clusters using the sticky or suspendable deployment patterns (see “Cluster Life Cycle Models”), rather than managed PaaS offerings such as Amazon Elastic MapReduce (Amazon EMR) or Google Dataproc. For additional information, review the documentation of the providers themselves. As a general reference, we also highly recommend Moving Hadoop to the Cloud (O’Reilly) by Bill Havanki.

Assessing the Risk

As an enterprise architect, you might be asking yourself what security you need in the cloud. There are a few ways to answer this question, depending on the level of risk your enterprise is willing to adopt. The right questions to ...
