Chapter 2. Security and Compliance Challenges and Constraints in DevOps

Let’s begin by looking at the major security and compliance challenges and constraints for DevOps.

Speed: The Velocity of Delivery

The velocity of change in IT continues to increase. This became a serious challenge for security and compliance with Agile development teams delivering working software in one- or two-week sprints. But the speed at which some DevOps shops initiate and deliver changes boggles the mind. Organizations like Etsy are pushing changes to production 50 or more times each day. Amazon has thousands of small (“two pizza”) engineering teams working independently and continuously deploying changes across their infrastructure. In 2014, Amazon deployed 50 million changes: that’s more than one change deployed every second of every day.1

So much change so fast...

How can security possibly keep up with this rate of change? How can security teams understand the risks, and what can they do to manage them, when there is no time for pen testing or audits, no place to put control gates, and no chance to add a security sprint or a hardening sprint before the system is released to production?

Where’s the Design?

DevOps builds on Agile development practices and extends Agile ideas and practices from development into operations.

A challenge for many security teams already working in Agile environments is that developers spend much less time upfront on design. The Agile Manifesto values “working software over comprehensive documentation,” which often means that “the code is the design,” and “Big Design Up Front” is an antipattern in most Agile shops. These teams want to start with a simple, clean approach and elaborate the design in a series of sprints as they get more information about the problem domain. The principle of YAGNI (“You Ain’t Gonna Need It”) reminds them that most features specified upfront might never be used, so they try to cut the design and feature-set to a minimum and deliver only what is important, as soon as they can, continuously refactoring as they make changes.

Many DevOps teams take all of these ideas even further, especially teams following a Lean Startup approach. They start with a Minimum Viable Product (MVP): the simplest and cheapest implementation of an idea with a basic working feature-set, which is delivered to real users in production as soon as possible. Then, using feedback from those users, collecting metrics, and running A/B experiments, they iterate and fill out the rest of the functionality, continuously delivering changes and new features as quickly as possible in order to get more feedback to drive further improvements in a continuous loop, adapting and pivoting as needed.

All of this makes it difficult for security teams to understand where and how they should step in and make sure that the design is secure before coding gets too far along. How do you review the design if design isn’t done, when there is no design document to be handed off, and the design is constantly changing along with the code and the requirements? When and how should you do threat modeling?

Eliminating Waste and Delays

DevOps is heavily influenced by Lean principles: maximizing efficiency and eliminating waste, delays, and unnecessary costs. Success is predicated on being first to market with a new idea or feature, which means that teams measure, and optimize for, the cycle time to delivery. Teams, especially those in a Lean Startup, want to fail fast and fail early. They do rapid prototyping and run experiments in production, with real users, to see if their idea is going to catch on or to understand what they need to change to make it successful.

This increases the tension between delivery and security. How much do engineers need to invest—how much can they afford—in securing code that could be thrown away or rewritten in the next few days? When is building in security the responsible thing to do, and when is it wasting limited time and effort?

In an environment driven by a continuous flow of value, managed through Kanban and other Lean techniques to eliminate bottlenecks, and automated to maximize efficiency, security cannot get in the way. This is a serious challenge for security and compliance teams, who are generally more concerned with doing things right and minimizing risk than with being efficient or providing fast turnaround.

It’s in the Cloud

Although you don’t need to run your systems in the cloud to take advantage of DevOps, you probably do need to follow DevOps practices if you are operating in the cloud. This means that the cloud plays a big role in many organizations’ DevOps stories (and vice versa).

In today’s cloud, Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform do a great deal for you. They eliminate the wait for hardware to be provisioned; they take away the upfront cost of buying hardware and setting up a data center; and they offer elastic capacity on demand to keep up with growth. These services hide the details of managing the data center and networks, and of standing up, configuring, managing, and monitoring servers and storage.

There are so many capabilities included now, capabilities that most shops can’t hope to provide on their own, including built-in security and operations management functions. A cloud platform like AWS offers extensive APIs into services for account management, data partitioning, auditing, encryption and key management, failover, storage, monitoring, and more. They also offer templates for quickly setting up standard configurations.

But you need to know how to find and use all of this properly. And in the shared responsibility model for cloud operations, you need to understand where the cloud provider’s responsibilities end and yours begin, and how to ensure that your cloud provider is actually doing what you need them to do.

The Cloud Security Alliance’s “Treacherous Twelve” highlights some of the major security risks facing users of cloud computing services:

  1. Data breaches

  2. Weak identity, credential, and access management

  3. Insecure interfaces and APIs

  4. System and application vulnerabilities

  5. Account hijacking

  6. Malicious insiders

  7. Advanced Persistent Threats (APTs)

  8. Data loss

  9. Insufficient due diligence

  10. Abuse and nefarious use of cloud services

  11. Denial of Service

  12. Shared technology issues

Microservices

Microservices are another part of many DevOps success stories. A microservices architecture (designing small, isolated functions that can be changed, tested, deployed, and managed completely independently) lets developers move fast and innovate without being held up by the rest of the organization. This architecture also encourages developers to take ownership of their part of the system, from design to delivery and ongoing operations. Amazon and Netflix have had remarkable success with building their systems as well as their organizations around microservices.

But the freedom and flexibility that microservices enable come with some downsides:

  • Operational complexity. Understanding an individual microservice is simple (that’s the point of working this way). Understanding and mapping traffic flows and runtime dependencies between different microservices, debugging runtime problems, and trying to prevent cascading failures are much harder. As Michael Nygard says: “An individual microservice fits in your head, but the interrelationships among them exceed any human’s understanding.”

  • Attack surface. The attack surface of any microservice might be tiny, but the total attack surface of the system can be enormous and hard to see.

  • No clear perimeter. Unlike a tiered web application, there are no obvious “choke points” where you can enforce authentication or access control rules. You need to make sure that trust boundaries are established and consistently enforced.

  • The polyglot programming problem. If each team is free to use what they feel are the right tools for the job (like at Amazon), it can become extremely hard to understand and manage security risks across many different languages and frameworks.

  • Inconsistent logging. Unless all of the teams agree to standardize on a consistent activity logging strategy, forensics and auditing across different services with different logging approaches can be a nightmare.
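As an illustration of what a consistent activity logging strategy could look like, here is a minimal sketch in which every service emits one JSON object per event with the same keys. The field names in `AUDIT_FIELDS`, the `JsonAuditFormatter` class, and the `audit` helper are all hypothetical; they are assumptions made for this example, not a schema prescribed by any of the organizations mentioned above.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical shared schema: fields every service agrees to emit.
AUDIT_FIELDS = ("service", "request_id", "actor", "action")


class JsonAuditFormatter(logging.Formatter):
    """Render each log record as a single JSON object with fixed keys."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for name in AUDIT_FIELDS:
            # Make a missing field visible instead of silently omitting it.
            entry[name] = getattr(record, name, "unknown")
        return json.dumps(entry, sort_keys=True)


def audit(logger, service, request_id, actor, action, message):
    """Log one audit event carrying the shared fields."""
    logger.info(
        message,
        extra={
            "service": service,
            "request_id": request_id,
            "actor": actor,
            "action": action,
        },
    )
```

A service would attach `JsonAuditFormatter()` to its log handler and call `audit(...)` for security-relevant events. Because every team emits the same keys, logs from different services can be joined on `request_id` during an investigation, whatever language each service is written in.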

Containers

Containers—LXC, rkt, and (especially) Docker—have exploded in DevOps.

Container technologies like Docker make it much easier for developers to package and deploy all of the runtime dependencies that their application requires. This eliminates the “works on my machine” configuration management problem, because you can ship the same runtime environment from development to production along with the application.

Using containers, operations can deploy and run multiple different stacks on a single box with much less overhead and less cost than using virtual machines. Used together with microservices, this makes it possible to support microsegmentation; that is, individual microservices each running in their own isolated, individually managed runtime environments.

Containers have become so successful, Docker in particular, because they make packaging and deployment workflows easy for developers and for operations. But this also means that it is easy for developers—and operations—to introduce security vulnerabilities without knowing it.

The ease of packaging and deploying apps using containers can also lead to unmanageable container sprawl, with many different stacks (and different configurations and versions of these stacks) deployed across many different environments. Finding them all (even knowing to look for them in the first place), checking them for vulnerabilities, and making sure they are up-to-date with the latest patches can become overwhelming.

And while containers provide some isolation and security protection by default, helping to reduce the attack surface of an application, they also introduce a new set of security problems. Adrian Mouat, author of Using Docker, lists five security concerns with using Docker that you need to be aware of and find a way to manage:

Kernel exploit
The kernel is shared between the host and all of the containers, which means that a vulnerability in the kernel exposes everything running on the machine to attack.
Denial of Service attacks
Problems in one container can DoS everything else running on the same machine, unless you limit resources using cgroups.
Container breakouts
Because isolation in containers is not as strong as in a virtual machine, you should assume that if an attacker gets access to one container, they could break into any of the other containers on that machine.
Poisoned images
Docker makes it easy to assemble a runtime stack by pulling down dependencies from registries. However, this also makes it easy to introduce vulnerabilities by pulling in out-of-date images, and it makes it possible for bad guys to introduce malware along the chain. Docker and the Docker community provide tools like trusted registries and image scanning to manage these risks, but everyone has to use them properly.
Compromising secrets
Containers need secrets to access databases and services, and these secrets need to be protected.
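
The resource limits mentioned under Denial of Service attacks can be sketched as follows. This hypothetical helper only builds the argument list for `docker run`; the function name, the default limit values, and the image name in the usage note are assumptions chosen for illustration, and appropriate limits depend on the workload. The flags themselves are standard `docker run` options.

```python
def hardened_run_command(image, memory="256m", cpus="0.5", pids_limit=100):
    """Return a `docker run` argument list with resource limits applied.

    The default values are illustrative assumptions, not recommendations.
    """
    return [
        "docker", "run", "--detach",
        "--memory", memory,                # cgroup memory ceiling
        "--cpus", cpus,                    # cgroup CPU quota
        "--pids-limit", str(pids_limit),   # cap on the number of processes
        "--read-only",                     # mount the root filesystem read-only
        "--cap-drop", "ALL",               # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        image,
    ]
```

For example, `hardened_run_command("registry.example.com/orders:1.4.2")` returns a list that can be passed to `subprocess.run`. Limits like these restrict how much damage one misbehaving container can do to its neighbors, but they do not address kernel exploits or poisoned images.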

You can lock down a container by using CIS guidelines and other security best practices and using scanning tools like Docker Bench, and you can minimize the container’s attack surface by stripping down the runtime dependencies and making sure that developers don’t package up development tools in a production container. But all of this requires extra work and knowing what to do. None of it comes out of the box.
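
Some of these checks can be automated early in the pipeline. The following is a minimal sketch of a Dockerfile lint that flags unpinned base images and development tools installed in an image. It is not a substitute for Docker Bench or a real image scanner: the `DEV_TOOLS` deny-list and the `lint_dockerfile` function are hypothetical, and the check is deliberately naive (for example, it does not distinguish a build stage from the final stage in a multi-stage build).

```python
import re

# Hypothetical deny-list; a real one would reflect your own stack.
DEV_TOOLS = {"build-essential", "gcc", "g++", "make", "gdb", "strace"}


def lint_dockerfile(text):
    """Return (line_number, message) findings for a Dockerfile's text."""
    findings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        upper = line.upper()
        if upper.startswith("FROM "):
            # Skip option flags such as --platform=... to find the image.
            tokens = [t for t in line.split()[1:] if not t.startswith("--")]
            if not tokens:
                continue
            image = tokens[0]
            if image == "scratch" or "@sha256:" in image:
                continue
            # Look for a tag after the last path segment, so a registry
            # port (host:5000/app) is not mistaken for a tag.
            name_tag = image.rsplit("/", 1)[-1]
            if ":" not in name_tag or name_tag.endswith(":latest"):
                findings.append(
                    (number, f"base image '{image}' is not pinned to a version")
                )
        elif upper.startswith("RUN "):
            words = set(re.findall(r"[A-Za-z0-9+._-]+", line))
            for tool in sorted(words & DEV_TOOLS):
                findings.append(
                    (number, f"development tool '{tool}' installed in image")
                )
    return findings
```

Run against each Dockerfile in a repository, a check like this gives developers feedback before an image is built, which is cheaper than finding the same problem in a running container.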

Separation of Duties in DevOps

DevOps presents some challenges to compliance. One of the most difficult ones to address is Separation of Duties (SoD).

SoD between ops and development is designed to reduce the risk of fraud and prevent expensive mistakes and insider attacks by ensuring that individuals cannot make a change without approval or transparency. Separation of Duties is spelled out as a fundamental control in security and governance frameworks like ISO 27001, NIST 800-53, COBIT and ITIL, SSAE 16 assessments, and regulations such as SOX, GLBA, MiFID II, and PCI DSS.

Auditors look closely at SoD to ensure that requirements for data confidentiality and integrity are satisfied; that data and configuration cannot be altered by unauthorized individuals; and that confidential and sensitive data cannot be viewed by unauthorized individuals. They review change control procedures and approval gates to ensure that no single person has end-to-end control over changes to the system, that management is aware of all material changes before they are made, and that changes have been properly tested and reviewed to ensure that they do not violate regulatory requirements. They want to see audit trails to prove all of this.

Even in compliance environments that do not specifically call for SoD, strict separation is often enforced to avoid the possibility or the appearance of a conflict of interest or a failure of controls.

By breaking down silos and sharing responsibilities between developers and operations, DevOps seems to be in direct conflict with SoD. Letting developers push code and configuration changes out to production in Continuous Deployment raises red flags for auditors. However, as we’ll look at in Chapter 5, it’s possible to make the case that this can be done, as long as strict automated and manual controls and auditing are in place.
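
To make the idea of an automated control concrete, here is a minimal sketch of a pipeline gate that refuses to deploy a change unless someone other than its author has approved it and the automated tests have passed. The `ChangeRequest` type, its fields, and the `deployment_violations` function are hypothetical names invented for this example; a real pipeline would take this information from its code review and build systems.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical record of a change waiting to be deployed."""
    change_id: str
    author: str
    approvers: frozenset = field(default_factory=frozenset)
    tests_passed: bool = False


def deployment_violations(change):
    """Return the reasons a change may not be deployed (empty if none)."""
    violations = []
    # Self-approval does not count: at least one approver must differ
    # from the author, so no single person controls the change end to end.
    independent = change.approvers - {change.author}
    if not independent:
        violations.append("no approval from someone other than the author")
    if not change.tests_passed:
        violations.append("automated tests have not passed")
    return violations
```

The pipeline would call `deployment_violations` before every release and stop if the list is not empty. A gate of this kind enforces the intent of SoD on every change, rather than relying on a periodic manual review.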

Another controversial issue is granting developers access to production systems in order to help support (and sometimes even help operate) the code that they wrote, following Amazon’s “You build it, you run it” model. At the Velocity Conference in 2009, John Allspaw and Paul Hammond made strong arguments for giving developers access, or at least limited access, to production:2

Allspaw: “I believe that ops people should make sure that developers can see what’s happening on the systems without going through operations... There’s nothing worse than playing phone tag with shell commands. It’s just dumb.

Giving someone [i.e., a developer] a read-only shell account on production hardware is really low risk. Solving problems without it is too difficult.”

Hammond: “We’re not saying that every developer should have root access on every production box.”

Any developer access to a regulated system, even read-only access, raises questions and problems for regulators, compliance, infosec, and customers. To address these concerns, you need to put strong compensating controls in place:

  • Limit access to nonpublic data and configuration.

  • Review logging code carefully to ensure that logs do not contain confidential data.

  • Audit and review everything that developers do in production: every command they execute, every piece of data that they look at.

  • Put detective change control in place to track any changes to code or configuration made outside of the Continuous Delivery pipeline.

  • You might also need to worry about data exfiltration: making sure that developers can’t take data out of the system.
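
Reviewing logging code can be backed up with a safety net in the logging layer itself. The sketch below shows a log filter that redacts anything resembling a payment card number before a record is written. The `CARD_PATTERN` expression and the `RedactingFilter` class are assumptions made for illustration: the pattern is deliberately simple, will also redact other long digit sequences, and does not cover other kinds of confidential data.

```python
import logging
import re

# Simplistic pattern for 13 to 16 digits, optionally separated by single
# spaces or dashes. An assumption for illustration, not a complete detector.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


class RedactingFilter(logging.Filter):
    """Replace card-like numbers in a log record with a placeholder."""

    def filter(self, record):
        # Expand the message first so values passed as arguments are covered.
        message = record.getMessage()
        record.msg = CARD_PATTERN.sub("[REDACTED]", message)
        record.args = ()
        return True  # keep the record, now redacted
```

Attaching the filter to a handler with `handler.addFilter(RedactingFilter())` applies it to every record that handler emits. This complements code review rather than replacing it, because a pattern can only catch the formats that someone anticipated.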

These are all ugly problems to deal with, but they can be solved.

At Etsy, for example, even in PCI-regulated parts of the system, developers get read access to metrics dashboards (what Etsy calls “data porn”) and exception logs so that they can help find problems in the code that they wrote. But any changes or fixes to code or configuration are reviewed and made through their audited and automated Continuous Deployment pipeline.

Change Control

How can you prove that changes are under control if developers are pushing out changes 10 or 50 times each day to production? How does a Change Advisory Board (CAB) function in DevOps? How and when are change control and authorization performed in an environment where developers push changes directly to production? How can you prove that management was aware of all these changes before they were deployed?

ITIL change management and the associated paperwork and meetings were designed to deal with big changes that were few and far between. Big changes require you to work out operational dependencies in advance and to understand operational risks and how to mitigate them, because big, complex changes done infrequently are risky. In ITIL, smaller changes were the exception and flowed under the bar.

DevOps reverses this approach to change management, by optimizing for small and frequent changes—breaking big changes down to small incremental steps, streamlining and automating how these small changes are managed. Compliance and risk management need to change to fit with this new reality.
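
One way to automate the record keeping for many small changes is a tamper-evident change log, in which each entry includes a hash of the entry before it. The sketch below assumes that changes are described by simple dictionaries. The function names and the fields used in the usage note are hypothetical, and a real system would also need to protect where the log is stored.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(previous_hash, change):
    """Hash the previous entry's hash together with this change."""
    payload = json.dumps(change, sort_keys=True)
    return hashlib.sha256((previous_hash + payload).encode("utf-8")).hexdigest()


def append_change(log, change):
    """Append a change record that is chained to the previous entry."""
    previous_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {
        "change": change,
        "previous_hash": previous_hash,
        "hash": _entry_hash(previous_hash, change),
    }
    log.append(entry)
    return entry


def verify_chain(log):
    """Return True if no entry has been altered, removed, or reordered."""
    previous_hash = GENESIS_HASH
    for entry in log:
        if entry["previous_hash"] != previous_hash:
            return False
        if entry["hash"] != _entry_hash(previous_hash, entry["change"]):
            return False
        previous_hash = entry["hash"]
    return True
```

A pipeline could call `append_change(log, {"id": "c1", "author": "alice", "approver": "bob"})` on every deployment. An auditor can then run `verify_chain` to check that the history has not been edited after the fact, which is evidence that is hard to produce from meeting minutes alone.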

1 “AWS re:Invent 2015 | (DVO202) DevOps at Amazon: A Look at Our Tools and Processes.” https://www.youtube.com/watch?v=esEFaY0FDKc

2 http://www.kitchensoap.com/2009/06/23/slides-for-velocity-talk-2009/
