In this episode, I talk with Doug Barth, site reliability engineer at Stripe, and Evan Gilman, Doug’s former colleague from PagerDuty who is now working independently on Zero Trust networking. They are also co-authoring a book for O’Reilly on Zero Trust networks. They discuss the problems with traditional perimeter security models, rethinking trust in a networked world, and automation as an enabler.
Here are some highlights:
The problem with perimeters
Evan: The biggest issue with a perimeter model is that it tends to encourage system administrators to define as few perimeters as possible. You have your firewall, so anything out on the internet is bad, anyone on the inside is trusted, and maybe down the line you'll further segment this and add more firewalls. Maybe if you're really rigorous, you might do per-host firewalls, but in reality, most people say, ‘It's on my trusted network, so it's a trusted interaction. Why should I go through that effort? What's the value?’ The issue with that thought process is that we keep seeing bad people get behind the perimeter, time and time again. Once they get behind it, they can just do whatever they want.
Doug: The alternative is proactively figuring out how to manage the trust in your network. Whom do you trust? Why do you trust them? Do you trust them enough for the access they have? When I want to build a secure network, my goal is not to remove people's access; it's to help distribute the problem and get enough eyes on whom I trust and whether I should continue to trust them. It's a trust-but-verify approach.
Moving to Zero Trust
Evan: Shifting from a perimeter security model to Zero Trust is scary. But the good news is, we know how to do this already. We have internet-facing services, and we know how to serve up resources across the internet and secure them so that the network between you and the resource is transparent from a security perspective. VPNs famously do this. Secure Sockets Layer (SSL) websites and other similar approaches are what we consider "internet security," and we already know how to do this. In a Zero Trust approach, we just apply it across the board and use automation as a key enabler. Large migrations to a low-trust network, like Google's recent effort, involve a lot of auditing prep and very careful implementation. For instance, you need to craft policies on a case-by-case basis and turn them on in logging mode only, so you're aware of who will be blocked before you actually block them.
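The logging-first rollout Evan describes can be sketched in a few lines of Python. This is an illustrative model only, not a real enforcement engine: the `evaluate` helper and the policy shape are invented for the example.

```python
def evaluate(policies, flow, enforce=False):
    """Return True if the flow matches a policy. In logging-only mode
    (enforce=False), unmatched flows are logged but still permitted,
    so you can see who would be blocked before flipping enforcement on."""
    match = any(
        (flow["source"], flow["dest"], flow["port"]) ==
        (p["source"], p["dest"], p["port"])
        for p in policies
    )
    if match:
        return True
    if not enforce:
        print(f"WOULD BLOCK: {flow}")  # visibility before enforcement
        return True
    return False

# Hypothetical policy: the web tier may talk to the database on 5432.
policies = [{"source": "web", "dest": "db", "port": 5432}]
```

Running in logging mode first means the rule set can be vetted against real traffic with zero risk of an outage.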
Automation as an enabler
Doug: Each engineering team in a company should be able to define the security policy that their individual service needs to function. We distribute that problem across many teams, but then we push all those policies into a secure infrastructure that actually implements that policy. This isn't just a crazy idea we had. This is how I understand Google's BeyondCorp initiative to work. Google wanted to get rid of their VPNs but still have a lot of secure policies. They call what they built a ‘shared access gateway.’ They give each engineering team a domain-specific language (DSL) for defining each of their security policies, but the shared access gateway is what actually implements the policy. They layer on top of that the broad-reaching policy for the entire organization. This type of automation—the ability to programmatically define your policy and your enforcement—allows you to give people a lot more access. Once you start capturing all this policy and how it changes over time in code, you can do much more advanced security policy or security enforcement in your network.
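As a rough illustration of that layering, here is a toy Python stand-in for a policy DSL: each team declares its own rules, and a hypothetical `compile_gateway_rules` merges them beneath an org-wide layer, the way a shared access gateway might. All names and rule shapes are invented.

```python
# Hypothetical per-team policy declarations, in the spirit of a policy DSL.
team_policies = {
    "payments": [{"allow": "billing-api", "port": 443}],
    "search":   [{"allow": "index-db",   "port": 9200}],
}

# Org-wide policy that the gateway layers on top of every team's rules.
org_policy = [{"deny": "*", "port": 23}]  # e.g., telnet is never allowed

def compile_gateway_rules(team_policies, org_policy):
    """Flatten team declarations plus the org-wide layer into one ordered
    rule list; org-wide rules are evaluated first."""
    rules = [dict(r, owner="org") for r in org_policy]
    for team, decls in sorted(team_policies.items()):
        rules += [dict(r, owner=team) for r in decls]
    return rules
```

The point of the sketch is the division of labor: teams own the declarations, the gateway owns the merge order and the enforcement.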
Evan: Having that policy definition in code is something you can use to programmatically generate enforcement rules. Those enforcement rules can vary based on the underlying platform or a condition, but the key is that they’re generated by a computer. This allows you to rapidly change the enforcement rules, and paves the way to highly dynamic policy, as opposed to the more static policy we see in perimeter networks.
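One way to picture a single policy definition generating different enforcement rules per platform: the same abstract policy rendered two ways. Both renderers are illustrative sketches, not real iptables or AWS tooling.

```python
# One abstract policy, defined once in code.
policy = {"source": "web", "dest": "db", "port": 5432, "proto": "tcp"}

def to_iptables(p, src_ip, dst_ip):
    """Render the abstract policy as a host-firewall rule string."""
    return (f"iptables -A FORWARD -p {p['proto']} -s {src_ip} -d {dst_ip} "
            f"--dport {p['port']} -j ACCEPT")

def to_security_group(p):
    """Render the same policy as an AWS-security-group-style ingress rule."""
    return {"IpProtocol": p["proto"], "FromPort": p["port"],
            "ToPort": p["port"], "SourceSecurityGroup": p["source"]}
```

Because both outputs are generated from one definition, changing the policy in code changes every platform's enforcement at once, which is what makes dynamic policy practical.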
Evan: The first place to start is collecting policy and understanding what should be there. Once you know what should be there, you can understand what is there unexpectedly or what is there but should not be. You can build up this list and slowly move things from blacklist to whitelist mode saying, ‘We'll only allow these things known on this list.’ Once you can get to this whitelist mode, it becomes largely self-maintaining. At PagerDuty, we started small. First, we put regular IP policy in place, but that IP policy was going to be automated and backed by code. Once we got that in and vetted, we turned the knob up on granularity. We spread that granularity to more places and eventually turned up the encryption. It's totally acceptable to adopt one or two of the principles we're setting forth here and then add the rest when the time is appropriate. Additionally, it doesn't have to be for 100% of your infrastructure. You can start with the parts of your infrastructure that could benefit the most from it.
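A minimal sketch of that blacklist-to-whitelist transition: observe flows while in logging mode, promote repeatedly seen flows to a whitelist, and surface the rest for review. The threshold and the data are invented for illustration.

```python
from collections import Counter

# Flows observed during the logging period: (source, dest, port) tuples.
observed = [
    ("web", "db", 5432), ("web", "db", 5432),
    ("bastion", "db", 22), ("bastion", "db", 22),
    ("unknown-host", "db", 5432),  # seen once; needs investigation
]

# Flows seen repeatedly become whitelist candidates (threshold is arbitrary).
counts = Counter(observed)
whitelist = {flow for flow, n in counts.items() if n >= 2}

# Everything else is either unexpected or should not be there.
unexpected = sorted(set(observed) - whitelist)
```

Once the whitelist stabilizes, flipping to "only allow what is on this list" is a configuration change rather than a leap of faith.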
Doug: If you tilt your head the right way, you could argue that Amazon's security groups are a shade of a Zero Trust network, in that AWS users could arrange their network into nicely crafted subnets, or just start tagging hosts with certain security groups and use those to define policy. If you're on AWS, create a security group per role, then use that everywhere to define access. That will get you part of the way there, and it leaves you the option of extending out to different providers later.
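Doug's security-group-per-role idea can be modeled without touching AWS at all: tag each host with a role, then express ingress policy between roles rather than between subnets. The host IPs, role names, and ports below are all made up.

```python
# Toy model: hosts are tagged with a role, and policy references roles,
# not IP ranges -- the essence of a security group per role.
hosts = {"10.0.1.5": "web", "10.0.2.9": "db", "10.0.3.3": "worker"}

# dest role -> {source role: allowed ports}
ingress = {"db": {"web": [5432]}}

def allowed(src_ip, dst_ip, port):
    """Decide access from role tags alone; the subnet layout is irrelevant."""
    src_role, dst_role = hosts[src_ip], hosts[dst_ip]
    return port in ingress.get(dst_role, {}).get(src_role, [])
```

Because the policy never mentions an IP range, moving a host or changing providers means re-tagging, not rewriting the rules.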
When the inevitable happens
Evan: First and foremost, a Zero Trust approach dramatically slows down any potential breach, and breaches are contained in this kind of architecture because the policies are so granular and lateral movement is so limited. Another benefit of a Zero Trust network is its robust auditing and logging. When policies get changed programmatically, for example, you'll have a record of it. When a breach does happen, not only is the progression of the attack very slow, but you also have very good visibility into exactly what occurred and when and how.
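The audit trail Evan mentions falls out naturally when changes go through code. A sketch, assuming a hypothetical `change_policy` helper: every programmatic policy change appends a record of who changed what and when.

```python
import time

audit_log = []

def change_policy(policies, new_rule, actor):
    """Apply a policy change and append an audit record (illustrative only)."""
    policies.append(new_rule)
    audit_log.append({
        "ts": time.time(),      # when the change happened
        "actor": actor,         # who made it
        "action": "add",
        "rule": new_rule,       # exactly what changed
    })
    return policies
```

After an incident, the log answers "what changed, when, and by whom" without any reconstruction.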
Doug: This is a key point in this type of network design. It's not just about enforcing; it's also about continual monitoring of changes of state. You build yourself a way to detect problems and perform forensics. The ultimate benefit is creating a feedback loop in which your trust in a system is driven directly by logs of what that system is currently doing, so you can detect anomalies. If someone logs into an organization’s network from a potentially risky region of the world, their authorization level can be knocked down until they’re seen in person to validate their access. Organizations can adopt policies that require staff to visit corporate headquarters regularly; otherwise, their trust level gets knocked down. Simplistic policies like that aren't doing any fancy machine learning, but even that sort of basic additional layer of security can help.
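The simplistic, non-machine-learning policies Doug describes might look like this in code. The signals, thresholds, and tier names are all invented for illustration.

```python
def trust_score(user):
    """Toy trust scoring: start from a baseline and knock trust down for
    simple risk signals; no machine learning involved."""
    score = 100
    if user.get("login_region") in {"high-risk-region"}:
        score -= 40  # risky geography lowers trust until verified in person
    if user.get("days_since_hq_visit", 0) > 90:
        score -= 20  # hasn't been seen at headquarters recently
    return score

def authorization_level(score):
    """Map the score onto coarse access tiers."""
    if score >= 80:
        return "full"
    if score >= 50:
        return "restricted"
    return "blocked"
```

Even this crude feedback loop, trust as a function of observed behavior, adds a layer of defense a static perimeter can't.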