Chapter 1. Introduction: The Pillars of Access

Access to computing infrastructure becomes both more important and more difficult at scale. This book explores methods to make access easier, more scalable, and more secure.

Computing infrastructure has a very broad meaning, but what we mean in this book is computing resources located in cloud environments, on premises, or in colocated data centers, and even Internet of Things (IoT) devices on public networks. The definition of computing resources includes hardware such as servers, networking, and storage, as well as infrastructure software such as databases, message queues, Kubernetes clusters, monitoring dashboards, continuous integration/continuous delivery (CI/CD), and other DevOps tooling.

Controlled access to this complex panoply has historically relied on the notion of a network perimeter and user credentials, the digital versions of what people use to control access to their homes: a door and a set of keys. A user credential, such as a password or a private key, is nothing more than a secret piece of information that unlocks a specific perimeter. All these secrets are just data—like any data, they can be lost, shared, copied, stolen, or even sold. Neither a physical key nor a digital key guarantees the identity of someone attempting access. Simply being in possession of the key allows access. When access management relies on secrets, it is giving access not to a client, but to the secret itself. This makes a stolen secret very powerful.

Like most perimeter-based access control implementations, the front door lock on a house does nothing once an intruder gains access. Anyone with the single key to the house has access to all the valuables inside. Additional perimeter defenses inside the home have the same fundamental problem. Once an attacker is inside, everything is accessible.

Corporate networks that rely on secrets and perimeter defenses have the same weakness, only worse. Perimeter security based on secrets is inadequate because:

  • Keys can be stolen, lost, or shared with someone legitimate and then secretly duplicated. In other words, secrets are vulnerable to human error.

  • As infrastructure complexity increases, there can be too many entry points to protect, increasing operational overhead and leading to a growing thicket of vulnerabilities.

  • There can be many users with different access requirements for different resources, making it difficult to grant only the right access to every user.

  • An intruder who manages to gain access can easily pivot the attack to adjacent resources inside the perimeter, wreaking damage along the way.

As a company grows, a secure access model based on a perimeter and secrets does not scale. All of the weaknesses above illustrate the same pattern: an attacker exploits human error, then pivots to get everywhere.

Your computing infrastructure may consist of cloud accounts with API endpoints, virtual machines (VMs), databases, Kubernetes clusters, monitoring dashboards, and CI/CD tools. You may even have traditional data centers with the same resources running in them, and every single resource requires its own access. Configuring connectivity, authentication, authorization, and audit for every resource usually involves maintaining a myriad of configuration files, each with its own compliance requirements and syntax. As the number of resources grows, the complexity of managing access to these components becomes unsustainable—and the cost of a configuration error becomes enormous.

In this chapter, we begin the discussion of identity-native infrastructure access by showing how nearly every breach or attack relies on the very characteristics that traditional security models encourage, and by reviewing the pillars of infrastructure access. We then lay out how true identity-native infrastructure access solves these challenges by eliminating the foundation that most infrastructure attacks rely upon: human error and the attacker’s ability to pivot.

Most Attacks Are the Same

Most infrastructure attacks follow the same human error + pivot pattern:

  1. The attacker first gains a foothold into company-owned computing resources by exploiting human error.

  2. The attacker then pivots to gain access to adjacent computing systems on the same network.

While human errors can be minimized with security training, rigorous recruiting, and other processes, they cannot be eliminated entirely. Humans will reliably be humans. Humans design vulnerable web applications, click malicious email attachments, leave their laptops on the subway, and commit API keys into public Git repositories. These errors and others leave a trail of exploitable vulnerabilities.

Notice how every security-related human error revolves around some kind of secret. Passwords, private encryption keys, API keys, browser cookies, and session tokens are all hubs for human error. Every secret is a potential entry point for a malicious actor. Whenever a new secret is introduced into a computing environment, the probability of a breach increases. This probability may seem insignificant at first, and a successful breach may be unlikely in a small company; but as organizations scale, it becomes a question of when and not if.

While a secret such as a password is intended to prove the identity expressed by the username, it does no such thing. Secrets-based access assumes that only the authorized person can possess the secret, but we know this is not true. A secret confers all the benefits of identity on an entity without any way to perform true authentication or authorization. Anybody with possession of a secret can pretend to be someone else. In that sense, the common term identity theft is misleading because what actually happens is a secret theft. A true identity cannot be shared, copied, or stolen.

The probability of any single human making an error that leads to a compromised secret is relatively small, especially with a competent engineering team and a strong security culture. The introduction of robust processes and secret management solutions also reduces the probability of secret leakage to an extremely low number, but it never brings it down to zero. In practice, the difference between a low number and zero is enormous when an organization operates at scale.

To use memory corruption in a server as an analogy, the probability of a bit flip is extremely low. But as your infrastructure footprint expands and data volumes continue to grow, eventually there will be bit flips happening every minute. That’s why error-correcting code (ECC) memory is mandatory at scale: it detects and corrects bit flips, driving the probability of silent corruption down to practically zero. The same logic applies to all kinds of low-probability events at scale. In a large data center, full-time employees are hired to replace hard drives all day, despite each drive having an expected lifespan of three years or more.

Reliance on a secret for access is similar. The probability of any one human leaking any one secret by mistake may seem small, but as infrastructure and teams grow, the number of secrets and the opportunities to mishandle them multiply. As infrastructure becomes larger and more complex, the surface area of secrets becomes enormous, and a compromised secret somewhere becomes all but inevitable. That is why in a modern cloud native infrastructure the mere presence of a secret is considered a vulnerability.
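
To see how quickly small probabilities compound, consider a back-of-the-envelope calculation (the numbers below are illustrative assumptions, not measurements). Even if each secret has only a 0.1% chance of leaking in a given year, the chance that at least one of them leaks grows rapidly with the number of secrets in play:

    # Illustrative only: assumed per-secret, per-year leak probability.
    p_leak = 0.001

    for n_secrets in (10, 100, 1_000, 10_000):
        # Probability that at least one of n independent secrets leaks.
        p_at_least_one = 1 - (1 - p_leak) ** n_secrets
        print(f"{n_secrets:>6} secrets -> {p_at_least_one:.1%} chance of a leak per year")

With ten secrets the risk looks negligible; with ten thousand, a leak somewhere is close to a certainty.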

It may be tempting to reduce the risk of a leaked secret by introducing more rigid processes. Draconian security procedures, while they provide a comforting illusion of safety, make engineers less productive and create incentives for bad behavior. Hard-to-use security measures lead people to use shorter passwords, build their own backdoors into infrastructure, keep secure sessions open for longer than needed, try to minimize the security team’s involvement in decision making, and take other shortcuts. Even the people who make the most earnest attempts to follow a difficult procedure will end up making a mistake eventually.

It’s not the people who are the problem. It’s the secrets themselves. Secrets are just data; data is vulnerable to theft, loss, and copying.

Companies such as Google and other so-called hyperscalers were among the first to face this reality, and have come up with a more scalable access architecture that hinges on two crucial patterns:

  • No secrets

  • Zero Trust

This book covers these two patterns, explaining how they scale in real time with infrastructure without increasing the attack surface area or the probability of a breach.

No secrets means no passwords, keys, or tokens. Eliminating secrets helps scale secure access because without secrets, there’s nothing to compromise, so human error is no longer exploitable. Instead of relying on secrets, access is based on identity itself.

Zero Trust means that every user, device, application, and network address is inherently untrusted. There’s no perimeter because there’s no “inside” where entities are trusted. In a Zero Trust access model, every network connection is encrypted, every session must be authenticated, every client must be authorized, and the audit log is kept for every client action. With Zero Trust, every computing resource can safely run using a public IP address on an untrusted public network.

Zero Trust greatly reduces the chance of a pivot once an attacker gains control over a single machine, shrinking the “blast radius” of an attack to just the system initially compromised.
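
To make the model concrete, here is a minimal Python sketch of a Zero Trust-style request gate. The policy table, identities, and actions are invented for illustration, and a real deployment would delegate to a proper authentication mechanism and policy engine; the point is only that every request is authenticated, authorized, and audited individually, no matter where on the network it originates:

    import json
    import time
    from typing import Optional

    # Toy policy: (identity, action) pairs that are allowed. In a real system
    # this would come from a central policy engine, not a hardcoded set.
    POLICY = {("alice", "db:read"), ("alice", "db:write"), ("bob", "db:read")}

    def handle_request(identity: Optional[str], action: str, resource: str) -> bool:
        """Authenticate, authorize, and audit one request; trust nothing else."""
        authenticated = identity is not None      # stand-in for real authentication
        allowed = authenticated and (identity, action) in POLICY
        print(json.dumps({                        # every attempt is audited
            "time": int(time.time()),
            "identity": identity or "unknown",
            "action": action,
            "resource": resource,
            "allowed": allowed,
        }))
        return allowed

    handle_request("alice", "db:write", "orders")  # allowed and audited
    handle_request(None, "db:read", "orders")      # rejected: no authenticated identity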

Together, no secrets and Zero Trust help neutralize the human error + pivot pattern. With no secrets, human error doesn’t introduce vulnerabilities. With Zero Trust, there’s no “inside” to get to, so pivoting becomes meaningless. This gives us the freedom to think about access itself.

Access

Access enables people, software, and hardware to work together securely. At its heart, access is a collection of privileges or permissions that allow a subject (a client) to perform certain actions on an object (a computing resource) for a period of time. Access control means mediating requests by subjects, often users, to access objects, using rights that determine the kind of access a specific subject is granted to a specific object. The intention is to control every access event within a system, protecting data and resources from unauthorized disclosure or modifications while ensuring access by legitimate users.

Infrastructure access management is the ability to define and enforce how any client works with any resource. Managing access is the foundation of security in computer infrastructure, governing the use of all hardware and software, and how information is transferred, processed, and stored.

Generally, managed access is remote because it involves communication among different machines. In modern infrastructure at scale, it’s rare for someone to work only on a single, isolated machine. Remote access management is based on four pillars:

Secure connectivity

Secure communication over an untrusted network

Authentication

Proof of a client’s identity

Authorization

Specifying the actions a client can perform

Audit

A record of real-time and historical events

These four components ensure that the right client has the right kind of access to protected resources, and that the right people can see what’s going on. The next sections provide a quick look at why each pillar is important. Subsequent chapters go into more detail about each one.

Secure Connectivity

Secure connectivity is the first pillar of access. A secure connection must be established before authentication can take place. To access a protected resource securely, an entity must be able to exchange messages without fear of interception.

The legacy approach to connectivity relied on perimeter security, in which encryption was needed only for messages leaving the network perimeter, typically the local area network (LAN) or virtual private cloud (VPC). Anyone within the LAN or VPC was trusted. As infrastructure grows, the network becomes more complicated. Using virtual private networks (VPNs) and firewalls to stitch together perimeters to protect trusted areas becomes extremely challenging and leaves more and more holes.

Even in the best case, perimeter-based security doesn’t work because it makes you vulnerable to attacker pivots. Interestingly, security is not the only argument against the perimeter. As more and more external services need connections into private networks, firewalls are basically just speed bumps. Effectively, the perimeter died a long time ago.

That means there can be no such thing as a trusted network. This is what Zero Trust means. Encryption, authentication, authorization, and audit can’t rely on the network anymore and must shift from the network to the application layer. Requests can no longer be processed based on whether they’re on a trusted network. The network itself becomes untrusted, meaning that communication must be end-to-end encrypted at the session level.

Thankfully, the technologies for this were invented a long time ago and are used for secure communications across the internet. All of us are already using them for online banking or shopping. We simply need to properly apply the same Zero Trust principles to our computing environments inside the LAN or VPC, not just on a perimeter.
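
Mutual TLS is one such technology: the session is end-to-end encrypted and both ends present certificates, so neither side has to trust the network in between. Below is a minimal sketch of the server side using Python’s standard ssl module; the certificate and key file names are placeholders for your own CA and key material:

    import socket
    import ssl

    # Require a client certificate ("mutual TLS") so the client is
    # authenticated by the handshake, not by its network location.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.verify_mode = ssl.CERT_REQUIRED          # reject clients without a certificate
    context.load_cert_chain(certfile="server.pem", keyfile="server.key")
    context.load_verify_locations(cafile="ca.pem")   # trust only certificates from our CA

    with socket.create_server(("0.0.0.0", 8443)) as listener:
        with context.wrap_socket(listener, server_side=True) as tls_listener:
            conn, addr = tls_listener.accept()       # TLS handshake happens here
            print("authenticated peer:", conn.getpeercert().get("subject"))
            conn.close()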

Authentication

Authentication means proving identity. A person, computer, service, or other client requesting access to a protected resource must be able to prove it is who it says it is. When you see a login screen, that’s one type of authentication. Authentication must be kept separate from authorization, so that an entity’s permissions can be updated if its role changes. Authentication does not determine what an entity is allowed to do. Authentication only asserts identity.

Verifying passwords is a popular authentication method, but it’s inadequate for proving identity. After all, password-based authentication merely indicates possession of the secret itself and does not prove the bearer’s identity. Authentication must get to the heart of identity, which is a more difficult task. How do you prove the true identity of a person in the digital realm?

One attempt at proving identity is multifactor authentication (MFA), which generally combines two or three different kinds of evidence to establish proof. This pattern is sometimes called know + have + are, and often means a password (something you know), a one-time token generated by a separate device (something you have), and your biological traits (something you are). Unfortunately, common implementations of multifactor authentication simply convert the know + have pair of secrets into a session token or a browser cookie, which is just another secret with all the problems that a secret entails.
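
For illustration, here is a minimal sketch of the one-time code generation defined by RFC 6238 (TOTP), the scheme behind most authenticator apps. Notice that the “something you have” factor is itself derived from a shared secret key that both the device and the server must store; the base32 value below is the well-known documentation example, not a real credential:

    import base64
    import hashlib
    import hmac
    import struct
    import time

    def totp(shared_secret_b32: str, period: int = 30, digits: int = 6) -> str:
        """Minimal RFC 6238 TOTP: HMAC over a time-based counter, truncated to digits."""
        key = base64.b32decode(shared_secret_b32, casefold=True)
        counter = int(time.time()) // period                     # 30-second time step
        digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
        offset = digest[-1] & 0x0F                               # dynamic truncation (RFC 4226)
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(code % 10 ** digits).zfill(digits)

    print(totp("JBSWY3DPEHPK3PXP"))  # prints a 6-digit code that changes every 30 seconds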

Authentication is a hard problem, because it means translating the true identity of an entity—who a person is—into a digital form that doesn’t suffer from the same weaknesses as secrets.

Authorization

Once identity is established, authorization determines which actions a subject can perform—for example, read-only access versus full access.

Thinking about authorization, it’s easy to see why secrets-based access is inadequate without a strong tie to an identity. Your house key gives you (or anyone who possesses it) the ability to enter your home, but it is your identity that gives you authorization to do so. You can grant authorization to others, allowing them to perform specific actions. You might authorize someone to repair a leaky faucet or invite a friend to dinner. Hopefully, you’re granting these permissions based on identity rather than possession of a house key.

Authorization is separate from authentication but relies on it. Without knowing who is requesting access to a resource, it’s impossible to decide whether to grant access. Authorization consists of policy definition and policy enforcement: deciding who has access to which resources and enforcing those decisions. The matrix of entities and permissions can be very large and complex and has often been simplified by creating access groups and categorizing resources into groups with defined permissions. This simplifies policy management but increases the blast radius of a breach. Someone with a stolen credential gains access to a broad group of resources based on the role to which the credential is assigned.
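
The sketch below illustrates group-based authorization with an invented role-to-permission mapping. Grouping keeps the policy manageable, but it also shows why a single stolen credential mapped to a broad role has a large blast radius:

    # Toy role-based policy; role and resource names are invented.
    ROLE_PERMISSIONS = {
        "developer": {"staging-db:read", "logs:read"},
        "sre": {"prod-db:read", "prod-db:write", "k8s:admin", "logs:read"},
    }

    def permissions_for(roles: list[str]) -> set[str]:
        """Union of everything the given roles are allowed to do."""
        granted: set[str] = set()
        for role in roles:
            granted |= ROLE_PERMISSIONS.get(role, set())
        return granted

    print(permissions_for(["developer"]))         # small blast radius
    print(permissions_for(["developer", "sre"]))  # one broad role unlocks far more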

Audit

Audit shows which actions have been taken by every user or machine and which resources have been affected. The necessity of identifiable audit records—knowing who did what—is another reason why perimeter-based access does not scale. If you rely on a network boundary to authenticate clients, and the resources on an internal network are not protected, all users are effectively merged into a single “guest” (or worse, “admin”) identity, making the audit logs useless.

Once access shifts away from a perimeter-based approach to the resource and application level, generating more detailed and granular events, it becomes even more important to have both a real-time view and a historical record of access. In security terminology, audit usually falls under manageability and traceability; the important point is that you actually know and control what is going on in your environment.
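
A minimal sketch of what such an audit trail can look like, assuming identity-tagged, structured events (the field names and file path are illustrative): each event is streamed for the real-time view and appended to a file for the historical record.

    import json
    import time

    AUDIT_LOG = "audit.log"  # placeholder path for the historical record

    def record_access(identity: str, action: str, resource: str, allowed: bool) -> None:
        """Emit one identity-tagged audit event."""
        event = {
            "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "identity": identity,      # who
            "action": action,          # did what
            "resource": resource,      # to which resource
            "allowed": allowed,        # and whether policy permitted it
        }
        line = json.dumps(event)
        print(line)                           # real-time view
        with open(AUDIT_LOG, "a") as log:     # historical record
            log.write(line + "\n")

    record_access("alice@example.com", "db:write", "orders", allowed=True)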

Identity-native infrastructure access management provides a great deal of control over individual access privileges, but the flip side of that is the responsibility to ensure that privileges are revoked when they are no longer needed. Regular audits can help minimize the risk of privileges being assigned incorrectly or lingering beyond when they’re needed. In other words, auditing is another hedge against human error.

Having a real-time view and a historical record of access is a critical security capability. Shifting away from a perimeter-based approach to identity-native infrastructure access provides a great deal of control, because each access event can be tied back to an individual identity.

Security Versus Convenience

Security and convenience are famously at odds with each other. When we approach our house after a grocery run, we are forced to slow down, put the bags on the porch, and reach for the keys. This is hardly convenient, especially when it rains!

The inconvenience of security is even more evident in computing environments. Quite often, there are two groups of engineers involved in making decisions about remote access. On the one hand, we have software developers who need to fix bugs quickly, ship new features to customers, improve performance, and troubleshoot abnormalities—all under a tight timeline. On the other hand, there are security and compliance engineers who are primarily concerned with risk.

These two groups have wildly different incentives. Software developers don’t want security in the way because it slows them down—and in many cases the way security is measured really has nothing to do with actual security. Security and compliance engineers are more concerned with reducing risk than with how fast things get done. As a result, there’s often tension between developers and security engineers, which sometimes takes the form of open conflict. A trade-off needs to be found.

Organizations approach this in a variety of ways. Smaller technology startups err on the side of productivity, because their primary risk is not a security risk but a business risk. They may still be focused on finding product-market fit, so the speed of product iteration is more important than compliance. As they mature, the balance starts to shift toward more robust security practices and compliance enforcement, trading off some product development velocity in the process.

The industry shift to cloud computing has contributed to this dilemma. Engineers have gained more control over their infrastructure because the infrastructure itself is now provisioned with code. Oppressive security processes create incentives for engineers to implement their own shortcuts, which is easy with infrastructure-as-code provisioning. Often, management believes they have adopted solid security measures, while in reality the engineering team has devised its own ways of accessing cloud environments. The result is security theater: the appearance of security without the substance.

Therefore, we can conclude that an infrastructure access system is only secure if the engineering team actually loves using it.

Scaling Hardware, Software, and Peopleware

The definition of infrastructure is expanding. As remote work and personal devices become part of the workplace, and diverse computing environments proliferate, the surface area of what we once thought of as infrastructure has become impossibly complex. It’s no longer practical to think in terms of networks and perimeters. Think of a company like Tesla with a network of charging stations and millions of vehicles around the globe, all of them equipped with numerous CPUs, storage, and connectivity. What do they deploy software updates to? Their deployment target is planet Earth!

As infrastructure expands, we need to realize that it’s not homogeneous. There are many different kinds of resources and users, each with different roles, needs, and policies. We need to enforce different behaviors in different contexts: development, staging, test, and production, for example. We need to protect the entire software development supply chain from vulnerabilities (human error), and to limit the blast radius in case of a breach. Managing access securely in all these environments, with so many related goals and moving parts, is immensely complex.

Infrastructure has been able to scale quickly by moving from managing physical devices to using APIs for provisioning virtual devices. Because everything can be defined and managed as code (infrastructure as code [IaC]), it’s easy to scale elastically by provisioning more of whatever resource you need. Networks are dynamic and resources are completely fungible. Everything listens on its own network socket and needs access to some number of other resources. Tracking and blocking every network socket and endpoint to prevent infiltration would be impossible.

Ultimately, the difficulty in managing infrastructure access comes from scaling all three major elements of a computing environment:

Hardware

The physical components that make up the system, including servers, storage, personal computers, phones, and networking devices

Software

Containers, Kubernetes clusters, databases, monitoring systems, internal web applications, services, and clients that communicate with each other within a VPC or across networks

Peopleware

The human role in information technology, including software developers, DevOps, and security teams

All three of these elements are growing more complex as they scale. It’s common for an organization to have tens of thousands of geographically distributed servers running more and more diverse cloud computing environments that include VMs, containers, Kubernetes clusters, databases, and an army of logging and monitoring tools. This leads to access silos across these dimensions:

  • Hardware access is siloed because cloud infrastructure is accessed differently from the older environments colocated in a traditional data center.

  • Software access is siloed because databases are accessed via a VPN, Secure Shell (SSH) access goes through a series of jump hosts with private keys stored in a vault, and CI/CD tools are accessible on a public IP address and hooked to a corporate single sign-on (SSO) system.

  • Peopleware access is siloed because some teams use manually provisioned accounts with a password manager to access their systems, others use SSO, and other teams have special requirements—such as a compliance team that allows certain types of access only from an approved laptop stored in a safe.

At the same time, as teams become more distributed and elastic, relying on contractors and other outside contributors, it’s necessary to quickly provision, manage, and, importantly, deprovision access.

Figure 1-1 shows how silos inevitably begin to appear at scale. Each piece of software (represented by the shaded rectangles) has its own access method—already a silo. As the infrastructure sprawls to multiple environments, such as Amazon Web Services (AWS) and on-prem, this creates additional silos that are orthogonal to the software silos. As an organization at scale hires elastically, the contractors are likely to be segregated to their own access methods for security reasons. Even worse, different roles are sometimes forced into different access methods. The result is a multidimensional matrix of silos that makes consistent application of access policy all but impossible.

As we automate more and more tasks, the role of humans is supposed to decrease over time. To make this work, software needs the ability to communicate securely and autonomously with other software to support automated processes such as CI/CD deployments, monitoring, backups, the interactions of microservices in distributed applications, and dynamic delivery of information. Traditional security methods use tokens, cookies, and other secrets tailored to the growing number of separate tools with slightly different security protocols. This not only doesn’t scale but also provides no way to track and correct human errors when they lead to vulnerabilities and breaches.

Figure 1-1. How silos emerge as infrastructure scales

In other words, the separation between humans accessing machines and machines accessing each other creates yet another access silo: humans versus machines.

The most vulnerable component in an information system is the peopleware: users, administrators, developers, and others who access the system. Every breach can be traced back to a human error somewhere. The complexity of working with so many different access protocols, with their associated secrets and multifactor authentication procedures, leads people to take shortcuts: remaining logged in between sessions, reusing passwords, writing down secrets, and other bad behaviors. This tendency increases the probability of an exploitable mistake. Often, well-intended new security measures increase drag on the people who use them, leading them to cut corners even more to preserve some level of productivity.

The point is that a growing number of access silos, each with its own secrets, protocols, authentication, and authorization methods, and so on, leads to an unmanageable labyrinth of vulnerabilities that tend to increase as the complexity of access starts to interfere with people’s ability to be productive.

To solve the problem, it’s necessary to reduce complexity, which will not only improve the user experience but improve security by making it more manageable. While we’re at it, to remove the element of human error, it would be beneficial to move away from a secrets-based model of security. But reducing the probability of human error requires more than reducing the number of secrets. It’s also necessary to tame the complexity of configuring access for a vast array of different kinds of resources, breaking down silos by bringing hardware, software, and peopleware under a single, unified source of truth for access policy.

Unifying access control across humans, machines, and applications reduces the need for expertise to configure connectivity, authentication, authorization, and audit in all these different systems; reduces complexity overall; and makes consistent auditability possible. Reducing complexity, in turn, gets security out of the way of convenience and productivity, giving engineers fewer reasons to take shortcuts or otherwise undermine security policy.

It turns out that there is an approach that can accomplish these goals.

Identity-Native Infrastructure Access

The whole point of identity-native infrastructure access is to move away from secrets entirely. Secrets are just data, and data is vulnerable to human error. True identity is not data that can be downloaded, copied, or stolen. True identity is a characteristic of the physical world. You are you. The most difficult aspect of granting access based on identity is the problem of representing physical identity digitally. Secrets tied to usernames were a futile attempt to bring user identities into the digital realm.

The idea of using a centralized identity store was the first attempt at reducing the number of secrets within an organization. Instead of each application maintaining its own list of user accounts with accompanying logins and passwords, it makes sense to consolidate all user accounts in one database and have it somehow shared across many applications. That’s pretty much how centralized identity management (IdM) systems work. They consolidate user accounts into a single location and offer a standardized API for applications to access them. For example, when a user requests access to a web application, the application redirects the user to an IdM system like Okta or Active Directory. The IdM presents its own login to authenticate the user, transfers a representation of the user’s identity back to the user’s computer, and redirects the user back to the application they are trying to access, supplied with the user’s identity in the form of a token.

Figure 1-2 shows the SSO login sequence top to bottom:

  1. An unauthenticated user (no token, no session cookie) tries to access a web app.

  2. The web app redirects the user to an IdM such as Active Directory.

  3. The user logs into the IdM with their credentials.

  4. The IdM sends the user’s identity as a token.

  5. The user can now gain access to the web app by supplying the authentication token.

Figure 1-2. The SSO login sequence
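
In step 5, the web application must verify the token it receives before trusting the identity claims inside it. The sketch below shows one way a relying party might do this with the PyJWT library; the issuer URL, JWKS path, and audience are placeholders, and production systems typically rely on a full OIDC client library rather than hand-rolled verification:

    import jwt  # PyJWT
    from jwt import PyJWKClient

    ISSUER = "https://idp.example.com"   # placeholder identity provider
    AUDIENCE = "my-web-app"              # this application's client ID

    def verify_id_token(id_token: str) -> dict:
        """Check the token's signature and claims against the IdM's published keys."""
        jwks = PyJWKClient(f"{ISSUER}/.well-known/jwks.json")   # key location varies by IdM
        signing_key = jwks.get_signing_key_from_jwt(id_token)
        return jwt.decode(
            id_token,
            signing_key.key,
            algorithms=["RS256"],
            audience=AUDIENCE,
            issuer=ISSUER,
        )

    # claims = verify_id_token(token_from_redirect)
    # print(claims["sub"], claims.get("email"))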

It is easy to see why this approach scales better: no matter how many applications or other identity-aware resources an organization deploys, the number of secrets stays the same. Moreover, the overhead of provisioning and deprovisioning access for employees joining or leaving a team remains the same.

While IdM is a big step forward, it does not eliminate secrets entirely; it merely reduces their number. Let’s not forget about the browser cookie. Because a cookie is just another secret, it doesn’t solve the underlying problem. Steal the cookie and you can become someone else.

Another practical problem with identity management systems is that they were primarily developed with web applications in mind. The protocols commonly used with IdM are built on HTTP: Security Assertion Markup Language (SAML), OAuth 2.0, and OpenID Connect. Meanwhile, computing infrastructure relies on much older, resource-specific protocols such as SSH, Remote Desktop Protocol (RDP), MySQL, and PostgreSQL, which do not work in a browser and cannot natively be integrated with an IdM system.

Therefore, the two primary challenges in implementing identity-native infrastructure access are:

  • Moving away from storing identity as secret data

  • Finding a way to transfer identity using native infrastructure protocols

Moving identity management away from secrets means anchoring it to a real, physical-world identity by using biometric authentication for humans and hardware security modules (HSMs) for machines.

The next question is how to transfer true identity into the digital realm because an access system needs to interact with true identity somehow. The best currently available mechanism to safely transfer true identity into an access system is digital certificates. Table 1-1 shows how certificates compare to secrets as an access control mechanism.

Table 1-1. A comparison of the characteristics of certificates and secrets
                                                Certificates  Secrets
Standardized across protocols and applications  Yes           No
Vulnerability to theft                          Low           High
Management overhead                             Low           High
Identity and context metadata                   Yes           No

Certificates can be issued to machines, humans, and applications. Certificates are natively supported by common infrastructure protocols. Certificates are safer than secrets because they are far less exposed to theft and misuse. A certificate can be revoked, set to expire automatically, issued for a single use, and pinned to a specific context (intent) and network address. This makes stealing a certificate nearly pointless. The certificate chain of trust back to a certificate authority (CA) leaves only a single secret to protect, the CA’s own signing key, no matter the scale of the organization. In other words, this approach scales indefinitely without compromising security.
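
As a concrete illustration, the sketch below uses the Python cryptography library to issue a certificate that expires on its own after ten minutes and carries identity metadata in its fields. The names are invented, and a real certificate authority keeps its signing key in an HSM rather than generating it in process memory:

    import datetime

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import ec
    from cryptography.x509.oid import NameOID

    # Illustrative only: real CA keys never live in ordinary process memory.
    ca_key = ec.generate_private_key(ec.SECP256R1())
    user_key = ec.generate_private_key(ec.SECP256R1())

    now = datetime.datetime.now(datetime.timezone.utc)
    cert = (
        x509.CertificateBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "alice@example.com")]))
        .issuer_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "Example Internal CA")]))
        .public_key(user_key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(minutes=10))   # short-lived by design
        .add_extension(
            x509.SubjectAlternativeName([x509.RFC822Name("alice@example.com")]),
            critical=False,
        )
        .sign(ca_key, hashes.SHA256())
    )

    print("issued to", cert.subject.rfc4514_string())
    print("expires at", cert.not_valid_after)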

A modern, certificate-based central identity management system holds no other secrets besides the CA. The true identity of clients is stored not in a database, but in the real world as identifying physical attributes of humans and machines: human biometrics, hardware security modules, and Trusted Platform Modules (TPMs). Identity is transferred in a certificate that can be pinned to a context and limited in time. This effectively brings access into a single plane where all the pillars of access are applied uniformly based on every entity’s true identity.

This approach is the foundation not only of stronger security and more convenient access for users, but also of dealing with the challenges of scale and complexity. The approach rests on the two principles mentioned earlier: removing secrets (to eliminate human error) and Zero Trust (to make a pivot impossible if a breach occurs).

This is the approach hyperscale companies have adopted. Moving away from secrets means moving toward digital representation of the true identities of hardware, software, and peopleware. Zero Trust means not just encrypting all connections but designing all infrastructure components to be safe without a firewall—because there’s no perimeter.

The following chapters explain how it’s done, and why it doesn’t have to be painful.
