Kubernetes Operators by Joshua Wood, Jason Dobies

Chapter 1. Introducing Operators

Kubernetes makes it easy to automate the lifecycle of a stateless application, such as a static web server. A static web server retrieves files and sends them on to a visitor’s browser. Without state, any instances of an application are interchangeable. Because our example web server is not tracking state or otherwise making persistent changes, when one server instance fails, Kubernetes can simply and automatically replace it with another. These instances, each a copy of an application running on the cluster, are referred to as replicas.

How Kubernetes Works

A Kubernetes cluster is a collection of computers, called nodes. All cluster work runs on one, some, or all of a cluster’s nodes. The basic unit of work, and of replication, is the pod. A pod is a group of one or more Linux containers with common resources like networking, storage, and access to shared memory.


For more information about the Pod abstraction that isolates applications running on Kubernetes clusters, https://kubernetes.io/docs/concepts/workloads/pods/pod/ is a good starting point.

At a high level, a Kubernetes cluster can be divided into two planes. The control plane is, in simple terms, Kubernetes itself. A collection of pods comprises the control plane and implements the Kubernetes API and cluster orchestration logic.

The application, or data, plane is everything else. It is the group of nodes where application pods run. One or more nodes are usually dedicated to running applications, while one or more nodes are often sequestered to run only control plane pods. As with application pods, multiple replicas of control plane components can run on multiple controller nodes to provide redundancy.

The control plane implements control loops that repeatedly compare the desired state of the cluster to its actual state, taking action to make the actual state match the desired state when they diverge. Operators build atop and extend this behavior. This simplified schematic shows the major control plane components, with worker nodes running application workloads below.

Figure 1-1. Kubernetes Control Plane and Worker Nodes
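
To make the control loop idea concrete, here is a minimal sketch in Python of a single reconciliation pass. It is a toy model, not Kubernetes code: the Cluster class, the reconcile function, and the pod naming scheme are all assumptions invented for illustration, and the real controllers handle far more than a replica count.

```python
from dataclasses import dataclass, field
import itertools


@dataclass
class Cluster:
    """In-memory stand-in for the cluster's actual state (hypothetical)."""
    pods: list = field(default_factory=list)
    _ids: itertools.count = field(default_factory=itertools.count, repr=False)

    def start_pod(self, app):
        # Give each new pod a unique name derived from its application.
        name = f"{app}-{next(self._ids)}"
        self.pods.append(name)
        return name

    def stop_pod(self, name):
        self.pods.remove(name)


def reconcile(cluster, app, desired_replicas):
    """Run one pass of the loop: compare desired state to actual, then act."""
    actions = []
    running = [p for p in cluster.pods if p.startswith(app + "-")]
    # Too few replicas: start pods until the counts match.
    for _ in range(desired_replicas - len(running)):
        actions.append(("start", cluster.start_pod(app)))
    # Too many replicas: stop the surplus.
    for name in running[desired_replicas:]:
        cluster.stop_pod(name)
        actions.append(("stop", name))
    return actions
```

Calling reconcile repeatedly with the same desired count is harmless. Once the actual state matches, the pass returns an empty list of actions, which is the property that lets a control loop run forever.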

Example: Stateless Web Server

In the following terminal excerpt, we use the kubectl command-line Kubernetes API client to deploy such a stateless, static web server:

$ kubectl create deployment staticweb --image=joshix/caddy --replicas=1
$ kubectl get pods
NAME             READY     STATUS    RESTARTS   AGE
caddy-42-nwfm4   1/1       Running   0          23h

Next, we scale our static web server deployment from a single instance to three replicas. After declaring there should be three replicas, the cluster’s actual state differs from the desired state, and Kubernetes starts two new instances of the web server to reconcile the two.

$ kubectl scale deployment staticweb --replicas=3
$ kubectl get pods
NAME             READY     STATUS    RESTARTS   AGE
caddy-42-ls2ms   1/1       Running   0          11s
caddy-42-nwfm4   1/1       Running   0          23h
caddy-42-vfhsp   1/1       Running   0          11s

Now we delete one of the web server pods. Again, Kubernetes takes action to make the cluster’s actual state match the desired state of three replicas. It starts a new pod to replace the one we killed. In the excerpt below, the replacement pod shows a STATUS of ContainerCreating:

$ kubectl delete pod caddy-42-nwfm4
$ kubectl get pods
NAME             READY     STATUS              RESTARTS   AGE
caddy-42-ls2ms   1/1       Running             0          1m
caddy-42-tj55d   0/1       ContainerCreating   0          4s
caddy-42-vfhsp   1/1       Running             0          1m

This static site’s web server is interchangeable with any other replica, or with a new pod that replaces one of the replicas. It doesn’t store data, or maintain state in any way. Kubernetes doesn’t need to make any special arrangements to replace a failed pod, or to scale to more or fewer replicas of this server.

Stateful is Hard

Most applications have state. They also have particulars of startup, component ordering and configuration. They often have their own notion of what “cluster” means. They need to record and reliably store critical, complex, and often voluminous data, like medical records and financial transactions. Those are just three of the dimensions in which real-world applications must maintain critical state. Nevertheless, it would be ideal to manage them with uniform mechanisms throughout an application stack, especially if we can automate the operation of applications with complex storage, networking, and cluster connection requirements.

Kubernetes cannot know all about every stateful, complex, clustered application while also remaining general, adaptable and simple. It aims instead to provide a set of flexible abstractions, covering the basic application concepts of scheduling, replication, and failover automation, while providing a clean extension mechanism for more advanced or application-specific operations. Kubernetes, on its own, does not and should not know the configuration values for, say, a PostgreSQL database cluster, with its arranged memberships and necessarily stateful, persistent storage.

Operators are Software SREs

You might call Site Reliability Engineering (SRE) a philosophy, or, if you’re more grounded, a set of patterns and principles for running large systems. SRE has been a hot buzzword, and it has had a pronounced influence on industry practices. SRE is open to interpretation and application to particular circumstances, but its core is about automating systems and application administration and designing for repeatable deployments in the pursuit of increased reliability.

An Operator is like an automated Site Reliability Engineer for its application. It encodes in software the skills of an expert administrator. An Operator can manage a cluster of database servers, for example. It knows the details of configuring and managing its application, and it can install a database cluster of a declared software version and number of members. More distinctively, an Operator continues to monitor its application as it runs, and it can back up its data, recover from failures, and upgrade its application over time — automatically. Cluster users employ kubectl and other standard tools to work with Operators and the applications they manage, because Operators extend Kubernetes.

How Operators Work

Operators work by extending the Kubernetes control plane and API. In its simplest form, an Operator adds an endpoint to the Kubernetes API, called a Custom Resource (CR), along with a control plane component that monitors and maintains resources of the new type. This Operator, running in the control plane, can then take action based on the resource’s state.

Figure 1-2. Operators are Custom Controllers watching a Custom Resource

Kubernetes Custom Resources

Custom Resources are the API extension mechanism in Kubernetes. A Custom Resource Definition (CRD) defines a CR. CRDs are analogous to a schema for the Custom Resource data. Unlike members of the official API, a given CRD doesn’t exist on every Kubernetes cluster. CRDs extend the API of a particular cluster where they are defined. Custom Resources provide an endpoint for reading and writing structured data. A cluster user can interact with CRs with kubectl or another Kubernetes client, just like any other API resource.
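
The schema analogy can be sketched in Python by writing a CRD and a matching CR as plain dictionaries. The top-level fields follow the apiextensions.k8s.io/v1 CustomResourceDefinition layout, but the StaticSite kind, the example.com group, and the validate_spec helper are hypothetical, made up for this illustration. The check is deliberately a toy; the API server's real validation is much richer.

```python
# Hypothetical CRD for a "StaticSite" kind in the made-up group example.com.
STATIC_SITE_CRD = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "staticsites.example.com"},  # <plural>.<group>
    "spec": {
        "group": "example.com",
        "scope": "Namespaced",
        "names": {"kind": "StaticSite", "plural": "staticsites",
                  "singular": "staticsite"},
        "versions": [{
            "name": "v1alpha1",
            "served": True,
            "storage": True,
            "schema": {"openAPIV3Schema": {
                "type": "object",
                "properties": {"spec": {
                    "type": "object",
                    "required": ["replicas"],
                    "properties": {
                        "replicas": {"type": "integer"},
                        "image": {"type": "string"},
                    },
                }},
            }},
        }],
    },
}

# A Custom Resource: one instance of the kind the CRD defines.
EXAMPLE_SITE = {
    "apiVersion": "example.com/v1alpha1",
    "kind": "StaticSite",
    "metadata": {"name": "docs"},
    "spec": {"replicas": 3, "image": "joshix/caddy"},
}

_PY_TYPES = {"integer": int, "string": str}


def validate_spec(crd, resource):
    """Return a list of problems in resource["spec"]; empty means valid.

    Toy check: only required fields and integer/string types are covered,
    and fields the schema does not mention are ignored.
    """
    schema = crd["spec"]["versions"][0]["schema"]["openAPIV3Schema"]
    spec_schema = schema["properties"]["spec"]
    spec = resource.get("spec", {})
    problems = [f"missing required field: {name}"
                for name in spec_schema.get("required", []) if name not in spec]
    for name, value in spec.items():
        prop = spec_schema["properties"].get(name)
        if prop is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[prop["type"]]):
            problems.append(f"wrong type for {name}: expected {prop['type']}")
    return problems
```

The CRD plays the role of the schema and EXAMPLE_SITE the role of a record that conforms to it. On a cluster where this CRD had been created, a user could list such resources with kubectl get staticsites, just like any built-in resource type.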

How Operators are Made

Kubernetes compares a set of resources to reality; that is, the running state of the cluster. It takes actions to make reality match the desired state described by those resources. Operators extend that pattern to specific applications on specific clusters. In the simplest terms, an Operator is a custom Kubernetes Controller watching a Custom Resource type and taking application-specific actions to make reality match the spec in that resource.

Therefore, at the highest level, making an Operator means creating a CRD and providing a program that runs in a loop watching CRs of that kind. What the Operator does in response to changes in the CR is specific to the application the Operator manages. The actions an Operator performs can include almost anything: scaling a complex app, upgrading application versions, or even managing kernel modules for nodes in a computational cluster with specialized hardware.
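
The shape of that watching program can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a real Operator: run_operator and its two handler callbacks are hypothetical names, and a finite in-memory queue stands in for the long-lived watch a real Operator holds open against the Kubernetes API. The ADDED, MODIFIED, and DELETED event types are the ones Kubernetes watches report.

```python
from collections import deque


def run_operator(events, handle_change, handle_delete):
    """Drain a queue of (event_type, custom_resource) pairs and dispatch
    each one to application-specific logic supplied by the caller."""
    observed = {}  # CR name -> last spec acted on for that resource
    while events:
        event_type, resource = events.popleft()
        name = resource["metadata"]["name"]
        if event_type in ("ADDED", "MODIFIED"):
            # Act only when the declared spec actually changed.
            if observed.get(name) != resource["spec"]:
                observed[name] = resource["spec"]
                handle_change(name, resource["spec"])
        elif event_type == "DELETED":
            observed.pop(name, None)
            handle_delete(name)
    return observed
```

Everything application-specific lives in the two handlers. For a database Operator, handle_change might resize the cluster or start an upgrade, while handle_delete might clean up the pods and storage the Operator created.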

Example: The etcd Operator

Etcd is a distributed key-value store. In other words, etcd is a kind of lightweight database cluster. An etcd cluster usually requires a knowledgeable administrator to manage it. An etcd administrator must know how to:

  • Join a new node to an etcd cluster, including configuring its endpoints, making connections to persistent storage, and making existing members aware of it

  • Back up the cluster data and configuration

  • Upgrade the cluster to new etcd versions

The etcd Operator knows how to perform those tasks.

The Case of the Missing Member

The etcd Operator can recover from an etcd cluster member’s failure in the same way Kubernetes replaced our deleted stateless web server pod. Assume we have a three-member etcd cluster managed on Kubernetes by the etcd Operator. We can see the Operator itself and the cluster members running as Pods:

$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
etcd-operator-6f44498865-lv7b9    1/1       Running   0          1h
example-etcd-cluster-cpnwr62qgl   1/1       Running   0          1h
example-etcd-cluster-fff78tmpxr   1/1       Running   0          1h
example-etcd-cluster-lrlk7xwb2k   1/1       Running   0          1h

We pick our least favorite etcd pod and kill it off. The Operator knows how to recover to the desired state of three replicas. Unlike the blank-slate restart of a stateless web server, the Operator has to arrange the new etcd pod’s cluster membership, configuring it for the existing endpoints and establishing it with the remaining etcd members.

$ kubectl delete pod example-etcd-cluster-cpnwr62qgl
$ kubectl get pods
NAME                              READY     STATUS            RESTARTS   AGE
example-etcd-cluster-fff78tmpxr   1/1       Running           0          1h
example-etcd-cluster-lrlk7xwb2k   1/1       Running           0          1h
example-etcd-cluster-r6cb8g2qqw   0/1       PodInitializing   0          4s

We see the replacement pod in the PodInitializing state as it joins the etcd cluster.

The etcd API remains available to clients as the Operator repairs the etcd cluster. In Chapter 2, we’ll deploy the etcd Operator and put it through its paces in more detail, while we use the etcd API to read and write data. For now, it’s worth remembering that adding a member to a running etcd cluster isn’t as simple as just running a new etcd pod. The etcd Operator has hidden that complexity and automatically healed the etcd cluster for us.
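
To see why replacing an etcd member takes more than starting a pod, consider this Python sketch of the ordering involved. It is a toy model and not the etcd Operator's actual code: EtcdCluster, peer_url, and replace_failed_members are hypothetical names, the peer URL scheme is made up, and the returned steps are plain descriptions rather than literal commands. The sequence it models is etcd's documented approach to membership changes: remove the failed member, register the new one, then start it configured to join the existing cluster.

```python
from dataclasses import dataclass, field


@dataclass
class EtcdCluster:
    """Toy model of etcd membership (hypothetical, for illustration)."""
    members: dict = field(default_factory=dict)  # member name -> peer URL
    running: set = field(default_factory=set)    # names of pods that are up


def peer_url(name):
    # Made-up URL scheme; 2380 is etcd's conventional peer port.
    return f"http://{name}:2380"


def replace_failed_members(cluster, desired_size, new_names):
    """Restore the cluster to desired_size; return the steps taken, in order.

    Does not handle bootstrapping a brand-new cluster.
    """
    steps = []
    spare_names = iter(new_names)
    healthy = set(cluster.members) & cluster.running
    # Membership changes need a majority of the current members to agree.
    if len(healthy) <= len(cluster.members) // 2:
        raise RuntimeError("quorum lost; membership cannot be changed")
    # 1. Drop members whose pods are gone from the membership list.
    for name in sorted(set(cluster.members) - cluster.running):
        del cluster.members[name]
        steps.append(f"remove member {name}")
    # 2. For each missing member, register it first, then start its pod
    #    set to join the existing cluster rather than bootstrap a new one.
    while len(cluster.members) < desired_size:
        name = next(spare_names)
        cluster.members[name] = peer_url(name)
        steps.append(f"add member {name} at {peer_url(name)}")
        cluster.running.add(name)
        steps.append(f"start pod {name} with initial-cluster-state=existing")
    return steps
```

Compare this with the stateless case, where restoring a replica is a single step with no ordering constraints. Here the order matters, and that kind of application knowledge is what an Operator encodes.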

Who are Operators for?

Operators can make life easier for folks at three levels of the ecosystem. First, the Operator pattern arose in response to systems integrators and developers wanting to extend Kubernetes at a system level. Operators have a large and growing role behind the scenes in OpenShift, for example. Just run kubectl get crd against an OpenShift 4 cluster to get an idea of how important Operators are to providing OpenShift PaaS and management features atop the Kubernetes core.

Second, Operators make it easier for cluster administrators to enable, and developers to use, foundation software pieces like databases and storage systems that might have their own management overhead. If the “killernewdb” database server that’s perfect for your application’s backend has an Operator to manage it, you can deploy killernewdb without becoming an expert killernewdb DBA.

Third, and for the same reason, application developers will want to use the Operator pattern to build an installer, monitor, and upgrade agent that makes it much easier for customers to deploy their application on scalable, reliable clusters.

Operator Adoption

The Operator pattern has been adopted by a wide variety of developers and companies, and many Operators are already available that make it easier to use key services as components of your applications. The CrunchyData team has developed an Operator that manages PostgreSQL clusters. There are popular Operators for MongoDB and Redis. We’ll run and test some of these later in the book.

Moreover, Kubernetes-based distributions like Red Hat’s OpenShift use Operators to run the added features they layer on a Kubernetes core, keeping the OpenShift web console available and up to date, for example. On the user side, OpenShift has added mechanisms for point-and-click Operator installation and use in the web console, and for Operator developers to hook into the Operator Hub.

The promise of Operators is the ability to consume managed services of cloud-like scale, automation, and reliability, but on Kubernetes clusters running in your own data centers, in disconnected airgaps, or wherever.

Let’s Get Going

Operators need a Kubernetes or OpenShift cluster to run on, and in the next chapter we’ll talk about how to get access to a cluster, whether it’s a local virtual Kubernetes on your laptop or an external service offering. Once you have admin rights to a Kubernetes cluster, we’ll deploy the etcd Operator and show how it manages an etcd cluster on your behalf.
