Chapter 4. Data Protection for Kubernetes

Data protection encompasses a broad array of practices and concepts including high availability, backup, disaster recovery, and other processes that support business continuity. Every enterprise maintains and tests data protection policies and programs to minimize downtime and ensure that operations can continue after a disruption. Over the past several years, data protection has also become an important component of compliance. In addition, data protection strategies must provide data privacy as regulations begin to address the substantial amount of sensitive personal data that companies handle from day to day.

Kubernetes Data Protection Challenges

Traditional data protection was focused at the level of a physical or virtual machine, protecting applications and data by securing the server itself. This approach is effective for applications that run on a single host. For containerized applications, however, protection at the server level is insufficiently granular. Targeting the entire server makes it impossible to separate applications, storage, and configuration, and commingles applications that require different data protection policies.

Only by providing data protection at the container level is it possible to apply policies by application, container, or individual unit of Kubernetes storage. In a Kubernetes environment, data protection must be available at the Kubernetes resource level and must include authentication and authorization for distributed data layers throughout the cluster. Backup and disaster recovery must be integrated into the container orchestration environment itself.

Because each application is complex, and because of the large number of applications running on an enterprise cluster, manual backup strategies are impractical. The only way forward is automation; but here, too, complexity reigns.

Scale

Containerized applications are all about scale, including large amounts of data that must be processed at high velocity. Data at scale requires backup solutions that can gather data from a wide array of sources in diverse environments, managing and aggregating them efficiently. These needs include the ability to back up multinode and multicontainer applications, including not just the data but application configuration and state as well.

Distributed Architecture

Cloud native architecture makes data protection challenging. Because applications comprise loosely coupled microservices managed in containers, traditional machine-focused data protection strategies don’t suffice. Traditional backup solutions simply capture the entire state of the VM where an application runs. With distributed, containerized applications this doesn’t work, because the state itself is distributed. Figure 4-1 shows the difference between traditional applications running on VMs and cloud native, containerized applications running on Kubernetes.

Figure 4-1. Traditional applications versus containerized applications

Cloud native applications are highly distributed, often across multiple clouds in multiple geographical locations. Containerized application deployments frequently span public and private clouds, sometimes on premises as well. Data protection in these environments must be container aware and able to integrate with container orchestration frameworks.

Namespaces

Kubernetes namespaces are a mechanism for partitioning a single cluster into different isolated groups, either to allocate resources to different business units or to group applications together. An IT admin might need to back up all applications running in a particular namespace at the same time. Because a namespace can contain a large number of pods, it is often not practical to attempt this manually. But traditional backup tools have no awareness of the Kubernetes namespace model and can’t integrate with the Kubernetes API, leaving no alternative but to back up every machine manually.
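
A namespace-aware tool can instead select everything that belongs to a namespace in one pass. The following Python sketch illustrates the idea against a small in-memory inventory. The Resource type, the select_for_backup function, and the sample names are hypothetical stand-ins for what a real tool would read from the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    kind: str       # for example "Deployment" or "PersistentVolumeClaim"
    name: str
    namespace: str


def select_for_backup(inventory, namespace):
    """Return every resource in `namespace`, grouped by kind."""
    grouped = {}
    for res in inventory:
        if res.namespace == namespace:
            grouped.setdefault(res.kind, []).append(res.name)
    return {kind: sorted(names) for kind, names in grouped.items()}


# Hypothetical inventory spanning two namespaces.
inventory = [
    Resource("Deployment", "orders-api", "shop"),
    Resource("PersistentVolumeClaim", "orders-db-data", "shop"),
    Resource("ConfigMap", "orders-config", "shop"),
    Resource("Deployment", "ci-runner", "build"),
]

plan = select_for_backup(inventory, "shop")
```

Here plan holds the Deployment, PersistentVolumeClaim, and ConfigMap from the shop namespace and nothing from build, so configuration and storage are captured together with the workload.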

Recovery Point Objective and Recovery Time Objective

Some applications work with data that is critical in the moment. Other applications might require long-term data fidelity while tolerating momentary interruptions, or might work with data that is neither time sensitive nor mission critical. It is sensible to treat these applications differently, setting policies for each based on compliance, importance, and other aspects of their significance to the business.

Each application, and each type of data, might need its own data protection strategy, with its own recovery point objective (RPO) and recovery time objective (RTO):

  • The RPO is the maximum amount of recent data, measured in time, that the business can afford to lose; it determines backup frequency.

  • The RTO is the maximum amount of time allowed to resume operations after a disruption.

The RPO and RTO policies are determined by the potential impact to the business of downtime, and how much data the business can realistically afford to lose, based on the application and type of data concerned.
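
As a rough illustration, the two objectives can be checked with simple arithmetic. This Python sketch assumes a simplified model that is not taken from any particular tool: the worst-case data loss equals one full backup interval, and recovery time is the sum of detection, restore, and verification. The function names and sample durations are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure strikes just before the next backup runs,
    so up to one full interval of data can be lost."""
    return backup_interval <= rpo


def meets_rto(detect: timedelta, restore: timedelta, verify: timedelta,
              rto: timedelta) -> bool:
    """Recovery time is modeled as detection + restore + verification."""
    return detect + restore + verify <= rto


# Hypothetical targets: a 4-hour RPO and a 1-hour RTO.
hourly_ok = meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4))
daily_ok = meets_rpo(timedelta(hours=24), rpo=timedelta(hours=4))
recovery_ok = meets_rto(timedelta(minutes=5), timedelta(minutes=40),
                        timedelta(minutes=10), rto=timedelta(hours=1))
```

Under these assumptions an hourly backup satisfies the 4-hour RPO, a daily backup does not, and a 55-minute recovery fits within the 1-hour RTO.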

In a distributed application at scale, providing high availability and recovery to meet a specific RPO and RTO often means maintaining extra data clusters or moving high volumes of data very quickly. To meet these goals in a modern, cloud native environment, you need automated and application-aware tools that can work with Kubernetes abstractions and APIs, protecting data at the container level.

The continuous integration/continuous delivery (CI/CD) pipeline is a good example of an environment that can tolerate a moderate RPO and RTO. Development tools are important, but they don’t immediately impact the customer experience. For these applications, some data loss or interruption to workflow is acceptable.

Data protection must be stricter for applications the customer touches directly, especially those that handle sensitive customer data. For these applications, both RPO and RTO are important. It’s key to be able to recover data to a very recent point, and to resolve interruptions quickly. This sometimes requires disaster recovery across data centers, or failover from one active data center to another.

The most critical applications are those that can’t afford any data loss and can tolerate only very brief interruptions: transaction processing tools, for example. These applications require a very short RTO and an RPO of zero, which are demanding requirements to meet. Meeting them takes a combination of high availability within each cluster, backup and recovery with a compliance focus, and multiple data centers kept in sync to enable high availability and failover across geographical areas.
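
The three tiers described above could be captured as per-application policies. The Python sketch below is illustrative only: the ProtectionPolicy type, the tier names, and every duration are hypothetical. The one rule it encodes is that an RPO of zero leaves no room for unreplicated writes and therefore implies synchronous replication.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ProtectionPolicy:
    rpo: timedelta
    rto: timedelta

    @property
    def replication(self) -> str:
        # An RPO of zero cannot tolerate any unreplicated write,
        # so it calls for synchronous replication.
        return "synchronous" if self.rpo == timedelta(0) else "asynchronous"


# Hypothetical targets, chosen only to show the ordering of the tiers.
POLICIES = {
    "ci-cd": ProtectionPolicy(rpo=timedelta(hours=24), rto=timedelta(hours=8)),
    "customer-facing": ProtectionPolicy(rpo=timedelta(minutes=15),
                                        rto=timedelta(minutes=30)),
    "transactions": ProtectionPolicy(rpo=timedelta(0),
                                     rto=timedelta(minutes=1)),
}

modes = {tier: policy.replication for tier, policy in POLICIES.items()}
```

Keeping the objectives in one structure per application makes it straightforward for an automated tool to derive backup schedules and replication modes from business requirements rather than from per-machine settings.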

Data Protection by Application Type

Different applications not only store different data types and formats, but also employ different data policies. Database applications, for example, store the bulk of their data in tables, but also maintain application state, write application logs, and consume configuration files. For a traditional application running on a VM, the solution is easy: a backup of the VM preserves everything, including application state. For distributed applications, this is not the case. Backup, disaster recovery, and other data protection processes for Kubernetes must be aware of the specific needs of applications and APIs rather than taking a generic approach.

Because application state is distributed, backup solutions must be aware of how each application uses data, and must be able to find and protect all the different kinds of data the application uses. For example, an application-aware backup tool might know that a particular application keeps a queue of pending writes in a cache, and might ask the application to write them to disk before taking a snapshot so that it can capture a complete view of application state.
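
The pending-writes example can be sketched in a few lines of Python. The QueueingApp class and both snapshot functions are hypothetical; a real tool would invoke whatever flush or quiesce mechanism the application actually exposes.

```python
class QueueingApp:
    """Toy application that caches writes in memory before persisting them."""

    def __init__(self):
        self.disk = {}      # durable state that a volume snapshot would see
        self.pending = []   # writes cached in memory, not yet on disk

    def write(self, key, value):
        self.pending.append((key, value))

    def flush(self):
        for key, value in self.pending:
            self.disk[key] = value
        self.pending.clear()


def naive_snapshot(app):
    # Copies only what is on disk; cached writes are silently missed.
    return dict(app.disk)


def app_consistent_snapshot(app):
    # Ask the application to persist its cache first, then copy the disk.
    app.flush()
    return dict(app.disk)


app = QueueingApp()
app.write("order-1", "paid")
before = naive_snapshot(app)
after = app_consistent_snapshot(app)
```

In this sketch before is empty because the write is still in the cache, while after contains order-1. The difference between the two snapshots is exactly the state that a generic, application-unaware backup would lose.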

Strategies for Kubernetes Data Protection

A Kubernetes data protection strategy must provide high availability, backup and recovery, and failover both within and across data centers. These functions must be automated, application aware, and scaled for the enterprise. Data protection must be granular to the container level and aware of Kubernetes topology, storage, and abstractions. Finally, any solution must be able to support the different applications, business requirements, data, and SLAs an enterprise might require.

Container-Aware Backup and Recovery

Traditional backup and recovery methods protect applications and data at the machine level. By backing up the machine, it is possible to restore the previous running state of the applications on the machine. This approach works well for an application that runs discretely on a single host, but it doesn’t work for a microservices-based application distributed in containers that span multiple nodes in a cluster. For this reason, you must be able to locate the application’s data across the cluster to create an application-consistent backup, meaning that the process makes copies of all the application’s volumes and state at once. Application-consistent backups require domain-specific knowledge of the application to locate all volumes and capture application state properly. Failure to back up data in an application-aware way can lead to data corruption and loss.

A Kubernetes backup tool must be able to integrate with the Kubernetes API, be aware of Kubernetes compute and storage resources, and have the ability to map cluster topology so that it can back up groups of pods or entire namespaces. Finally, a Kubernetes backup tool must be aware of the different requirements for different applications and be able to capture and restore not just persistent storage but application state and configuration as well. In other words, the system must know how to treat the different Kubernetes objects that make up the application, rather than attempting to back up the nodes themselves.

As cloud native architecture takes hold, data protection is moving from IT to a shared responsibility among multiple teams, including application owners. Where traditional backup was centralized, container-aware backup provides application owners with role-based self-service capabilities, including the ability to set their own backup policies and rules to ensure backups are application consistent.

Containerized applications are designed for scale, and backup solutions must be able to scale with them. A single application can scale to thousands of objects, and an enterprise might run hundreds or thousands of such applications. A Kubernetes backup solution must handle many thousands of objects and storage volumes.

Data Protection Within a Single Data Center

Data protection in a single Kubernetes cluster entails ensuring high availability of the Kubernetes components and applications, and maintaining the appropriate data replication. This mainly means setting up Kubernetes to avoid a single point of failure for any service or volume.

Replication strategies vary by application type and business requirements. Simple applications, which don’t handle their own replication, rely on the underlying storage layer to be available without fail. When the data those applications handle is ephemeral and of low value, it may be acceptable to keep only one copy, on the assumption that failure of the node where the volume resides will not have much impact. For more important data, of course, the storage layer should be configured to provide adequate replication, both for data protection and for availability. For data that is important and in high demand, more replicas across more clusters can serve a larger number of clients with lower latency. Some applications handle their own replication. For these applications, the job of the storage is mainly to provide replacement storage in the event that a volume (or its node) fails.
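
To make the single-point-of-failure concern concrete, the following Python sketch checks whether a volume remains available after node failures. It assumes a deliberately simple model in which a volume is readable as long as at least one replica sits on a healthy node; the function names, volume names, and node names are hypothetical.

```python
def volume_available(replica_nodes, failed_nodes):
    """A volume stays readable while at least one replica is on a healthy node."""
    return any(node not in failed_nodes for node in replica_nodes)


def single_points_of_failure(volumes):
    """Return the names of volumes whose replicas all live on one node."""
    return sorted(name for name, nodes in volumes.items()
                  if len(set(nodes)) < 2)


# Hypothetical layout: one ephemeral volume and one replicated volume.
volumes = {
    "scratch-cache": ["node-1"],
    "orders-db": ["node-1", "node-2", "node-3"],
}

at_risk = single_points_of_failure(volumes)
```

With this layout, losing node-1 takes scratch-cache offline but leaves orders-db available, and at_risk lists only scratch-cache. Whether that risk is acceptable depends on the value of the data, as described above.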

Disaster Recovery Across Data Centers

A comprehensive data protection strategy should complement protection within the data center with failover and disaster recovery to secondary or standby clusters located in different data centers. In this scenario, an active cluster replicates data and configuration to a standby cluster using a timeline determined by RPO and RTO requirements, keeping it in sync so that it can take over if the active cluster fails. Figure 4-2 shows replication from one data center to another.

Figure 4-2. Replication from an active cluster to a standby cluster in a separate data center

When replicating data to another data center, it’s important to be aware of the receiving cluster’s topology so that replicas are distributed appropriately among nodes, for both performance and security reasons. The replicated data must provide highly available, performant access while guarding against loss in case of node failures.
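
One simple way to respect the receiving cluster’s topology is to spread replicas across failure domains before reusing any of them. The Python sketch below uses a round-robin pass over zones; the place_replicas function and the zone and node names are hypothetical, and a production scheduler would also weigh capacity, load, and security constraints.

```python
def place_replicas(nodes_by_zone, replicas):
    """Pick (zone, node) pairs, spreading replicas across zones round-robin."""
    zones = sorted(nodes_by_zone)
    queues = {zone: sorted(nodes_by_zone[zone]) for zone in zones}
    placement = []
    while len(placement) < replicas:
        progressed = False
        for zone in zones:
            if queues[zone] and len(placement) < replicas:
                placement.append((zone, queues[zone].pop(0)))
                progressed = True
        if not progressed:
            # Every zone is exhausted before the request was satisfied.
            raise ValueError("not enough nodes for the requested replicas")
    return placement


# Hypothetical standby cluster with three zones.
standby_nodes = {
    "zone-a": ["n1", "n2"],
    "zone-b": ["n3"],
    "zone-c": ["n4"],
}

placement = place_replicas(standby_nodes, 3)
```

For three replicas, each lands in a different zone, so the loss of any single zone or node leaves two copies intact. A second node in the same zone is used only after every zone already holds a replica.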

There are two strategies for keeping the data centers in sync: synchronous replication, for data that requires stringent RPO and RTO targets or immediate failover; and asynchronous replication, for data that can tolerate some loss or unavailability.

Synchronous replication ensures that all of the data written to the active cluster is replicated in the standby cluster. A write to the active cluster is considered complete only when the write to the standby cluster is complete as well. This approach is challenging because it requires very low latency to maintain write performance, but the benefit is that it enables an RPO of zero to support mission-critical data such as transactions. In environments where sufficiently low latency is impossible, you must be able to set migration policies at the level of individual volumes and objects, allowing less critical data to replicate asynchronously to save bandwidth.
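
The completion rule for synchronous replication can be sketched as follows. This Python example is a toy model, not a real replication protocol: the Cluster class and synchronous_write function are hypothetical, and rolling back the active write when the standby cannot confirm is just one possible way to handle failure.

```python
_MISSING = object()


class Cluster:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.data = {}

    def persist(self, key, value):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def synchronous_write(active, standby, key, value):
    """Acknowledge a write only after both clusters have persisted it."""
    previous = active.data.get(key, _MISSING)
    active.persist(key, value)
    try:
        standby.persist(key, value)
    except ConnectionError:
        # The standby could not confirm, so undo the active write
        # and report failure instead of acknowledging.
        if previous is _MISSING:
            del active.data[key]
        else:
            active.data[key] = previous
        raise
    return True


active = Cluster("active")
standby = Cluster("standby")
acknowledged = synchronous_write(active, standby, "tx-1", 100)
```

Because the caller waits for the standby on every write, the round trip between data centers is added to each operation, which is why latency between sites determines whether this approach is practical.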

Asynchronous replication copies data from one cluster to another on a schedule, which doesn’t guarantee that the clusters are completely in sync but saves network bandwidth. Applications that can tolerate moderate amounts of data loss or downtime can use asynchronous replication to achieve the appropriate RPO and RTO. The RPO depends on the recency of data that is guaranteed to be copied from the active cluster to the standby cluster. The RTO depends on how much time it takes to restore the application to full functioning if the active cluster becomes unavailable.
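
A small Python sketch shows why the achievable RPO follows from the replication schedule. The AsyncReplicator class is a hypothetical in-memory model: writes are acknowledged as soon as the active side has them, and anything written after the last sync is lost on failover.

```python
class AsyncReplicator:
    def __init__(self):
        self.active = {}
        self.standby = {}
        self.unreplicated = []   # keys written since the last sync

    def write(self, key, value):
        # Acknowledged immediately, without waiting for the standby.
        self.active[key] = value
        self.unreplicated.append(key)

    def sync(self):
        """Scheduled job: copy everything written since the previous run."""
        for key in self.unreplicated:
            self.standby[key] = self.active[key]
        self.unreplicated.clear()

    def failover(self):
        """Return the standby's view and the keys that never reached it."""
        return dict(self.standby), list(self.unreplicated)


replicator = AsyncReplicator()
replicator.write("a", 1)
replicator.sync()
replicator.write("b", 2)          # written after the last scheduled sync
recovered, lost = replicator.failover()
```

After failover, recovered contains only a, and lost contains b. Shortening the interval between sync runs shrinks the window in which writes like b can be lost, at the cost of more network traffic.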
