Chapter 1. Introduction to Cloud Native Data Infrastructure: Persistence, Streaming, and Batch Analytics

Do you work on solving data problems and find yourself facing the need to modernize? Is your cloud native application limited to the use of microservices and a service mesh? If you deploy applications on Kubernetes (sometimes abbreviated as “K8s”) without including data, you haven’t fully embraced cloud native. Every element of your application should embody the cloud native principles of scale, elasticity, self-healing, and observability, including how you handle data.

Engineers who work with data are primarily concerned with stateful services, and this will be our focus: increasing your skills to manage data in Kubernetes. Our goal in this book is to enrich your journey to cloud native data. If you are just starting with cloud native applications, there is no better time to include every aspect of the stack. This convergence is the future of how we will consume cloud resources.

So, what is this future we are creating together?

For too long, data has lived outside of Kubernetes, creating a lot of extra effort and complexity. We will get into valid reasons for this, but now is the time to combine the entire stack to build applications faster, at the needed scale. Based on current technology, this is very much possible. We’ve moved away from the past of deploying individual servers and toward the future where we will be able to deploy entire virtual datacenters. Development cycles that once took months and years can now be managed in days and weeks. Open source components can now be combined into a single deployment on Kubernetes that is portable from your laptop to the largest cloud provider.

The open source contribution isn’t a tiny part of this, either. Kubernetes and the projects we discuss in this book are under the Apache License 2.0 unless otherwise noted, and for a good reason. If we build infrastructure that can run anywhere, we need a license model that gives us the freedom of choice. Open source is both free-as-in-beer and free-as-in-freedom, and both count when building cloud native applications on Kubernetes. Open source has been the fuel of many revolutions in infrastructure, and this is no exception.

That’s what we are building: the near future reality of fully realized Kubernetes applications. The final component is the most important, and that is you. As a reader of this book, you are one of the people who will create this future. Creating is what we do as engineers. We continuously reinvent the way we deploy complicated infrastructure to respond to increased demand. When the first electronic database system was put online in 1960 for American Airlines, a small army of engineers made sure that it stayed online and worked around the clock. Progress took us from mainframes to minicomputers, to microcomputers, and eventually to the fleet management we do today. Now, that same progression is continuing into cloud native and Kubernetes.

This chapter will examine the components of cloud native applications, the challenges of running stateful workloads, and the essential areas covered in this book. To get started, let’s turn to the building blocks that make up data infrastructure.

Infrastructure Types

In the past 20 years, the approach to infrastructure has slowly forked into two areas that reflect how we deploy distributed applications (as shown in Figure 1-1):

Stateless services
These are services that maintain information only for the immediate lifecycle of the active request—for example, a service for sending formatted shopping cart information to a mobile client. A typical example is an application server that performs the business logic for the shopping cart. However, the information about the shopping cart contents resides external to these services. They need to be online for only a short duration from request to response. The infrastructure used to provide the service can easily grow and shrink with little impact on the overall application, scaling compute and network resources on demand when needed. Since we are not storing critical data in the individual service, that data can be created and destroyed quickly, with little coordination. Stateless services are a crucial architecture element in distributed systems.
Stateful services
These services need to maintain information from one request to the next. Disks and memory store data for use across multiple requests. An example is a database or filesystem. Scaling stateful services is more complex since the information typically requires replication for high availability. This creates the need for consistency and mechanisms to keep data in sync between replicas. These services usually have different scaling methods, both vertical and horizontal. As a result, they require different sets of operational tasks than stateless services. The sketch following Figure 1-1 makes this distinction concrete in code.
Figure 1-1. Stateless versus stateful services
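To make the distinction concrete, here is a minimal sketch in plain Python; the shopping-cart example, class names, and external store are hypothetical. The stateless formatter keeps nothing between requests and can be created or destroyed freely, while the stateful store must hold data across requests and is therefore the part that needs disks, replication, and careful operations.

    # A minimal sketch contrasting stateless and stateful services in plain Python.
    # The shopping-cart example, class names, and external store are hypothetical.

    class StatelessCartFormatter:
        """Formats cart data for a mobile client; holds no state between requests."""

        def __init__(self, cart_store):
            # The actual cart contents live in an external, stateful service.
            self.cart_store = cart_store

        def handle_request(self, user_id: str) -> dict:
            items = self.cart_store.get_items(user_id)  # fetch state from outside
            total = sum(item["price"] for item in items)
            # Nothing is retained here; this instance can be destroyed or replaced
            # at any time without losing data.
            return {"user": user_id, "items": items, "total": total}


    class StatefulCartStore:
        """Holds cart contents across requests; this is the part that needs disks,
        replication, and careful lifecycle management."""

        def __init__(self):
            self._carts = {}  # user_id -> list of items; state survives requests

        def add_item(self, user_id: str, item: dict) -> None:
            self._carts.setdefault(user_id, []).append(item)

        def get_items(self, user_id: str) -> list:
            return self._carts.get(user_id, [])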

In addition to the way information is stored, we’ve also seen a shift toward developing systems that embrace automated infrastructure deployment. These recent advances include the following:

  • Physical servers have given way to virtual machines (VMs) that are easy to deploy and maintain.

  • VMs, in turn, have given way to containers, which are simpler and focused on specific applications.

  • Containers have allowed infrastructure engineers to package an application’s operating system requirements into a single executable.

The use of containers has undoubtedly increased the consistency of deployments, which has made it easier to deploy and run infrastructure in bulk. No system has emerged to orchestrate the resulting explosion of containers quite like Kubernetes, as is evident from its incredible growth, which speaks to how well it solves the problem. The official documentation describes Kubernetes as follows:

Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

Kubernetes was originally designed for stateless workloads, and that is what it has traditionally done best. Kubernetes has developed a reputation as a “platform for building platforms” in a cloud native way. However, there’s a reasonable argument that a complete cloud native solution has to take data into account. That’s the goal of this book: exploring how we make it possible to build cloud native data solutions on Kubernetes. But first, let’s unpack what “cloud native” means.

What Is Cloud Native Data?

Let’s begin by examining the aspects of cloud native that will help us arrive at a definition of cloud native data. First, consider the definition of cloud native from the Cloud Native Computing Foundation (CNCF):

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.

These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.

Note that this definition describes a goal state, desirable characteristics, and examples of technologies that embody both. Based on this formal definition, we can synthesize the qualities that differentiate a cloud native application from other types of deployments in terms of how it handles data. Let’s take a closer look at these qualities:

Scalability
If a service can produce a unit of work for a unit of resources, adding more resources should increase the amount of work a service can perform. Scalability describes the service’s ability to apply additional resources to produce additional work. Ideally, services should scale infinitely given an infinite amount of compute, network, and storage resources. For data, this means scale without the need for downtime. Legacy systems required a maintenance period while adding new resources, during which all services had to be shut down. With the needs of cloud native applications, downtime is no longer acceptable.
Elasticity

Whereas scalability is about adding resources to meet demand, elasticity is the ability to free those resources when they are no longer needed. The difference between scalability and elasticity is highlighted in Figure 1-2. Elasticity can also be called on-demand infrastructure. In a constrained environment such as a private datacenter, this is critical for sharing limited resources. For cloud infrastructure that charges for every resource used, this is a way to avoid paying for running services you don’t need. When it comes to managing data, this means that we need capabilities to reclaim storage space and optimize our usage—for example, moving older data to less expensive storage tiers. The sketch following this list shows what scaling out and back in can look like in practice.

Figure 1-2. Comparing scalability and elasticity
Self-healing
Bad things happen. When they do, how will your infrastructure respond? Self-healing infrastructure will reroute traffic, reallocate resources, and maintain service levels. With larger and more complex distributed applications being deployed, this is an increasingly important attribute of a cloud native application. This is what keeps you from getting that 3 A.M. wake-up call. For data, this means we need capabilities to detect issues such as missing data or degraded data quality.
Observability
If something fails and you aren’t monitoring it, did it happen? Unfortunately, the answer is yes, and the failures you can’t see are often the worst kind. Distributed applications are highly dynamic, and visibility into every service is critical for maintaining service levels. Interdependencies can create complex failure scenarios, which is why observability is a key part of building cloud native applications. In data systems, the volumes of data that are now commonplace demand efficient ways of monitoring the flow of data and the state of the infrastructure. In most cases, early warnings of issues can help operators avoid costly downtime.
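Scalability and elasticity become tangible when you see them as operations against running infrastructure. The following is a minimal sketch, using the official Kubernetes Python client, of scaling a hypothetical database StatefulSet out to meet demand and back in to release resources; the namespace, name, and replica counts are assumptions, and a production system would drive these changes from observed metrics rather than hardcoded values.

    # Sketch: scale a hypothetical StatefulSet out (scalability) and back in
    # (elasticity) with the official Kubernetes Python client.
    # Assumes a reachable cluster and a StatefulSet named "mydb" in namespace "data".
    from kubernetes import client, config

    def set_replicas(name: str, namespace: str, replicas: int) -> None:
        apps = client.AppsV1Api()
        # Patch only the replica count; the StatefulSet controller reconciles the rest.
        apps.patch_namespaced_stateful_set_scale(
            name=name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )

    if __name__ == "__main__":
        config.load_kube_config()          # or load_incluster_config() inside a Pod
        set_replicas("mydb", "data", 5)    # scale out to meet demand
        # ... later, when demand subsides ...
        set_replicas("mydb", "data", 3)    # scale back in to release resources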

With all the previous definitions in place, let’s try a definition that expresses these properties:

Cloud native data approaches empower organizations that have adopted the cloud native application methodology to incorporate data holistically, rather than relying on legacy people, processes, and technology, so that data can scale up and down elastically while promoting observability and self-healing. This is exemplified by containerized data, declarative data, data APIs, data meshes, and cloud native data infrastructure (that is, databases, streaming, and analytics technologies that are themselves architected as cloud native applications).

For data infrastructure to keep parity with the rest of our application, we need to incorporate each of these pieces. This includes automation of scale, elasticity, and self-healing. APIs are needed to decouple services and increase developer velocity, as well as to enable you to observe the entire stack of your application and make critical decisions. Taken as a whole, your application and data infrastructure should appear as one unit.

More Infrastructure, More Problems

Whether your infrastructure is in a cloud, on premises, or both (commonly referred to as hybrid), you could spend a lot of time doing manual configuration. Typing things into an editor and doing incredibly detailed configuration work requires deep knowledge of each technology. Over the past 20 years, significant advances have come out of the DevOps community, both in expressing infrastructure as code and in the way we deploy it. This is a critical step in the evolution of modern infrastructure. DevOps has kept us ahead of the scale required for applications, but just barely. Arguably, the same amount of knowledge is needed to fully script a single database server deployment. It’s just that now we can do it a million times over (if needed) with templates and scripts. What has been lacking is a connectedness between the components and a holistic view of the entire application stack. Let’s tackle this problem together. (Foreshadowing: this is a problem that needs to be solved.)

As with any good engineering problem, let’s break it into manageable parts. The first is resource management. Regardless of the many ways we have developed to work at scale, fundamentally, we are trying to manage three things as efficiently as possible: compute, network, and storage, as shown in Figure 1-3. These are the critical resources that every application needs and the fuel that’s burned during growth. Not surprisingly, these are also the resources that carry the monetary cost of a running application. We get rewarded when we use these resources wisely and pay a high price, literally, if we don’t. Anywhere you run your application, these are the most primitive units. When on premises, everything is bought and owned. When using the cloud, we’re renting.

Figure 1-3. Fundamental resources of cloud applications: compute, network, and storage

The second part of the problem is having an entire stack act as a single entity. DevOps has provided many tools to manage individual components, but the connective tissue between them provides the potential for incredible efficiency—similarly to how applications are packaged for the desktop but working at datacenter scales. That potential has launched an entire community around cloud native applications. These applications are similar to what we’ve always deployed. The difference is that modern cloud applications aren’t a single process with business logic. They are a complex coordination of many containerized processes that need to communicate securely and reliably. Storage has to match the current needs of the application, but remain aware of how it contributes to the stability of the application. When we think of deploying stateless applications without data managed in the same control plane, it sounds incomplete because it is. Breaking your application components into different control planes creates more complexity and thus goes against the ideals of cloud native.

Kubernetes Leading the Way

As mentioned before, DevOps automation has kept us on the leading edge of meeting scale needs. Containerization produced a need for much better orchestration, and Kubernetes has answered that need. For operators, describing a complete application stack in a deployment file makes for reproducible and portable infrastructure, because Kubernetes has gone far beyond the simple deployment management popular in the DevOps tool bag. The Kubernetes control plane applies the deployment requirements across the underlying compute, network, and storage to manage the entire application infrastructure lifecycle. The desired state of your application is maintained even when the underlying hardware changes. Instead of deploying VMs, we’re now deploying virtual datacenters as a complete definition, as shown in Figure 1-4.

The rise in popularity of Kubernetes has eclipsed all other container orchestration tools used in DevOps. It has overtaken every other way we deploy infrastructure and shows no signs of slowing down. However, the bulk of early adoption was primarily in stateless services.

Managing data infrastructure at a large scale was a problem well before the move to containers and Kubernetes. Stateful services like databases took a different track parallel to the Kubernetes adoption curve. Many experts advised that Kubernetes was the wrong way to run stateful services and that those workloads should remain outside of Kubernetes. That approach worked until it didn’t, and many of those same experts are now driving the needed changes in Kubernetes to converge the entire stack.

Figure 1-4. Moving from virtual servers to virtual datacenters

So, what are the challenges of stateful services? Why has it been hard to deploy data infrastructure with Kubernetes? Let’s consider each component of our infrastructure.

Managing Compute on Kubernetes

In data infrastructure, counting on Moore’s law has made upgrading a regular event. Moore’s law predicted that computing capacity would double every 18 months. If your requirements double every 18 months, you can keep up by replacing hardware. Eventually, raw compute power started leveling out. Vendors started adding more processors and cores to keep up with Moore’s law, leading to single-server resource sharing with VMs and containers, and enabling us to tap into the vast pools of computing power left stranded in islands of physical servers. Kubernetes expanded the scope of compute resource management by considering the total datacenter as one large resource pool across multiple physical devices.

Sharing compute resources with other services is somewhat taboo in the data world. Data workloads are typically resource intensive, and the potential of one service impacting another (known as the noisy neighbor problem) has led to policies of keeping them isolated from other workloads. This one-size-fits-all approach eliminates the possibility of more significant benefits. First, it assumes that every data service has the same resource requirements. Apache Pulsar brokers can have far fewer requirements than an Apache Spark worker, and neither is similar to a sizable MySQL instance used for online analytical processing (OLAP) reporting. Second, the ability to decouple your underlying hardware from running applications gives operators a lot of undervalued flexibility. Cloud native applications that need scale, elasticity, and self-healing need what Kubernetes can deliver. Data is no exception.
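Resource requests and limits are the primary mechanism Kubernetes offers for keeping neighbors from getting noisy: the scheduler uses the request to place the workload and the kubelet enforces the limit. The following is a minimal sketch using the official Kubernetes Python client; the container name, image, and values are assumptions for illustration, not sizing recommendations.

    # Sketch: declaring compute requests and limits for a hypothetical database
    # container so the scheduler can place it without starving its neighbors.
    from kubernetes import client

    db_container = client.V1Container(
        name="mydb",
        image="example.com/mydb:latest",        # hypothetical image
        resources=client.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "8Gi"},  # reserved for this container
            limits={"cpu": "4", "memory": "8Gi"},    # hard ceiling enforced by the kubelet
        ),
    )

    pod_spec = client.V1PodSpec(containers=[db_container])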

Managing Network on Kubernetes

Building a distributed application, by nature, requires a reliable and secure network. Cloud native applications increase the complexity of adding and subtracting services, making dynamic network configuration a new requirement. Kubernetes manages all of this inside your virtual datacenter automatically. When new services come online, it’s like a virtual network team springs into action: IP addresses are assigned, routes are created, DNS entries are added, the virtual security team ensures that firewall rules are in place, and, when asked, Transport Layer Security (TLS) certificates provide end-to-end encryption.

Data infrastructure tends to be far less dynamic than something like microservices. A fixed IP address with a hostname has been the norm for databases. Analytic systems like Apache Flink are dynamic in their processing but have fixed hardware addressing assignments. Quality of service is typically at the top of the requirements list, and, as a result, the desire for dedicated hardware and dedicated networks has turned administrators away from Kubernetes.

The advantage of data infrastructure running in Kubernetes is less about past requirements and more about what’s needed for the future. Scaling resources dynamically can create a waterfall of dependencies. Automation is the only way to maintain clean and efficient networks, which are the lifeblood of distributed, stateless systems. The future of cloud native applications will include more components and new challenges, such as where applications will run. We can add regulatory compliance and data sovereignty to previous concerns about latency and throughput. The declarative nature of Kubernetes networking makes it a perfect fit for data infrastructure.
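One example of how that declarative networking plays out for data: a headless Service gives each database node a stable DNS name even as Pods move between machines, replacing the fixed IP addresses administrators are used to. Here is a minimal sketch using the Python client; the Service name, namespace, labels, and port are assumptions.

    # Sketch: a headless Service that gives each node of a hypothetical database
    # StatefulSet a stable DNS name (e.g., mydb-0.mydb.data.svc.cluster.local).
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    service = client.V1Service(
        metadata=client.V1ObjectMeta(name="mydb", namespace="data"),
        spec=client.V1ServiceSpec(
            cluster_ip="None",                 # headless: no virtual IP, DNS per Pod
            selector={"app": "mydb"},          # matches the StatefulSet's Pod labels
            ports=[client.V1ServicePort(name="sql", port=5432)],
        ),
    )
    core.create_namespaced_service(namespace="data", body=service)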

Managing Storage on Kubernetes

Any service that provides persistence or analytics over large volumes of data will need the right kind of storage device. Early versions of Kubernetes considered storage a basic commodity part of the stack and assumed that most workloads were ephemeral. For data, this was a huge mismatch—you can’t let your Postgres datafiles get deleted every time a container is moved. Additionally, at the outset, the underlying block storage ranged from high-performance NVMe disks to old 5400 RPM spinning disks, and you could not always be certain what type of hardware you’d get. Thankfully, this has been an essential focus of Kubernetes over the past few years and has significantly improved.

With the addition of features like StorageClasses, it is possible to address specific requirements for performance, capacity, or both. With automation, we can avoid reaching the point where we don’t have enough of either. Avoiding surprises is the domain of capacity management: provisioning the needed capacity up front and growing it when required. When you run out of storage capacity, everything grinds to a halt.
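To make the StorageClass idea concrete, the following minimal sketch requests a volume of a specific class and size with the Python client; the class name "fast-ssd", the namespace, and the size are assumptions that depend on what your cluster actually offers.

    # Sketch: requesting storage of a specific class and size for a data workload.
    # The StorageClass "fast-ssd" is hypothetical; list the real options on your
    # cluster with `kubectl get storageclass`.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="mydb-data", namespace="data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="fast-ssd",
            resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
        ),
    )
    core.create_namespaced_persistent_volume_claim(namespace="data", body=pvc)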

Coupling the distributed nature of Kubernetes with data storage opens up more possibilities for self-healing. Automated backups and snapshots keep you ready for potential data loss scenarios. Placing compute and storage together minimizes hardware failure risks and allows automatic recovery to the desired state when the inevitable failure occurs. All of this makes the data storage aspects of Kubernetes much more attractive.

Cloud Native Data Components

Now that we have defined the resources consumed in cloud native applications, let’s clarify the types of data infrastructure that power them. Instead of providing a comprehensive list of every possible product, we’ll break the infrastructure into larger buckets with similar characteristics:

Persistence
This is likely the category you think of first when we talk about data infrastructure. These systems store data and provide access to it through some form of query: relational databases like MySQL and Postgres, and NoSQL systems like Apache Cassandra and MongoDB. They have been the last holdouts to migrate to Kubernetes because of their strict resource needs and high-availability requirements. Databases are usually critical to a running application and central to every other part of the system.
Streaming
The most basic function of streaming is facilitating the high-speed movement of data from one point to another. Streaming systems provide a variety of delivery semantics based on the use case. In some cases, data can be delivered to many clients, or, when strict controls are needed, delivered only once. A further enhancement of streaming is the addition of processing: altering or enhancing data mid-transport. The need for faster insights into data has propelled streaming analytics into mission-critical status, catching up with persistence systems in terms of importance. Examples of streaming systems that move data are Apache Pulsar and Apache Kafka, whereas examples of processing systems are Apache Flink and Apache Storm.
Batch analytics
One of the first problems in big data was analyzing large sets of data to gain insights or to repurpose it into new data. Apache Hadoop was the first large-scale system for batch analytics, and it set the expectations around using large volumes of compute and storage, coordinated to produce the results of complex analytic processes. Typically, these are issued as jobs distributed throughout the cluster, as is common with Spark. The concern with costs can be much more prevalent in these systems because of the sheer volume of resources needed. Orchestration systems help mitigate those costs through intelligent resource allocation.

Looking Forward

There is a compelling future with cloud native data. The path we take between what we have available today and what we can have in the future is up to us: the community of people responsible for data infrastructure. Just as we have always done, we see a new challenge and take it on. There is plenty for everyone to do here, but the result could be pretty amazing and raise the bar yet again.

Rick’s point is specifically about databases, but we can extrapolate his call to action for our data infrastructure running on Kubernetes. Unlike deploying a data application on physical servers, introducing the Kubernetes control plane requires a conversation with the services it runs.

Getting Ready for the Revolution

As engineers who create and run data infrastructure, we have to be ready for coming advancements, both in the way we operate and the mindset we have about the role of data infrastructure. The following sections describe what you can do to be ready for the future of cloud native data running in Kubernetes.

Adopt an SRE Mindset

The role of site reliability engineering (SRE) has grown with the adoption of cloud native methodologies. If we intend our infrastructure to converge, we as data infrastructure engineers must learn new skills and adopt new practices. Let’s begin with the Wikipedia definition of SRE:

Site reliability engineering is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.

Deploying data infrastructure has been primarily concerned with the specific components deployed—the “what.” For example, you may find yourself focused on deploying MySQL at scale or using Spark to analyze large volumes of data. Adopting an SRE mindset means going beyond what you are deploying and focusing more on the how. How will all the pieces work together to meet the application’s goals? A holistic deployment view considers the way each piece will interact, the required access, including security, and the observability of every aspect to ensure that service levels are met.

If your current primary or secondary role is database administrator (DBA), there is no better time to make the transition. The trend on LinkedIn shows a year-over-year decrease in the DBA role and a massive increase for SREs. Engineers who have learned the skills required to run critical database infrastructure have an essential baseline that translates into what’s needed to manage cloud native data. These needs include the following:

  • Availability

  • Latency

  • Change management

  • Emergency response

  • Capacity management

New skills need to be added to this list to adapt to the broader responsibility of the entire application. You may already have some of these skills; they include the following:

CI/CD pipelines
Embrace the big picture of taking code from repository to production. Nothing accelerates application development in an organization more. Continuous integration (CI) builds new code into the application stack and automates all testing to ensure quality. Continuous delivery (CD) takes the fully tested and certified builds and automatically deploys them into production. Used in combination as a pipeline, these practices can drastically increase developer velocity and productivity.
Observability
DevOps practitioners like to make a distinction between the “what” (the actual service you’re deploying) and the “how” (the methodology of deploying that service). Monitoring is something everyone with experience in infrastructure is familiar with. In the “what” part of DevOps, the properties you monitor let you know your services are healthy, and give you the information needed to diagnose problems. Observability expands monitoring into the “how” of your application by considering everything as a whole—for example, tracing the source of latency in a highly distributed application by giving insight into every hop that data takes as it traverses your system.
Knowing the code
When things go bad in a large, distributed application, the cause is not always a process failure. In many cases, the problem could be a bug in the code or a subtle implementation detail. Being responsible for the entire health of the application, you will need to understand the code that is executing in the provided environment. Properly implemented observability will help you find problems, and that includes the software instrumentation. SREs and development teams need to have clear and regular communication, and code is common ground.

Embrace Distributed Computing

Deploying your applications in Kubernetes means embracing all that distributed computing offers. When you are accustomed to single-system thinking, that transition can be hard, mainly because of the shift in expectations and in understanding where problems crop up. For example, with every process contained in a single system, latency is close to zero and isn’t something you have to manage; CPU and memory resources are the primary concern. In the 1990s, Sun Microsystems was a leader in the growing field of distributed computing and published this list of eight common fallacies of distributed computing:

  • The network is reliable.

  • Latency is zero.

  • Bandwidth is infinite.

  • The network is secure.

  • Topology doesn’t change.

  • There is one administrator.

  • Transport cost is zero.

  • The network is homogeneous.

Behind each of these fallacies is surely the story of a developer who made a bad assumption, got an unexpected result, and lost countless hours trying to solve the wrong problem. Embracing distributed methodologies is worth the effort in the long run. They allow us to build large-scale applications and will continue to do so for a long time. The challenge is worth the reward, and for those of us who do this daily, it can be a lot of fun too! Kubernetes applications will test each of these fallacies, given the platform’s inherently distributed nature. When you plan your deployment, consider things such as the cost of transporting data from one place to another or the implications of latency. Doing so will save you a lot of wasted time and redesign.

Principles of Cloud Native Data Infrastructure

As engineering professionals, we seek standards and best practices to build upon. To make data the most “cloud native” it can be, we need to embrace everything Kubernetes offers. A truly cloud native approach means adopting key elements of the Kubernetes design paradigm and building from there. An entire cloud native application that includes data must be one that can run effectively on Kubernetes. Let’s explore a few Kubernetes design principles that point the way.

Principle 1: Leverage compute, network, and storage as commodity APIs

One of the keys to the success of cloud computing is the commoditization of compute, networking, and storage as resources we can provision via simple APIs. Consider this sampling of AWS services:

Compute
We allocate VMs through Amazon Elastic Compute Cloud (EC2) and Auto Scaling groups (ASGs).
Network
We manage traffic using Elastic Load Balancers (ELB), Route 53, and virtual private cloud (VPC) peering.
Storage
We persist data using options such as the Simple Storage Service (S3) for long-term object storage, or Elastic Block Store (EBS) volumes for our compute instances.

Kubernetes offers its own APIs to provide similar services for a world of containerized applications:

Compute
Pods, Deployments, and ReplicaSets manage the scheduling and lifecycle of containers on computing hardware.
Network
Services and Ingress expose a container’s networked interfaces.
Storage
PersistentVolumes (PVs) and StatefulSets enable flexible association of containers to storage.

Kubernetes resources promote the portability of applications across Kubernetes distributions and service providers. What does this mean for databases? They are simply applications that leverage compute, networking, and storage resources to provide the services of data persistence and retrieval:

Compute
A database needs sufficient processing power to process incoming data and queries. Each database node is deployed as a Pod and grouped into StatefulSets, enabling Kubernetes to manage scaling out and scaling in.
Network
A database needs to expose interfaces for data and control. We can use Kubernetes Services and Ingress controllers to expose these interfaces.
Storage
A database uses PersistentVolumes of a specified StorageClass to store and retrieve data.

Thinking of databases in terms of their compute, network, and storage needs removes much of the complexity involved in deployment on Kubernetes.
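Putting those three mappings together, the following is a minimal sketch, using the official Kubernetes Python client, of a hypothetical three-node database expressed as a StatefulSet with per-node volumes; every name, image, port, and size here is an assumption for illustration rather than a recommended configuration.

    # Sketch: a hypothetical three-node database expressed as Kubernetes compute
    # (StatefulSet/Pods), network (a governing Service name), and storage
    # (per-Pod PersistentVolumeClaims from volume_claim_templates).
    from kubernetes import client, config

    labels = {"app": "mydb"}

    statefulset = client.V1StatefulSet(
        metadata=client.V1ObjectMeta(name="mydb", namespace="data"),
        spec=client.V1StatefulSetSpec(
            service_name="mydb",                       # headless Service, as sketched earlier
            replicas=3,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="mydb",
                            image="example.com/mydb:latest",   # hypothetical image
                            ports=[client.V1ContainerPort(container_port=5432)],
                            volume_mounts=[
                                client.V1VolumeMount(name="data", mount_path="/var/lib/mydb")
                            ],
                        )
                    ]
                ),
            ),
            volume_claim_templates=[
                client.V1PersistentVolumeClaim(
                    metadata=client.V1ObjectMeta(name="data"),
                    spec=client.V1PersistentVolumeClaimSpec(
                        access_modes=["ReadWriteOnce"],
                        storage_class_name="fast-ssd",     # hypothetical StorageClass
                        resources=client.V1ResourceRequirements(
                            requests={"storage": "100Gi"}
                        ),
                    ),
                )
            ],
        ),
    )

    if __name__ == "__main__":
        config.load_kube_config()
        client.AppsV1Api().create_namespaced_stateful_set(namespace="data", body=statefulset)

The StatefulSet covers compute, the governing Service name ties into networking, and the volume claim templates request storage, which is exactly the decomposition described above.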

Principle 2: Separate the control and data planes

Kubernetes promotes the separation of control and data planes. The Kubernetes API server is the front door of the control plane, providing the interface used by the data plane to request computing resources, while the control plane manages the details of mapping those requests onto an underlying infrastructure-as-a-service (IaaS) platform.

We can apply this same pattern to databases. For example, a database’s data plane consists of ports exposed for clients and, for distributed databases, ports used for communication between database nodes. The control plane includes interfaces provided by the database for administration and metrics collection, as well as tooling that performs operational maintenance tasks. Much of this capability can and should be implemented via the Kubernetes operator pattern. Operators define custom resource definitions (CRDs) and provide control loops that observe the state of those resources, taking actions to move them toward the desired state, which helps extend Kubernetes with domain-specific logic.
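A full operator is well beyond the scope of this chapter, but the core of the pattern, a control loop, is small enough to sketch. The snippet below watches a hypothetical MyDatabase custom resource (its group, version, and plural are invented for illustration) and stubs out the reconcile step a real operator would fill in; production operators are usually built with frameworks such as Kopf, the Operator SDK, or Kubebuilder rather than written this directly.

    # Sketch: the skeleton of an operator's control loop. It watches a
    # hypothetical "MyDatabase" custom resource and calls a reconcile stub.
    # The group/version/plural values are assumptions for illustration.
    from kubernetes import client, config, watch

    GROUP, VERSION, PLURAL = "example.com", "v1", "mydatabases"

    def reconcile(resource: dict) -> None:
        spec = resource.get("spec", {})
        name = resource["metadata"]["name"]
        # A real operator would compare the declared spec with the observed state
        # (StatefulSets, Services, PVCs, ...) and create, patch, or delete
        # resources until they match.
        print(f"reconciling {name}: desired nodes = {spec.get('nodes')}")

    def main() -> None:
        config.load_kube_config()
        api = client.CustomObjectsApi()
        w = watch.Watch()
        # Stream add/modify/delete events for the custom resource cluster-wide.
        for event in w.stream(api.list_cluster_custom_object, GROUP, VERSION, PLURAL):
            if event["type"] in ("ADDED", "MODIFIED"):
                reconcile(event["object"])

    if __name__ == "__main__":
        main()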

Principle 3: Make observability easy

The three pillars of observable systems are logging, metrics, and tracing. Kubernetes provides a great starting point by exposing the logs of each container to third-party log aggregation solutions. Multiple solutions are available for metrics, tracing, and visualization, and we’ll explore several of them in this book.
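As a small example of the logging pillar, the sketch below pulls the most recent log lines from the containers of hypothetical database Pods through the Kubernetes API, the same surface that log aggregation agents build on; the namespace and label selector are assumptions.

    # Sketch: reading recent logs from the containers of hypothetical database
    # Pods, using the same API surface that log aggregation agents build on.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    pods = core.list_namespaced_pod(namespace="data", label_selector="app=mydb")
    for pod in pods.items:
        for container in pod.spec.containers:
            logs = core.read_namespaced_pod_log(
                name=pod.metadata.name,
                namespace="data",
                container=container.name,
                tail_lines=20,                # only the most recent lines
            )
            print(f"--- {pod.metadata.name}/{container.name} ---")
            print(logs)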

Principle 4: Make the default configuration secure

Kubernetes networking is secure by default: ports must be explicitly exposed in order to be accessed externally to a pod. This sets a valuable precedent for database deployment, forcing us to think carefully about how each control plane and data plane interface will be exposed and which interfaces should be exposed via a Kubernetes Service. Kubernetes also provides facilities for secret management that can be used for sharing encryption keys and configuring administrative accounts.
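For example, database credentials belong in a Secret rather than in a container image or a manifest checked into source control. The following minimal sketch creates a hypothetical credentials Secret with the Python client; the names are invented, and in practice the values would come from a secure source such as an external secrets manager.

    # Sketch: storing hypothetical database credentials in a Kubernetes Secret
    # instead of baking them into images or manifests. The values shown are
    # placeholders; in practice they would come from a secure source.
    import secrets
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    secret = client.V1Secret(
        metadata=client.V1ObjectMeta(name="mydb-credentials", namespace="data"),
        string_data={
            "username": "admin",
            "password": secrets.token_urlsafe(24),   # generated, not hardcoded
        },
    )
    core.create_namespaced_secret(namespace="data", body=secret)
    # Pods can then consume this Secret as environment variables or a mounted volume.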

Principle 5: Prefer declarative configuration

In the Kubernetes declarative approach, you specify the desired state of resources, and controllers manipulate the underlying infrastructure in order to achieve that state. Operators for data infrastructure can manage the details of how to scale up intelligently—for example, deciding how to reallocate shards or partitions when scaling out additional nodes or selecting which nodes to remove to scale down elastically.

The next generation of operators should enable us to specify rules for stored data size, number of transactions per second, or both. Perhaps we’ll be able to specify maximum and minimum cluster sizes, and when to move less frequently used data to object storage. This will allow for more automation and efficiency in our data infrastructure.
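Nothing like this exists as a standard API today, so the sketch below is entirely hypothetical: it imagines the kind of declared policy a future data operator might reconcile against observed metrics, with invented rule names and thresholds, simply to illustrate the shape of the idea.

    # Sketch: entirely hypothetical "policy" rules a future data operator might
    # reconcile against observed metrics. Names and thresholds are illustrative.
    from dataclasses import dataclass

    @dataclass
    class ScalingPolicy:
        min_nodes: int
        max_nodes: int
        max_tps_per_node: int         # transactions per second a node should serve
        max_storage_per_node_gi: int  # stored data a node should hold

    def desired_nodes(policy: ScalingPolicy, observed_tps: float, stored_gi: float) -> int:
        """Pick a node count that satisfies both the throughput and storage rules."""
        for_tps = -(-int(observed_tps) // policy.max_tps_per_node)          # ceiling division
        for_storage = -(-int(stored_gi) // policy.max_storage_per_node_gi)  # ceiling division
        wanted = max(for_tps, for_storage, policy.min_nodes)
        return min(wanted, policy.max_nodes)

    policy = ScalingPolicy(min_nodes=3, max_nodes=12,
                           max_tps_per_node=5000, max_storage_per_node_gi=500)
    print(desired_nodes(policy, observed_tps=23_000, stored_gi=1_800))  # -> 5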

Summary

At this point, we hope you are ready for the exciting journey in the pages ahead. The move to cloud native applications must include data, and to do this, we will leverage Kubernetes to manage both stateless and stateful services. This chapter covered cloud native data infrastructure that can scale elastically and resist downtime due to system failures, and how to begin building these systems. We as engineers must embrace the principles of cloud native infrastructure and, in some cases, learn new skills. Congratulations—you have begun a fantastic journey into the future of building cloud native applications. Turn the page, and let’s go!
