Chapter 4. Platform Architecture
This chapter will dive deeper into the details of what it means to build out a platform architecture that enables organizations to create applications, run them in an optimized infrastructure, and efficiently manage application life cycles. For data pipelines, this includes ingestion, storage, processing, and monitoring services on Kubernetes.
In this chapter we’ll cover the container orchestration, software, and hardware layers needed to create a complete platform that eliminates data silos and enables workflows. It will serve as a general architecture overview. Chapter 5 will cover specific examples that would be implemented on such an architecture and how they might be deployed.
Hardware and Machine Layers
Figure 4-1 shows the architecture for OpenShift, an enterprise Kubernetes distribution. We won’t go into detail about all of the layers shown in this diagram—detailed information can be found in the OpenShift documentation. We will use the OpenShift architecture to highlight some key Kubernetes layers that play a critical role in the platform architecture for data pipelines.
Hardware
Layer 1 describes the type of hardware on which Kubernetes can be deployed. If you have an on-premises data center, this would be your bare metal hardware or virtual servers. If you’re deploying to the public or private cloud, you may choose AWS EC2 instances, Azure Virtual Machines, or other options.
One benefit of Kubernetes is the ease with which it allows you to combine hardware types. In our internal Red Hat data center, we run our on-premises workloads on a mixture of bare metal nodes and virtual servers in a single Kubernetes cluster (see Figure 4-2). This gives us more flexibility for processes that demand low-latency performance, such as running Elasticsearch on bare metal nodes backed by NVMe storage.
In Figure 4-2, hot data contains the most recent indexes, which demand high read and write throughputs due to the frequency of data coming in and of users querying the results. An example of hot data would be the last seven days of systems log data. Warm data might be systems logs older than seven days that are read-only and only occasionally queried. In this situation, spinning disks with higher latency and cheaper costs are sufficient.
Physical and Virtual Machines
Let’s look next at the node layer (layer 2 in Figure 4-1). In Kubernetes, physical and virtual machines are mapped to nodes (Figure 4-3). In most instances, each node contains many pods, each running a specific containerized application. For example, a single node can host a mix of pods running Kafka brokers, Flask web applications, and application microservices.
You can let the Kubernetes native scheduler take care of prioritizing the node in which an application should reside, or you can create a custom scheduler. (If you want to know about Kubernetes schedulers, you can learn more in the Kubernetes documentation.) We’ll expand this layer a bit more to show how you can deploy applications and microservices to create a data processing pipeline.
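If you do build a custom scheduler, a pod opts into it through the schedulerName field of its spec. A minimal sketch follows; the scheduler name and image are hypothetical placeholders, not components of the example pipeline:

apiVersion: v1
kind: Pod
metadata:
  name: flask-web
spec:
  # Hand this pod to a custom scheduler instead of the default
  # Kubernetes scheduler (hypothetical scheduler name).
  schedulerName: my-custom-scheduler
  containers:
  - name: flask-web
    image: flask-web:latest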
Persistent Storage
Depending on the application you’re running in a pod, you may need persistent storage. If you have a stateless application or a model being served as a microservice, persistent storage most likely isn’t required. If you are running a database or a message bus, or simply need workspace for an interactive online workbench, then layer 3 (see Figure 4-1) is required. Each container in a pod can have one or more persistent storage volumes attached to it (see Figure 4-4). In the case of a database, you might have one for log files, another for temporary disk space, and yet another for the actual database data.
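To illustrate, a database pod might mount separate volumes for its data and its logs, each backed by its own PersistentVolumeClaim. The claim names and mount paths below are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: postgresql
spec:
  containers:
  - name: postgresql
    image: postgres:13
    volumeMounts:
    - name: db-data                  # actual database data
      mountPath: /var/lib/postgresql/data
    - name: db-logs                  # log files on a separate volume
      mountPath: /var/log/postgresql
  volumes:
  - name: db-data
    persistentVolumeClaim:
      claimName: postgresql-data
  - name: db-logs
    persistentVolumeClaim:
      claimName: postgresql-logs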
As we discussed in Chapter 3, picking the right storage for your workloads can be very important. With Elasticsearch, it was critical that we deployed hot data to NVMe-backed nodes because of the extremely high volumes of reads and writes that are required. Slower disks would cause significant bottlenecks in performance and eventually lead to the Elasticsearch application nodes losing data due to lack of throughput (which we experienced firsthand).
Your typical web application PostgreSQL database doesn’t need as much low-latency storage, so regular spinning disk storage might be fine. In Kubernetes, you can mix and match storage classes and request which one to use as part of the deployment configuration for applications.
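Requesting a particular class of storage is done on the PersistentVolumeClaim itself. In this sketch, the nvme-ssd class is a hypothetical name standing in for whatever storage classes your cluster administrator has defined:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: elasticsearch-hot-data
spec:
  # storageClassName selects the class of storage backing this claim;
  # "nvme-ssd" is a hypothetical class defined by the cluster admin.
  storageClassName: nvme-ssd
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi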
Networking
Last, we have layers 4 and 5 in Figure 4-1, the routing (networking) and service layers (see Figure 4-5). These layers give external systems access to the applications running in the pods. The service layer provides load balancing across replicated pods in an application deployment. The routing layer gives developers the option to expose IP addresses to external clients, which is useful if you have a user interface or REST API endpoint that needs to be accessible to the outside world.
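A minimal sketch of these two layers for a REST API might look like the following: a Service load balances across pods labeled app: sentiment-api, and an OpenShift Route exposes it externally. The names and port are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: sentiment-api
spec:
  selector:
    app: sentiment-api          # load balance across matching pods
  ports:
  - port: 8080
    targetPort: 8080
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: sentiment-api
spec:
  to:
    kind: Service
    name: sentiment-api         # expose the service to external clients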
If you want to learn more about Kubernetes architecture in general, visit Kubernetes Concepts. A deeper dive into OpenShift’s architecture can be found in the OpenShift documentation.
Data and Software Layers
Now that you know a bit more about the Kubernetes platform architecture, let’s get into what it means for data. A good platform architecture should remove barriers to processing data efficiently. Data engineers need an environment that scales to handle the vast amounts of information flowing through their systems. Many intelligent applications are built with machine learning models that require low-latency access to data even as volumes grow to massive scales.
Hadoop Lambda Architecture
One of the more popular pipelines for low-latency processing of big data stores is the lambda architecture. With the increased popularity of streaming data, IoT devices, system logs, multiple data warehouses, and unstructured data lakes across organizations, new challenges have surfaced for keeping up with demands for analyzing so much information.
What if we wanted to analyze the sentiment of incoming tweets from a live event on Twitter? A first attempt at building an intelligent application for this might be a batch job that runs nightly. But what if key business decisions need to be made in minutes instead of days in order to react quickly during the event? Even if your nightly job can keep up as the volume of tweets increases over time, it still isn’t capable of reacting in minutes.
With a lambda architecture, engineers can design applications that not only can react to incoming streams of data in real time but also can gain additional insights by running more detailed data processing in batch periodically. Various Hadoop distributions have become a fixture in many data centers because of their flexibility in providing such capabilities. If you were to use Hadoop to deploy a lambda architecture for sentiment analysis detection, it might look something like Figure 4-6.
In the Hadoop lambda architecture, applications are typically deployed on physical or virtual machines. Kafka and Spark Streaming are used to process tweets and analyze sentiments in real time. The results are then sent to an external system where actions can be taken immediately. In this example, an Apache Cassandra database stores the results and is queried in real time.
The batch pipeline of the architecture is used to train the sentiment analysis model periodically as new data comes in and to validate the changes against a set of known data points. This helps to ensure that the model maintains accuracy over time. The new model is then re-deployed to the real-time pipeline, which is a Spark Streaming application in this case.
Now let’s see what this same architecture looks like with Kubernetes.
Kubernetes Lambda Architecture
On the surface, these two implementations look very similar (see Figure 4-7), and that’s a good thing. You get all the benefits of a lambda architecture with Kubernetes, just as you would if you ran these workloads on Hadoop in the public cloud or in your own data center. Additionally, you get the benefit of Kubernetes orchestrating your container workloads.
One change between the two example lambda implementations is that an object store is used to collect data instead of HDFS. This separates storage from compute so that the system hardware can be scaled out independently based on demand. For example, if you need more nodes due to increased traffic in Kafka, you can scale out without affecting where the actual data is stored. Another change is that the services run in containerized Kubernetes pods, so resources for both the real-time and batch pipelines are managed and allocated on demand.
Finally, since Kubernetes has orchestration built into it, we can use a cron job for simple model training instead of a workflow scheduler such as Apache Oozie. Here is an example of what the YAML definition for the training job might look like:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sentiment-analysis-train
spec:
  schedule: "0 22 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sentiment-analysis-train
            image: sentiment-analysis-train
            env:
            - name: MODEL_NAME
              value: "${MODEL_NAME}"
            - name: MODEL_VERSION
              value: "${MODEL_VERSION}"
            - name: NUM_STEPS
              value: "${NUM_STEPS}"
          restartPolicy: OnFailure
parameters:
- name: MODEL_NAME
  description: Name of the model to be trained
  value: SentimentAnalysis
- name: MODEL_VERSION
  description: Version of the model to be trained
  value: "1"
- name: NUM_STEPS
  description: Number of training steps
  value: "500"
For more information on CronJob objects, visit the Kubernetes documentation.
Each part of the pipeline is also deployed as a microservice. This includes the Kafka cluster, object store, Spark MLlib, Spark Streaming application, and Cassandra cluster. Peeking into the deployed pods on the nodes may reveal a workload orchestration similar to Figure 4-8.
In a production Kubernetes cluster where performance is critically important, having your Kafka pods on the same node as the object store pods could create CPU or memory contention during high-volume workloads. To maximize the usage of your nodes and minimize contention between applications with drastically different usage patterns, you can use Kubernetes features such as nodeSelector and affinity/anti-affinity rules to ensure that certain workloads are deployed only to specific nodes, as indicated by labels. A better, more scalable orchestration might look like Figure 4-9.
You can read more about this in the documentation on pod assignment.
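As a rough sketch of how such rules are expressed, a Kafka broker StatefulSet might pin itself to nodes labeled for messaging workloads and keep its pods off nodes already running object store pods. The labels, image, and replica count below are hypothetical:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      # Only schedule brokers onto nodes labeled for messaging
      # workloads (hypothetical label applied by the cluster admin).
      nodeSelector:
        workload-type: messaging
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: object-store   # keep brokers off object store nodes
            topologyKey: kubernetes.io/hostname
      containers:
      - name: kafka
        image: kafka:latest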
Summary
Not every architecture will look exactly the same, but some useful and reusable patterns will emerge in many places that look quite similar to one another. Some of the suggestions here should help you structure your architecture properly. If you have to make changes as you go, don’t sweat it—one of the greatest advantages of deploying with Kubernetes is its flexibility.
In Chapter 5, we’ll take a look at some specific examples that build on these architecture concepts to power intelligent applications.