Monitoring cloud-native applications
What to consider when evolving your monitoring strategy as you adopt a cloud-native approach.
Cloud-native applications are increasingly being adopted across the industry. Today it is no longer only companies like Netflix, Amazon or Gilt who build these highly scalable and dynamic applications. At least every other week I learn about companies, even those in very traditional spaces like finance and insurance, rebuilding their application stack to move towards a cloud-native approach. As adoption of cloud-native applications increases, we can also see how it affects the way we monitor applications. At first you might think that these applications are “just another technology to monitor”. Once you take a closer look, however, you realize that this is not just about a new set of technologies, but rather an entire paradigm shift in how we build, manage and run applications. Best practices from five years ago have become anti-patterns.
Some examples of these anti-patterns and their “best practice” counterparts
Before we dive into the details of the impact of this new paradigm on monitoring, let’s define what a cloud-native application looks like.
Unlike the name suggests, cloud-native applications are about much more than just the cloud. In fact, “the cloud” has evolved to mean much more than just a set of compute resources. Today “the cloud” refers to an entire paradigm about how applications are built, whether on-premise or off. We can break this paradigm down into a set of key aspects.
With all these substantial changes to the way we deliver applications, we need to ask what the effect on monitoring is. Can we still use our good old monitoring stack, or is something new or different required? From a 10,000-foot view, not much has actually changed. We still monitor infrastructure and applications. The devil is in the details. Having had the chance to look at and work with companies adopting cloud-native applications, here is what we have learned.
Depending on the maturity of your organization, your monitoring stack might be rather mature already. A combination of infrastructure monitoring covering hosts and the network, application monitoring covering service-level data like errors or response times, real-user monitoring covering end-user response times and errors as well as behavioral data, and log monitoring goes a long way. These information sources are as valuable in a cloud-native world as they were with traditional applications.
So while the overall approach to monitoring does not need to change, it is in the details and exact requirements where you need to evolve your monitoring practices. If your monitoring stack, however, leaves some of the areas mentioned above as blind spots, you need to complete this picture. In a traditional datacenter, it was fine for the development team to not worry too much about the infrastructure, as this was the responsibility of dedicated teams. In a cloud-native scenario, however, the development teams have end-to-end responsibility, which includes a well-functioning infrastructure.
Polyglot development means using languages and databases as we see fit. Traditionally, companies defined a common stack to use throughout the entire company. Java EE used to be a very prominent example here. The same was true for databases: companies rarely used more than one database vendor for the current technology stack. In modern environments this standardization is replaced by autonomy in the individual teams. If a team decides that Node.js is a better fit for their API than Java, they go forward with this decision. This trend has naturally increased the complexity of the environments we need to monitor. While traditionally it was easy to build up the expertise to manage applications, in a cloud-native context this knowledge is now spread across a potentially large number of specialized feature teams responsible for product search, payment, special offers calculation or customer account management.
These environments also require a monitoring approach that can cover a much broader range of technologies. At the same time standardized means to describe monitoring data are required.
statsd is one candidate here for simple metrics.
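As a rough illustration of why statsd works as a common denominator across languages, here is a minimal Python sketch of its plain-text line format (`<name>:<value>|<type>`, optionally followed by `|@<sample rate>`), sent over UDP. The metric names are hypothetical, and the host and port are assumptions (8125 is the conventional statsd port); in practice you would more likely use an existing client library for your language.

```python
import socket

# Metric types covered by this sketch: counter, gauge, timer (milliseconds).
SUPPORTED_TYPES = {"c", "g", "ms"}


def format_metric(name, value, metric_type, sample_rate=1.0):
    """Render one metric in the statsd line format."""
    if metric_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # Tells the server that only a fraction of events is being reported.
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names for a payment feature team.
    print(format_metric("payment.checkout.requests", 1, "c"))
    print(format_metric("payment.checkout.response_time", 320, "ms"))
```

Because the format is just a line of text over UDP, a Node.js service and a Java service can report into the same backend without sharing any code.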
Web-scale applications, like those of Facebook or Google, have been the unicorns of our industry. In reality, very few of us are actually building applications at the scale of a Twitter or a Facebook. However, microservice systems expose some similar complexity even for applications that are much smaller. What used to be a three-tiered architecture can today be a set of over 100 interacting services that are deployed in multiple versions in parallel and updated in a “chaotic” scheme.
This drives the need for much more automation in monitoring and monitoring tools that can visualize and communicate the state of a system that consists of a large number of entities.
Monitoring should always be accessible to everybody, especially in organizations that follow a DevOps approach. Still, in many organizations operations is assigned to a specific set of people who use dashboards that are highly streamlined to their needs. In a cloud-native environment every feature team needs to get access to monitoring data in a self-describing way. This obviously applies to metrics like response times, errors and load, and to log files. However, metrics should not stop there. A team might decide they need A/B testing-relevant information about feature use, or specifics about architectural decisions they have taken, like data caching strategies.
Monitoring interfaces need to be much more intuitive. Monitoring tools also need to come with pre-built configuration for a vast set of technologies as people lack the skills and time to configure a monitoring tool.
Anomaly detection is currently a hot topic in the operations space and there is a good reason for that. As deployments happen frequently and in parallel it becomes harder to detect changes in a system. The number of individual components is very high and even the detailed behavior of all components is not well understood.
Anomaly detection comes in handy as it provides a means to automatically learn how a system behaves and to detect and judge the significance of changes in that behavior.
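To make the idea of "learning how a system behaves" concrete, here is a deliberately simple Python sketch: it keeps a rolling window of recent values as the baseline and flags a new value when it lies more than a given number of standard deviations from the window's mean. This is one basic technique chosen for illustration, not how any particular monitoring product works; the class name, window size and threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Learns a baseline from recent values and flags large deviations."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # oldest values fall out automatically
        self.threshold = threshold          # allowed deviation in standard deviations
        self.min_samples = min_samples      # do not judge before this many values

    def observe(self, value):
        """Return True if the value is anomalous, then add it to the baseline."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # A perfectly flat baseline: any different value is a change.
                anomalous = value != mu
        self.values.append(value)
        return anomalous


if __name__ == "__main__":
    baseline = RollingBaseline()
    # Hypothetical response times in milliseconds, ending in a spike.
    for response_time in [100, 102, 98, 101, 99, 100, 103, 97, 500]:
        if baseline.observe(response_time):
            print(f"anomaly: {response_time} ms")
```

A real system would also need to handle seasonality (daily and weekly load patterns) and avoid learning from the anomalies themselves, which this sketch does not attempt.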
“A healthy mind lives in a healthy body.” For a long time this was also true for monitoring: if the infrastructure is not healthy, the application will suffer. While this is still true, systems now cope with unhealthy infrastructure very differently. Unhealthy nodes are simply killed and replaced by new ones on the fly. This makes systems resilient. Cloud-native applications come with automatic health check mechanisms, ranging from simply watching the CPU to a health URL that is checked at frequent intervals. If these checks fail, mechanisms like Autoscaling Groups or cluster management technologies like Mesos simply replace the faulty or unhealthy nodes with new ones.
Monitoring needs to be aware of this resilience. Infrastructure-focused monitoring simply cannot cope well with these environments. The focus of monitoring shifts significantly away from the infrastructure to the application level. CPU and memory metrics have very little relation to the health of an application. Usually, when systems run short of resources, auto-scaling will provide additional instances.
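The replace-on-failure behavior described here can be sketched as a small reconciliation step in Python. This is a conceptual illustration only, not the API of Autoscaling Groups or Mesos: the function name, the `is_healthy` and `launch` callbacks, and the three-strikes threshold are all assumptions.

```python
def reconcile(nodes, is_healthy, failures, launch, max_failures=3):
    """Run one health-check pass and return the resulting list of nodes.

    nodes        -- current node identifiers
    is_healthy   -- callback: node id -> bool (e.g. polls a health URL)
    failures     -- dict tracking consecutive failed checks per node
    launch       -- callback that starts a replacement and returns its id
    """
    result = []
    for node in nodes:
        if is_healthy(node):
            failures.pop(node, None)  # a passing check resets the counter
            result.append(node)
            continue
        failures[node] = failures.get(node, 0) + 1
        if failures[node] >= max_failures:
            # Do not repair the node: drop it and start a fresh one.
            failures.pop(node)
            result.append(launch())
        else:
            result.append(node)
    return result


if __name__ == "__main__":
    nodes = ["node-a", "node-b"]
    failures = {}
    for _ in range(3):
        nodes = reconcile(
            nodes,
            is_healthy=lambda n: n != "node-b",  # node-b always fails
            failures=failures,
            launch=lambda: "node-c",
        )
    print(nodes)
```

The point for monitoring is that a node identity such as `node-b` can vanish at any time, so alerts and dashboards keyed to individual hosts quickly go stale.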
Much more important are response time and error metrics at the application and individual service level. These metrics indicate whether a service is actually providing the service it is supposed to. APM tools are naturally built to provide this information as their primary focus was the application.
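As a sketch of what such service-level metrics look like, the following Python snippet derives an error rate and a 95th-percentile response time per service from a list of request records. The record fields, the service names, and the choice to count HTTP 5xx statuses as errors are assumptions for illustration; an APM tool would collect this data automatically.

```python
from dataclasses import dataclass


@dataclass
class Request:
    service: str
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    rank = (pct * len(sorted_values) + 99) // 100  # ceiling without floats
    return sorted_values[max(rank, 1) - 1]


def service_health(requests):
    """Group requests by service and compute error rate and p95 latency."""
    by_service = {}
    for request in requests:
        by_service.setdefault(request.service, []).append(request)

    report = {}
    for service, reqs in by_service.items():
        durations = sorted(r.duration_ms for r in reqs)
        errors = sum(1 for r in reqs if r.status >= 500)
        report[service] = {
            "requests": len(reqs),
            "error_rate": errors / len(reqs),
            "p95_ms": percentile(durations, 95),
        }
    return report


if __name__ == "__main__":
    # Hypothetical traffic for two feature teams' services.
    sample = [
        Request("search", 10, 200),
        Request("search", 20, 200),
        Request("search", 30, 200),
        Request("search", 40, 500),
        Request("payment", 120, 200),
    ]
    for name, stats in service_health(sample).items():
        print(name, stats)
```

Percentiles are used here rather than averages because a small share of very slow requests can hide behind a healthy-looking mean.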
This, however, does not mean that infrastructure metrics have lost their importance entirely. They provide a valuable source in root cause analysis and for capacity planning. Application health, on the other hand, is better derived from metrics that are related directly to the application.
Capacity planning used to be an exercise you had better get right the first time. Provisioning new hardware took a long time, while in the cloud it is a simple API call. While this was initially great for scaling up as needed, it also helps with scaling down. This provides a whole new application area for monitoring tools.
While monitoring tools were used to figure out whether too much compute power was used, they now need to report when there is too much capacity available. As cloud environments are paid for on a consumption-based model, shutting down a machine for a couple of hours yields immediate cost savings. We have seen this lead to CIOs wanting to see not only their usual SLA reports but also details on capacity usage to identify the potential for saving money.
These dynamic platforms need to be automated as well. This is where orchestration tools like Marathon, Mesos, and Kubernetes come into play. We have seen people using these tools and simply forgetting to monitor them. However, when they break, your environment not only can no longer run on autopilot but also becomes incredibly hard to manage manually.
Monitoring tools therefore need to be able to manage and monitor these components as well. While there are strong development communities behind these tools, they are still pretty young software, and people do not yet have a ton of experience managing them.
What does my system look like and can I draw it on a whiteboard? This is a key question for everybody running and operating large applications. Configuration management databases (CMDBs) tried to solve this problem by providing a central and single source of truth about configuration and deployment data. Reality, however, showed that they never really worked. The key problem was that this information had to be fed into these tools manually from many different sources. As a result, the data was often out of date because it was too hard to maintain. The requirement to understand complex applications and get an adequate picture of configurations and the deployment is still there. The task of collecting this data obviously becomes even more challenging when application components, infrastructure and the network are being changed at runtime. There is a big chance that you will not be able to keep your deployment configuration up to date.
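A capacity report of this kind can be sketched in a few lines of Python: given hourly CPU utilization per instance, count the hours in which an instance was essentially idle and multiply by its hourly price. The data structure, the 5% idle threshold, and the prices are assumptions for illustration; real billing models are usually more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceUsage:
    instance_id: str
    hourly_cost: float      # price per running hour (assumed currency: USD)
    hourly_cpu: List[float]  # average CPU utilization (0-100), one entry per hour


def idle_hours(usage, cpu_threshold=5.0):
    """Count hours in which the instance stayed below the idle threshold."""
    return sum(1 for cpu in usage.hourly_cpu if cpu < cpu_threshold)


def savings_report(fleet, cpu_threshold=5.0):
    """Map instance id -> cost that idle hours incurred (potential savings)."""
    report = {}
    for usage in fleet:
        hours = idle_hours(usage, cpu_threshold)
        if hours:
            report[usage.instance_id] = round(hours * usage.hourly_cost, 2)
    return report


if __name__ == "__main__":
    # Hypothetical fleet: one mostly idle instance, one busy instance.
    fleet = [
        InstanceUsage("batch-1", 0.50, [2, 3, 50, 1]),
        InstanceUsage("web-1", 0.25, [40, 55, 60, 45]),
    ]
    print(savings_report(fleet))
```

CPU is used here only as a capacity signal, which is where infrastructure metrics remain useful even though they say little about application health.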
Monitoring tools usually have a lot of this information available. They run on the OS level and also have insights into application behavior. As they are based on real-time data feeds, they can adjust to change without manual effort and therefore ensure an adequate picture of the application infrastructure. A handy visualization and an API to access this data programmatically provide easy access to this information for a variety of stakeholders. By adding CMDB-like functionality to the monitoring systems that collect the data, the manual data collection and maintenance efforts diminish. This, combined with the fact that the data is up to date, drives significantly higher acceptance of these systems as they start providing actual value by answering questions like “Which services are talking to a database and how did this change since the last deployment?”
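To illustrate how such a question could be answered from observed data, here is a minimal Python sketch. It assumes the monitoring system can export the observed service-to-service calls as `(caller, callee)` pairs for two points in time; the export format, function names and service names are hypothetical.

```python
def database_clients(calls, databases):
    """Return the set of services observed calling any of the given databases."""
    return {caller for caller, callee in calls if callee in databases}


def diff_database_clients(before, after, databases):
    """Compare database clients between two snapshots of observed calls."""
    old = database_clients(before, databases)
    new = database_clients(after, databases)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
        "unchanged": sorted(old & new),
    }


if __name__ == "__main__":
    databases = {"orders-db"}
    # Hypothetical call graphs observed before and after a deployment.
    before = [("checkout", "orders-db"), ("search", "catalog-api")]
    after = [
        ("checkout", "orders-db"),
        ("search", "catalog-api"),
        ("special-offers", "orders-db"),
    ]
    print(diff_database_clients(before, after, databases))
```

Since both snapshots come from observed traffic rather than manually maintained records, the answer reflects what is actually deployed.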
So, what does this mean for your monitoring tools? Most likely you don’t need to drop everything you have built. At the same time you need to start thinking about how to evolve your monitoring strategy as you adopt a cloud-native approach. As always monitoring should not be an afterthought.
Editor’s note: If you’re considering adopting a cloud-native approach and are looking for a detailed account of the container field, check out The State of Containers and the Docker Ecosystem: 2015 by Anna Gerber.
This post is a collaboration between O’Reilly and ruxit. See our statement of editorial independence.