Telemetry, an essential part of any cloud-native app

Why telemetry, a new factor in app development, can mean the difference between success and failure in the cloud.

By Kevin Hoffman
October 11, 2016
Mir above Earth (source: NASA via Wikimedia Commons)

In Beyond the Twelve-Factor App, I present a new set of guidelines that builds on Heroku’s original 12 factors and reflects today’s best practices for building cloud-native applications. I have reordered some of the original factors to indicate a deliberate sense of priority, and I have added factors such as telemetry, security, and the concept of “API first” that should be considerations for any application that will run in the cloud. These new 15-factor guidelines are:

  1. One codebase, one application
  2. API first
  3. Dependency management
  4. Design, build, release, and run
  5. Configuration, credentials, and code
  6. Logs
  7. Disposability
  8. Backing services
  9. Environment parity
  10. Administrative processes
  11. Port binding
  12. Stateless processes
  13. Concurrency
  14. Telemetry
  15. Authentication and authorization

The concept of telemetry is not among the original 12 factors. Telemetry’s dictionary definition implies the use of special equipment to take specific measurements of something and then to transmit those measurements elsewhere using radio. There is a connotation here of remoteness, distance, and intangibility to the source of the telemetry.

While I recommend using something a little more modern than radio, the use of telemetry should be an essential part of any cloud-native application.

What kind of telemetry?

Building applications on your workstation affords you luxuries you might not have in the cloud. You can inspect the inside of your application, execute a debugger, and perform hundreds of other tasks that give you visibility deep within your app and its behavior.

You don’t have this kind of direct access with a cloud application. Your app instance might move from the east coast of the United States to the west coast with little or no warning. You could start with one instance of your app, and a few minutes later, you might have hundreds of copies of your application running. These are all incredibly powerful, useful features, but they present an unfamiliar pattern for real-time application monitoring and telemetry.

Note: Treat apps like space probes

I like to think of pushing applications to the cloud as launching a scientific instrument into space.

If your creation is thousands of miles away, and you can’t physically touch it or bang it with a hammer to coerce it into behaving, what kind of telemetry would you want? What kind of data and remote controls would you need in order to feel comfortable letting your creation float freely in space?

When it comes to monitoring your application, there are generally a few different categories of data:

  • Application performance monitoring (APM)
  • Domain-specific telemetry
  • Health and system logs

The first of these, APM, consists of a stream of events that can be used by tools outside the cloud to keep tabs on how well your application is performing. This is something that you are responsible for, as the definition and watermarks of performance are specific to your application and standards. The data used to supply APM dashboards is usually fairly generic and can come from multiple applications across multiple lines of business.
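To make the APM stream concrete, here is a minimal Python sketch that times a request handler and writes one performance event per request as a line of JSON on standard output, on the assumption that the platform's log collector forwards that output to an external APM tool. The event fields (`type`, `route`, `status`, `duration_ms`) and the helpers `emit_apm_event` and `timed` are hypothetical, not any vendor's schema or API.

```python
import json
import sys
import time
from typing import Callable, Optional, TextIO


def emit_apm_event(route: str, status: int, duration_ms: float,
                   out: Optional[TextIO] = None) -> dict:
    """Write one performance event as a single line of JSON.

    The schema is hypothetical; shape it to whatever your APM tool ingests.
    """
    event = {
        "type": "apm.request",     # hypothetical event label
        "timestamp": time.time(),  # wall-clock seconds since the epoch
        "route": route,
        "status": status,
        "duration_ms": round(duration_ms, 3),
    }
    # Resolve stdout at call time so output redirection is respected.
    stream = out if out is not None else sys.stdout
    stream.write(json.dumps(event) + "\n")
    stream.flush()
    return event


def timed(route: str, handler: Callable[[], int]) -> int:
    """Run a handler that returns an HTTP status, timing it and emitting an event.

    If the handler raises, the event is still emitted with status 500 and the
    exception propagates to the caller.
    """
    start = time.perf_counter()
    status = 500  # assume failure unless the handler returns normally
    try:
        status = handler()
        return status
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        emit_apm_event(route, status, elapsed_ms)


if __name__ == "__main__":
    timed("/widgets", lambda: 200)  # simulated successful request
```

Keeping each event generic (route, status, latency) is what allows one dashboard to aggregate performance across many applications.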

The second, domain-specific telemetry, is also up to you. This refers to the stream of events and data that is meaningful to your business and that you can use for your own analytics and reporting. This type of event stream is often fed into a “big data” system for warehousing, analysis, and forecasting.
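A domain-specific event, by contrast, carries business meaning rather than performance data. The sketch below assumes a hypothetical `WidgetSold` event and publishes it as newline-delimited JSON, a format many analytics pipelines can ingest. The field names, the `widget.sold` label, and the use of standard output as the sink are all illustrative assumptions; a real deployment might publish to a message broker that feeds the warehouse.

```python
import json
import sys
import time
from dataclasses import asdict, dataclass, field
from typing import Optional, TextIO


@dataclass(frozen=True)
class WidgetSold:
    """A hypothetical business event: one widget order completed."""
    order_id: str
    quantity: int
    unit_price_cents: int
    device: str                      # e.g. "iPad", derived from the client's user agent
    event_type: str = "widget.sold"  # hypothetical label for downstream routing
    timestamp: float = field(default_factory=time.time)


def publish(event: WidgetSold, out: Optional[TextIO] = None) -> str:
    """Serialize a domain event as one line of JSON and write it to the sink.

    Standard output is an assumed sink for this sketch.
    """
    line = json.dumps(asdict(event), sort_keys=True)
    (out if out is not None else sys.stdout).write(line + "\n")
    return line
```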

The difference between APM and domain-specific telemetry may not be immediately obvious. Think of it this way: APM might provide you the average number of HTTP requests per second an application is processing, while domain-specific telemetry might tell you the number of widgets sold to people on iPads within the last 20 minutes.
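The widget example can be worked through in code. This sketch assumes both kinds of events have already been collected into one in-memory list (the `Event` type and its `http.request` and `widget.sold` labels are hypothetical) and derives both views from it: the APM-style average request rate and the domain-specific count of widgets sold on iPads in the last 20 minutes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    timestamp: float   # seconds since the epoch
    kind: str          # "http.request" or "widget.sold" (hypothetical labels)
    device: str = ""   # only meaningful for widget.sold events
    quantity: int = 0  # only meaningful for widget.sold events


def avg_requests_per_second(events: Iterable[Event], start: float, end: float) -> float:
    """APM view: mean HTTP request rate over the interval [start, end)."""
    if end <= start:
        raise ValueError("end must be after start")
    count = sum(1 for e in events
                if e.kind == "http.request" and start <= e.timestamp < end)
    return count / (end - start)


def widgets_sold_on_device(events: Iterable[Event], device: str, now: float,
                           window_s: float = 20 * 60) -> int:
    """Domain view: widgets sold to one device type within the last window_s seconds."""
    return sum(e.quantity for e in events
               if e.kind == "widget.sold" and e.device == device
               and now - window_s <= e.timestamp <= now)


events = [
    Event(0.0, "http.request"),
    Event(1.0, "http.request"),
    Event(2.5, "http.request"),
    Event(3.0, "widget.sold", device="iPad", quantity=3),
    Event(3.5, "widget.sold", device="Android", quantity=1),
]
print(avg_requests_per_second(events, start=0.0, end=3.0))  # 1.0
print(widgets_sold_on_device(events, "iPad", now=10.0))     # 3
```

Both functions read the same raw stream but answer different questions: one describes how the system performs, the other what the business is doing.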

Finally, health and system logs are something that should be provided by your cloud provider. They make up a stream of events, such as application start, shutdown, scaling, web request tracing, and the results of periodic health checks.

Planning your monitoring strategy

The cloud makes many things easy, but monitoring and telemetry are still difficult, probably even more difficult than traditional enterprise application monitoring. When you are staring down the fire hose at a stream that contains regular health checks, request audits, business-level events, tracking data, and performance metrics, that is an incredible amount of data.

When planning your monitoring strategy, you need to take into account how much information you’ll be aggregating, the rate at which it comes in, and how much of it you’re going to store. If your application dynamically scales from 1 instance to 100, that can also result in a hundredfold increase in your log traffic.
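The planning arithmetic is simple enough to sketch. The function below multiplies instance count, per-instance event rate, and average event size to estimate daily log volume, then scales by a retention period. The figures in the example (50 events per second per instance, 500 bytes per event, 30-day retention) are hypothetical placeholders, not measurements. Because volume is linear in instance count, scaling from 1 instance to 100 shows up directly as a hundredfold increase.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_log_volume_gb(instances: int,
                        events_per_sec_per_instance: float,
                        bytes_per_event: float) -> float:
    """Estimate log volume per day in gigabytes (1 GB = 10**9 bytes).

    The result is linear in each input, so 100 instances produce
    100 times the volume of 1 instance.
    """
    total_bytes = (instances * events_per_sec_per_instance
                   * bytes_per_event * SECONDS_PER_DAY)
    return total_bytes / 1e9


def retained_volume_gb(daily_gb: float, retention_days: int) -> float:
    """Storage needed to keep retention_days worth of logs."""
    return daily_gb * retention_days


# Hypothetical workload figures, for illustration only.
for n in (1, 100):
    per_day = daily_log_volume_gb(n, events_per_sec_per_instance=50, bytes_per_event=500)
    kept = retained_volume_gb(per_day, retention_days=30)
    print(f"{n:>3} instance(s): {per_day:.2f} GB/day, {kept:,.1f} GB kept for 30 days")
```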

Auditing and monitoring cloud applications are often overlooked but are perhaps some of the most important things to plan and do properly for production deployments. If you wouldn’t blindly launch a satellite into orbit with no way to monitor it, you shouldn’t do the same to your cloud application.

Getting telemetry done right can mean the difference between success and failure in the cloud.
