Google infrastructure for everyone else
Five questions for Björn Rabenstein: Insights on Kubernetes, Prometheus, and more.
I recently sat down with Björn Rabenstein, production engineer at SoundCloud and co-developer of Prometheus, to discuss Google-inspired tools and why Kubernetes and Prometheus work so well together. Here are some highlights from our talk.
Intriguing, isn’t it? The match is so good that people sometimes claim Prometheus was specifically created to monitor Kubernetes clusters. However, they were developed completely independently. Only in early 2015 did Prometheans and Kubernauts meet for the first time to connect the dots.
In part, it’s a case of convergent evolution. My current employer, SoundCloud, where most of the initial Prometheus development happened, needed a container orchestration solution for its growing microservice architecture, and it needed to monitor all of that. No ready-made solutions were available at the time, so both had to be built in-house. Our in-house container orchestration solution was obviously pre-Kubernetes and even pre-Docker. In view of recent developments, we have deprecated it and are migrating to a Kubernetes setup. However, monitoring our old container orchestration system and the services running on top of it is structurally very similar to monitoring Kubernetes and the services on it, so only a few steps were needed to arrive at a native integration of Kubernetes and Prometheus.
Another part of the answer to your question is what I call the shared spiritual ancestry of both, or, as a colleague phrased it: “They are twins separated at birth.” Kubernetes directly builds on Google’s decade-long experience with their own cluster scheduling system, Borg. Prometheus’s bonds to Google are way looser but it draws a lot of inspiration from Borgmon, the internal monitoring system Google came up with at about the same time as Borg. In a very sloppy comparison, you could say that Kubernetes is Borg for mere mortals, while Prometheus is Borgmon for mere mortals. Both are “second systems” trying to iterate on the good parts while avoiding the mistakes and dead ends of their ancestors.
And by the way, Kubernetes and Prometheus are both Greek 10-letter words. But that is pure coincidence.
My talk at Velocity Amsterdam will cover the technical aspects of pairing Prometheus with Kubernetes in more detail. Stay tuned.
When I joined Google more than 10 years ago, it was like entering a science fiction movie, so fundamentally different was most of the technology used. The only glimpse I had taken in advance was reading the papers on MapReduce and GFS. While Google engineers certainly still live far in the future, I daresay the technological differences from “normal” mid-size tech companies are much less fundamental these days. The existence of the many Google-inspired tools and technologies out there is, in a way, both a cause and a consequence of that development.
A growing number of projects were just directly open-sourced by Google, with Kubernetes being a prime example. You can learn a lot about how Google works internally from technologies like protocol buffers and gRPC, and even from the Bazel build system or from the Go programming language.
Another source of inspiration is the many whitepapers Googlers have published (like the aforementioned GFS and MapReduce papers), or simply the brains of ex-Googlers who miss the awesome infrastructure after leaving the company. (Rumors that Google uses Men-in-Black-style neuralyzers during offboarding are greatly exaggerated.) In that way, we got the whole Hadoop ecosystem, various other implementations of key-value stores with BigTable semantics, Zipkin and OpenTracing, and CockroachDB. In the metrics space, Prometheus deserves less credit than you might think. I see the data model used by Prometheus as the first principle from which everything else follows, and that data model was brought to the world by OpenTSDB as early as 2010 (by a former teammate of mine). Prometheus “merely” added an expression language to act on the data model for graphing and alerting, and a collection path so that metrics reliably find their way from their source in the monitored targets into your monitoring system.
I’m sure I have forgotten many other tools that have deserved to be mentioned here.
The most important paradigm shift is from hosts to services. Not only are single-host failures increasingly likely in large distributed systems, but those systems are explicitly designed to tolerate single-host failures. Waking somebody up in the middle of the night because a single host has stopped pinging is not sustainable. Your monitoring system has to be able to view the service as a whole and alert on actual impact on the user experience. In the words of Rob Ewaschuk, symptom-based alerting is preferred over cause-based alerting. At the same time, you still need to be able to drill down to explore causes. Jamie Wilkinson nailed that pretty much in one sentence in Google’s SRE book: “We need monitoring systems that allow us to alert for high-level service objectives, but retain the granularity to inspect individual components as needed.”
Traditional monitoring is pretty much host-based, with Nagios being the proverbial example. While you can try to bend your traditional monitoring system toward more modern ideas, you will eventually bang your head against a wall.
A nice example of a post-Nagios approach toward service-based monitoring is StatsD. It was a huge step forward. It has certain issues in the details of its design and implementation, but most importantly, it misses out on the second half of Jamie’s sentence above: how do I go back to inspecting individual components?
I’ll take that as a segue to the Prometheus data model, which I briefly touched on before. In short: everything is labeled (as in Kubernetes—hint!). Instead of a hierarchical organization of metrics, as you might know it from a typical Graphite setup, labeled metrics in combination with the Prometheus expression language allow you to slice and dice along the labeled dimensions at will. Metrics collection happens at a very basic level; aggregation and other logical processing happen on the Prometheus server and can be done ad hoc, should you find yourself wanting to view your metrics from different angles than before. Most commonly, that happens during an outage, when there is really no time to reconfigure your monitoring and wait for the newly structured data to come in.
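To make the contrast concrete, here is a minimal Python sketch (illustrative only, not the real Prometheus client or PromQL engine) of what a labeled data model buys you: the same samples can be aggregated along any label dimension after the fact, with no reconfiguration, whereas a hierarchical name like `prod.api.GET.200.requests` locks in one fixed drill-down order.

```python
# Each sample is a metric name plus arbitrary key/value labels,
# instead of a fixed hierarchical dotted name.
samples = [
    ({"__name__": "http_requests_total", "job": "api", "method": "GET",  "status": "200"}, 1024),
    ({"__name__": "http_requests_total", "job": "api", "method": "GET",  "status": "500"}, 12),
    ({"__name__": "http_requests_total", "job": "api", "method": "POST", "status": "200"}, 303),
    ({"__name__": "http_requests_total", "job": "web", "method": "GET",  "status": "200"}, 77),
]

def sum_by(samples, *dims):
    """Ad-hoc aggregation along chosen label dimensions, roughly what
    `sum by (status) (http_requests_total)` does in PromQL."""
    out = {}
    for labels, value in samples:
        key = tuple(labels.get(d, "") for d in dims)
        out[key] = out.get(key, 0) + value
    return out

# Slice by status across all jobs and methods...
print(sum_by(samples, "status"))  # {('200',): 1404, ('500',): 12}
# ...then re-slice the very same data by job, ad hoc.
print(sum_by(samples, "job"))     # {('api',): 1339, ('web',): 77}
```

The point is that the aggregation dimension is chosen at query time, on the server, which is exactly what you need when an outage forces you to look at the data from an angle you did not anticipate.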
Finally, you really need white-box monitoring, ideally by instrumenting your code. While the idea of black-box probing has many merits, and you should definitely have a moderate amount of black-box probing in your monitoring mix, it is not sufficient for a whole lot of reasons. Fortunately, instrumentation for Prometheus is fairly easy. Very little logic and state is required on the side of the monitored binary. As mentioned above, the logic (like calculating query rates or latency percentiles) happens later on the Prometheus server.
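As a sketch of how little the monitored binary has to do (hypothetical code, not the official Prometheus client library): the instrumented process only keeps a monotonically increasing counter, and the logic, such as computing a request rate, happens later on the server from successive scrapes.

```python
import threading

class Counter:
    """Minimal white-box instrumentation sketch: the monitored binary
    holds nothing but a monotonically increasing count."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount=1):
        with self._lock:
            self._value += amount

    def value(self):
        with self._lock:
            return self._value

requests_total = Counter()

def handle_request():
    requests_total.inc()  # one line of instrumentation per event

def rate(v0, t0, v1, t1):
    """Server-side: derive a per-second rate from two scraped values,
    roughly what PromQL's rate(requests_total[5m]) computes."""
    return (v1 - v0) / (t1 - t0)

for _ in range(30):
    handle_request()

# 30 requests observed across a 10-second scrape interval:
print(rate(0, 0.0, requests_total.value(), 10.0))  # 3.0 requests/second
```

Keeping the client this dumb is what makes instrumentation cheap to add everywhere, and it keeps the expensive or reconfigurable logic in one place: the monitoring server.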
This is a fascinating topic. SoundCloud has not much more than 100 engineers; Google has more than 10,000, and I would guess that similar ratios apply to the traffic served by each company. We are talking about roughly two orders of magnitude difference in scale. That’s huge. Still, SoundCloud, like many other similar companies, is big enough to venture into an area where many lessons can be learned from giants like Google. You will certainly need to translate those lessons quite a bit, but the basic ideas are nevertheless applicable. Google’s book about Site Reliability Engineering is a great source of wisdom, and the whole GIFEE thing has made it much simpler to put ideas from the book into concrete action. Still, you should resist the temptation to blindly copy “how Google would do it.” You need to carefully consider what’s different in your organization, in terms of scale, infrastructure, workflows, culture, and so on, and then decide how the underlying principle translates into concrete actions within your own framework. GIFEE gives you the tools; it’s up to you to use them in ways appropriate for your organization. In case you like anecdotal evidence: it worked out quite nicely for SoundCloud.
Glad you asked, because it made me check out the whole schedule and plan where to go in advance. On the other hand, as a true Promethean, I really hate one-dimensional metrics and picking just a few presentations from so many that are interesting for various reasons.
With event logging being another important pillar of monitoring (besides metrics, which is what Prometheus is for), I really look forward to the two talks about using Elasticsearch for monitoring. Another topic close to my monitoring heart is a good understanding of what anomaly detection can and cannot accomplish, and I hope that Peter Buteneers’s hateful love story will cover that. “Unsucking your on-call experience” and “Breaking apart a monolithic system safely without destroying your team” are ventures SoundCloud has gone through, too, and I’m curious about the experiences of other organizations. Finally, I always enjoy Astrid Atkinson’s presentations very much. I have fond memories of the very impressive onboarding session she gave to my group of Nooglers (i.e., new Google employees) back in 2006. She is a great role model.