This book is about monitoring with Graphite. My goal is not just to teach you how to monitor “better,” but to show you how to store and retrieve data in a way that helps us manage risk, predict capacity shortfalls, and align our IT strategy with the overarching business needs of the organization. Monitoring with Graphite trains us to answer more pertinent questions than, “Is my server alive?” We should be able to quickly ascertain: “Are customers suffering a degraded experience?”, “How has our sales team performed over the last quarter?”, and “Do we have enough capacity to absorb a Reddit front-page posting?”
Virtually every activity we perform can be measured and analyzed. Weather forecasters use climatic data to predict changes in temperature, barometric pressure, and wind speed. Sports statisticians analyze trends by players and teams to identify the most productive athletes and most effective coaching strategies. Fans of the show Adventure Time with Finn and Jake record the number of times that Lumpy Space Princess utters “oh my Glob” in a single season. And most important to us (I hope), software engineers and systems administrators use telemetry—recorded measurements—to gauge the effectiveness of their system designs and deployments.
The type of questions we ask of our data influences the way we store, retrieve, and interact with it. Spreadsheets created with Excel; rich, structured data logged to Hadoop; and time-series metrics saved in Graphite all have strengths and weaknesses, but each was designed with a specific goal or workflow in mind. As practitioners working with this information, we need to understand the use cases behind each tool in order to maximize our effectiveness with our data.
In this first chapter we’re going to get a quick primer on time-series data before diving into the history and goals of the Graphite project. With this background we can better understand what differentiates it from other monitoring and trending software applications, and determine whether it’s a good fit for your organization.
If you’re already familiar with the topics I’ve mentioned, you might want to skip ahead to Chapter 2. Although some of the topics in that chapter may be refreshers for some readers, many of the underlying principles are founded on years of personal experience building large-scale telemetry systems for companies like Heroku, GitHub, and Dyn. To get the most out of this book and my presentation of the materials within, I caution you against jumping too far ahead.
With that caveat out of the way, let’s dive in.
Finn the Human
We interact with time-series data virtually every day of our lives, even if we don’t immediately recognize it as such. Weather reports, financial tickers, and even the audio visualizer on SoundCloud use time series to translate the things we care about, such as personal investments or the chance of rain at lunch, into an easily understood visual format that we can absorb quickly and easily.
But what is time-series data, really? Nothing more than a sequence of values collected at regular, uniformly spaced intervals. If I asked you to step outside every hour and give me your best estimate of the current temperature in degrees Fahrenheit, the act of recording those numbers in a computer (or even on a sheet of paper) would constitute time-series data collection. Certainly this isn’t the most scientific method for tracking our local weather patterns, but it’s not about the accuracy of the data (in our case, anyway) so much as the fact that we’re performing the experiment regularly and uniformly.
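The hourly temperature experiment can be sketched as a tiny shell script. Everything here is illustrative: `read_temp` stands in for however you actually obtain a reading (a sensor, a weather API, or stepping outside with a thermometer), and `temperatures.log` is just a flat file of “timestamp value” pairs.

```shell
#!/bin/sh
# Hypothetical sensor: always reports 68°F. Swap in a real probe or API call.
read_temp() { echo 68; }

# Append one observation as "epoch-timestamp value" -- the simplest possible
# time-series record.
echo "$(date +%s) $(read_temp)" >> temperatures.log

tail -1 temperatures.log
```

Running it from cron once an hour (`0 * * * *`) would give you the regular, uniformly spaced intervals that make the file a true time series.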
Depending on the sort of work you’re tasked with, your data may align well with traditional time-series storage engines or perhaps even a specialized analytical database system. From my own personal experience, data is data; the differences lie in the sort of questions you wish to ask of your data.
System administrators and web engineers are generally more interested in the rate of change and seasonality, in order to quantify the health or Quality of Service (QoS) of a particular application or host. They want to know when their services are degraded or might be trending towards a bad state.
Analysts, on the other hand, are typically looking for trends in user behavior. They are often more focused on the distribution of a subset of data or metadata. Unique events carry special meaning for them, so the ability to correlate these events using tags (or labels) allows analysts to classify users and their behavior in such a way that pertinent questions can be asked of seemingly random data. But since you’re reading this book, I’m going to assume your current job (or hobby, I won’t judge you) relates more to the type of questions that are best answered with time-series data and tools like Graphite.
Put simply, a time-series database is a software system that’s optimized for the storage and retrieval of time-series data. And although nothing else I say throughout the rest of this book will change that, I think it’s useful to talk at greater length about some of the low-hanging fruit in terms of TSDB (a popular acronym for time-series databases, and one I’ll use repeatedly throughout this tome) performance and maintenance, since this is often a major factor when evaluating software trending systems like Graphite.
It’s not that the average user should need to think about or interact directly with these data stores under normal operation, but without a database tuned and optimized to handle the workload of a high-performance data storage and retrieval system, the user experience will be dreadful. Therefore, it’s important that we start with at least a nominal understanding of TSDB performance patterns. Your future self will thank me.
There’s been an explosion in the number of open source projects and commercial monitoring products over the last few years. Advancements in solid-state drive (SSD) storage and the persistence of Moore’s Law have made it technically and financially feasible to collect, store, retrieve, correlate, and visualize huge amounts of real-time monitoring data with commodity systems (read: the Cloud). Distributed database systems (particularly NoSQL), themselves largely driven by competition in the pursuit of Big Data analytics, have made it easier than ever before to scale out horizontally and add capacity as demand necessitates.
Why do I place such an emphasis on storage? Because without fast storage (and a lot of it), we wouldn’t be able to persist all of this wonderful data to disk for later retrieval. Of course, it helps to understand what I mean by “fast storage.” Frankly, it’s not that hard to write to disk quickly, nor is it difficult to read from disk quickly. Where the challenge arises is in doing both at the same time.
In terms of operating system design (the software that powers your computer), the kernel is the brains of the operation. It’s tasked with a huge variety of administrative duties, from managing available memory to prioritizing tasks to funneling data through the network and storage interfaces. Depending on the conditions involved and configuration applied, the kernel must make compromises to handle its workload as efficiently as possible. But how does this apply to us?
Remember that scene from the movie Office Space where the boss walks over to his employee’s cube and asks him to work over the weekend? Let’s pretend we’re the boss (“Lumbergh”) and the kernel is the employee (“Peter”).
Lumbergh: Hello Peter, what’s happening? Ummm, I’m gonna need you to go ahead and process these 20 million disk writes as soon as possible. So if you could have those done in 15 milliseconds that would be great, mmmkay?
Peter: [sullen acceptance]
Lumbergh: Oh, oh, and I almost forgot. Ahhh, I’m also going to need you to return a composite query for the 95th percentile of all derived metrics, immediately. Kay? Uh, and here’s another 30 million disk writes. I can see you’re still working on those other 20 million writes, so we need to sort of play catch-up. Kay?
Peter: [muffled sobbing]
This is a fictional interaction, but the demands we’re placing on the kernel (and filesystems) are all too real. Writing to and reading from disk (also known as input and output, or simply I/O) are expensive operations if you’re trying to do a lot of one and any of the other at any point in time.
A popular technique for optimizing writes is to buffer them in memory. This is fine as long as you have enough memory, but you need to flush these to persistent disk at routine intervals or risk losing your data in the event of a system failure.
On the other hand, it’s very effective to use in-memory caches to keep “hot copies” of data for queries (reads). Again, this is an effective approach as long as you have enough memory and you expire your cached answers frequently enough to ensure accurate results for your users.
Graphite uses both of these techniques to aid the performance of its underlying time-series database components:
Carbon: A network service that listens for inbound metric submissions. It stores the metrics temporarily in a memory buffer-cache for a brief period before flushing them to disk in the Whisper database format.
Whisper: The specification of the database file layout, including metadata and rollup archives, as well as the programming library that both Carbon and the Graphite web application use to interact with the respective database files.
We’ll touch on both of these components in greater depth later on. For now, it’s enough to understand their respective roles in the Graphite architecture.
The ability to store and retrieve time-series data quickly—and at high volume—is key to the success of Graphite. Without the ability to scale, it would be impractical to use Graphite for anything besides small teams and hobby projects. Thanks to the design of Carbon and Whisper, we can build significant clusters capable of processing many millions of datapoints per second, making it a suitable visualization tool for virtually any scenario where time-series analysis is needed.
Once upon a time, years even before the invention of Nagios, a Swiss man named Tobias Oetiker worked at De Montfort University in Leicester, UK. Looking for a method to track network activity on their lone internet connection, Tobias developed a small Perl script for monitoring traffic levels on their network router. Querying SNMP interface statistics every five minutes, it would use this data to generate a series of graphs detailing current and past network levels.
This tool came to be known as the Multi Router Traffic Grapher (MRTG). If you worked for an internet service provider or telecommunications company in the 1990s or 2000s, there’s a good chance you were exposed to this tool or its successor, RRDtool (the round-robin database tool). I was fortunate to work for early internet companies like Digex and Cidera where it became ubiquitous, and it almost certainly set the foundation for my early interest in monitoring practices and technologies.
Years later, Chris Davis, an engineer at Orbitz, the online travel agency, began piecemeal development on the components that would later become known as Graphite. The rendering engine began like so many other great hacks: he just wanted to see if he could build one. It was designed to read RRD files and render graphs using a URL-based API.
Whisper, the database library, was a desperate attempt to fix an urgent bug with RRD before a critical launch date. At the time, RRD was unable to accept out-of-sequence data; if metric A with a timestamp of 09:05:00 arrived after metric B with a timestamp of 09:05:30, the former metric would be completely discarded by RRD (this was later redesigned). Whisper addressed this design shortcoming specifically, and at the same time, drastically simplified the configuration and layout of retention periods within each database file.
The Carbon service marked the introduction of a simple network interface abstraction on top of Whisper that enabled anyone with a computer to submit metrics easily. It evolved to include carbon-cache, an in-memory buffer cache and query handler that addresses both performance and the need for real-time query results, along with two companion daemons:
carbon-relay, a daemon capable of load-balancing or replicating metrics across a pool of carbon-cache processes; and
carbon-aggregator, built to aggregate individual metrics into new, composite metrics.
By 2008, Chris was allowed to release Graphite as open source. Before long, it was mentioned in a CNET article, and then on Slashdot, at which point adoption took off. In the years since, an entire cottage industry has built up around the Graphite API, and countless businesses rely on it as their primary graphing system for operational, engineering, and business metrics.
While Graphite continues to evolve and add new features routinely, much of its success stems from its adherence to simple interfaces and formats, resulting in a shallow learning curve for new users. Metrics are easy to push to Graphite, whether from scripts, applications, or a command-line terminal. Graphs are crafted from the URL-based API, allowing them to be easily embedded in websites or dashboards. The web interface empowers users to prototype graphs quickly and immediately, with virtually no training or formal instruction necessary. And a huge community of users from a diverse range of backgrounds and industries means that new and unique rendering functions are always being contributed back upstream to the project.
Arguably one of Graphite’s most important attributes, the metrics format for submitting data to the Carbon listener, is beautifully simple. All that’s required is a dot-delimited metric name (e.g., "foo.bar"), a numeric value, and an epoch timestamp. Instrumenting your application to send telemetry can be done in a couple of short lines of code or, if you’re impatient, with the help of some basic UNIX commands in your shell terminal, as shown in Example 1-1.
$ echo "foo.bar 41 `date +%s`" | nc graphite-server.example.com 2003
In this example, we’re using the echo command to print the metric string "foo.bar" along with my favorite prime number and an epoch timestamp generated by the date command. The output is piped into netcat (nc),1 a little network utility that connects to our imaginary Carbon service at graphite-server.example.com on TCP port 2003 and sends our data string. The trailing newline character provided by echo notifies the Carbon server that we’re finished and the connection can be closed.
An epoch timestamp represents the number of seconds that have passed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not including leap seconds. This might be useful the next time your UNIX-friendly significant other asks you for the epoch time on Valentine’s Day 2014 (1392354000).
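With GNU date you can convert in both directions and confirm that value yourself: midnight Eastern time on Valentine’s Day 2014 is 05:00 UTC (the -d flag shown here is GNU-specific; BSD date uses different options).

```shell
# Seconds since 1970-01-01 00:00:00 UTC for 2014-02-14 05:00 UTC
date -u -d '2014-02-14 05:00:00' +%s      # prints 1392354000

# And back again, from epoch seconds to a calendar date
date -u -d @1392354000 +'%Y-%m-%d %H:%M'  # prints 2014-02-14 05:00
```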
Unlike its predecessors and many of its contemporaries, Graphite doesn’t rely on static configurations or batch jobs to create new graphs. All of its data rendering—from its traditional PNG charts found in the web interface to the JSON output used by client-side libraries to compose elaborate dashboards and information graphics—is constructed on-the-fly using its comprehensive API.
Every single graphing feature available to Graphite users is exposed via this API, precisely because its web UI consumes the very same interface. For Graphite developers and users, this is the ultimate case of “eating one’s own dog food.”
Of course, this makes sharing graphs with anyone as easy as sharing the URL that constructed it. Even better, that same URL can be embedded in your company dashboards or even a website. As long as the person loading the graph has network access (and in the case of password protection, the necessary credentials) to the web server running your Graphite application, she should be able to see exactly the same data you’re viewing.
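As a sketch of what such a shareable URL looks like (the hostname is the same imaginary server from Example 1-1, and foo.bar a hypothetical metric), the entire graph is described by its query string, and swapping the format parameter swaps the output type:

```shell
GRAPHITE='http://graphite-server.example.com'

# One URL, many renderings: format=png for a dashboard image, format=json
# for a client-side library, format=csv for a spreadsheet.
URL="$GRAPHITE/render?target=foo.bar&from=-1h&format=json"
echo "$URL"
# curl -s "$URL"   # run this against a real Graphite server to fetch data
```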
Thanks to the popularity and stability of Graphite’s API, a huge ecosystem of third-party tools and services has grown up around it. In fact, not only do a wide variety of applications consume the Graphite API, but some projects have even developed backend service adapters to enable Graphite to speak to their own storage systems.
As fantastic as the Graphite API is (and it really is), most users’ first encounter with Graphite is the web interface, and more specifically, the Composer (Figure 1-1). While nobody’s going to confuse its design with something released by Apple and Jony Ive, it does a great job where it matters: navigating metrics and saved graphs, adding metrics to and removing them from graphs, applying statistical transformations, and exposing the full breadth of the rendering API.
When talking about everything the Composer is capable of, it’s easy to get lost in the woods and lose track of our mission: visualizing and correlating our data. Fortunately, the utilitarian nature of the Composer’s interface makes it perfectly suited to doing just that. Navigating the Metrics tree of nested folders and metric names is a familiar experience. Clicking on a metric name adds its data series to the Graphite Composer window frame in the center of the screen. Transformative functions are easily applied from the Graph Data dialog window, resulting in new and exciting ways of interpreting the data. Switching from a line chart to an area chart is one click away in the graph options menu.
Best of all, the graph automatically refreshes after every action to reveal the new arrangement. Feedback is instantaneous and intuitive. Graph URLs can be easily copied and shared with your peers. Those same people can make adjustments and pass back the new URLs, which are just as easily loaded back into your Composer again. Before long, you’ve mastered a workflow that empowers you to rapidly isolate anomalous behavior and identify causal relationships in a manner lacking in most traditional monitoring and trending systems.
But perhaps more than anything else, what really distinguishes Graphite from every other commercial or open source monitoring system out there is its exhaustive library of statistical and transformative rendering functions. As of version 0.9.15, there are 88 documented functions used to transform, combine, or perform statistical computations on series data using the render API.
You probably won’t be surprised to learn that Graphite includes primitives to help identify the min, max, or mean of a series. You might even be pleased to see that it can calculate the percentiles and standard deviation of your series. But I think you’ll be truly impressed with its breadth of coverage when you discover its support for various Holt-Winters forecasting algorithms and virtually every sorting and filtering criteria you could imagine.
How is this possible? The Graphite community is wonderfully large and varied. Users represent a number of different industries and vocations: software engineers track application performance and profiling data, system administrators feed in server data to track resource utilization, marketers monitor user activity and campaign results, and business executives correlate quarterly results with sales numbers and operational expenses. At one time or another, every one of Graphite’s rendering functions was contributed by someone who discovered a new use case or algorithm for their task, developed a function to process the data as required, and then submitted this enhancement back to the Graphite project in typical OSS fashion.
Graphite treats each series on a chart as a stream of data. The original raw data can be passed into a rendering function, and the output of that function can then be passed on to another function—lather, rinse, repeat! Once it has determined there are no additional transformations, Graphite passes the series on to your choice of rendering formats (graph image, CSV, JSON response, and so on).
Some time ago, I worked on the operations team at Heroku, managing a huge fleet of Amazon Elastic Compute Cloud (EC2) instances in production. As time went on, we began to notice an abnormally high rate of system failures for a particular type of instance. Fortunately, we were able to instrument our platform services to fire off a Graphite metric (ec2.instances._hostname_.killed) every time one of these instances became unresponsive, forcing us to terminate and replace it in the cluster. The metric was preconfigured with a resolution of 15 minutes for up to one year.
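A retention policy like that lives in Carbon’s storage-schemas.conf file. The section name below is illustrative, but the format is Carbon’s own: a regex pattern to match metric names, and a precision:duration pair meaning 15-minute datapoints kept for one year.

```ini
[ec2_instance_deaths]
pattern = ^ec2\.instances\.
retentions = 15m:1y
```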
By chaining together some common render functions, we were able to construct a target query string that counted the number of failed instances and summarized the results into 24-hour buckets. Most render functions accept one or more series (or a wildcard series) and one or more arguments within parentheses. Each subsequent function wraps around the one preceding it, resulting in a nested sequence of input and output. The end result would have looked something like the following query:

summarize(sumSeries(ec2.instances.*.killed), "1d")
In this example, we used the sumSeries() function to aggregate all of the matching metric series described by our wildcard metric string, ec2.instances.*.killed. This returned a single series representing the sum of all terminated instances. However, our Amazon representative asked us to report our counts as a daily total, so we took that result and fed it into the summarize() function (with the "1d" interval argument) to group our results into 24-hour resolution windows (Figure 1-2).
There’s a seemingly infinite number of tasks Graphite is well suited to handle. Because Graphite makes it so easy to store and retrieve virtually any type of numeric data, you’re almost as likely to find chief executives or business development analysts using it as you would system administrators and web developers. But rather than take my word for it, let’s hear from a few businesses that use Graphite to power everything from application and security monitoring to business decisions.
Like Orbitz, the company behind Graphite, Booking.com is one of the Internet’s busiest online travel agencies. They also happen to run one of the world’s largest Graphite installations. According to Brad Lhotsky, systems security team lead, all of Booking.com’s primary system, network, and application performance data is stored in Graphite, alongside segments of the company’s security and business intelligence:
We track both technical and business metrics in Graphite. It allows us to understand the business costs of outages or performance degradations easily. It also allows us to correlate business trends with system-, network-, and application-level trends. This leads to more informed decision-making by the technical people and confidence in the technical teams by the business folks.
The scope of the business-level monitoring we do with Graphite is significant. Most technical decisions are made only once business value is understood. Graphite provides a convenient way for technical and nontechnical people to access both the technical and business data in an easy-to-understand way.
Graphite provides a simple API enabling application developers, system administrators, network engineers, and security engineers to store time-series data at will. This means they can correlate data across all these disciplines to understand the full impact of code changes, network infrastructure changes, or security policies in near real-time.
Millions of people around the world use GitHub to help manage their software development pipeline and to collaborate with like-minded users. Graphite provides GitHub’s engineering and operations teams with the tools they require to collect and correlate massive amounts of time-series data. Scott Sanders, infrastructure engineer at GitHub, describes how he uses Graphite to comprehensively monitor GitHub’s host, application, service, business, and financial metrics:
In our operations team, we use Graphite to drive many of our alerting systems and perform capacity analysis. It’s also heavily used by our development teams to analyze the performance impact of changes as they’re implemented and slowly deployed across our infrastructure. Our business and sales teams use Graphite to track impact and effectiveness of their campaigns and initiatives.
We interact with a large portion of our graphs through ChatOps. A culture of data-driven investigation through chat has enabled us to interact with our data in a manner visible to the entire team participating. In addition, this creates an audit log allowing us to assess our successes and failures well after the event has ended.
Graphite’s scalability and flexibility allows it to be used as a common utility across all departments within GitHub, making it much easier for teams to collaborate and share information using the same key concepts and interfaces.
Etsy is an online marketplace bringing artists, craftspeople, and consumers together to buy and sell unique and homemade goods. Daniel Schauenberg, staff software engineer at Etsy, describes how Graphite fits into its self-service engineering culture to measure and track website performance and customer experience:
We use Graphite heavily to track and monitor etsy.com application metrics. This includes performance of different things like page render times or database requests. These timings come from StatsD, which is our biggest ingress point for Graphite data. We also use counts in there heavily to get the number of, for example, logins or checkouts at any given point.
Graphite makes it really easy to create metrics ad hoc. This means a developer can add instrumentation to their feature with a single line of code and immediately get feedback. It is very self-service and doesn’t require anyone to enable something or explicitly grant access.
One of the largest and most successful video game publishers in the world, Electronic Arts (EA), has to track performance data for hundreds of millions of customers. Steve Keller, systems architect, monitoring at Electronic Arts, talks about how EA uses Graphite to gain insight into platform and user telemetry at a significant scale:
I first started using Graphite about 3.5 years ago to graph data from our monitoring systems. Within a year, I found other teams at EA had begun to use Graphite for collection of internal metrics. We eventually created a larger Graphite infrastructure to combine data from all sources, and now my team manages several Graphite clusters for multiple teams at EA.
Today, countless dashboards and real-time monitoring consoles have been created for EA around the world; we rely on Graphite to give us a very important view into systems performance and the player experience.
Whether your job title is system administrator, software developer, QA engineer, or CEO, it’s crucial that your systems and applications are measured accurately and continuously. Without real-time monitoring and a high-performance analytical data store, we lack the perspective to qualify our current, past, or future performance. Business decisions are increasingly data-driven, and Graphite provides all the tools to help users collect, store, retrieve, and analyze data quickly and effectively.
If this piques your interest, I urge you to grab a caffeinated drink and read on. It’s about to get real up in here.
1 Newer Linux distributions may offer the socat utility in place of netcat. Either one should work fine for this task.