Success depends upon previous preparation, and without such preparation there is sure to be failure.
People ask me about troubleshooting Graphite performance as if it’s some dark art practiced only by the most arcane operators in the bleakest corners of the largest data centers. Fortunately, nothing could be further from the truth.
Many of Graphite’s newer competitors attempt to hide their operational costs with “cluster in a box” designs, leaning heavily on NoSQL-type backends. Sadly, many of these projects are immature and have left a trail of data corruption and data loss in their wake. In many cases, the user is unprepared for the consequences, with little choice but to “blow it all away” and start over again.
By contrast, Graphite embraces traditional UNIX systems to store and retrieve time-series data. It may not be as sexy as the latest data store appearing on Hacker News, but the use cases, tooling, and failure scenarios are well documented and widely understood. The average system administrator understands files and filesystems, and how to repair them when something goes sideways. To paraphrase Dan McKinley, boring technology just gets the job done.
In this chapter, we’re going to build a foundation of the skills and tools you’ll need to respond to just about any Graphite troubleshooting or scaling challenge. We’ll investigate what happens when you tweak a variety of performance-impacting configuration ...