Monitoring systems become critical as you scale. Effective monitoring can drastically ease the maintenance of services.

Having spoken to multiple experts in this field, I have collected the following advice on the subject:

  • Choose your key statistics carefully. Users don't care if your machine is low on CPU, but they do care if your API is slow.
  • Use aggregators; think about services, not machines. If you have more than a handful of machines, you should treat them as an amorphous blob.
  • Avoid the Wall of Graphs. It is slow to load and it's information overload for a human. Each dashboard should have at most five graphs, with no more than five lines per graph.
  • Quantiles aren't aggregable, which makes it hard to get meaningful information from them. However, averages are ...
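
To make the point about quantiles concrete, here is a minimal sketch in Python. The latency numbers, the two machine names, and the `percentile` helper are all made up for illustration. It assumes the contrast being drawn is the usual one: a per-machine quantile cannot be combined into a service-wide quantile, whereas a mean can be rebuilt exactly from per-machine sums and counts.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, so there is no floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (illustrative data only).
machine_a = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]  # healthy machine, 10 requests
machine_b = [200, 900]                                # struggling machine, 2 requests
all_requests = machine_a + machine_b

# Averaging the per-machine p90 values does NOT give the service-wide p90.
avg_of_p90 = (percentile(machine_a, 90) + percentile(machine_b, 90)) / 2
true_p90 = percentile(all_requests, 90)

# A mean, by contrast, can be rebuilt exactly from per-machine (sum, count) pairs.
combined_mean = (sum(machine_a) + sum(machine_b)) / (len(machine_a) + len(machine_b))
true_mean = sum(all_requests) / len(all_requests)

print(f"average of per-machine p90: {avg_of_p90} ms")  # 456.5
print(f"service-wide p90:           {true_p90} ms")    # 200
print(f"mean from sums and counts:  {combined_mean} ms")
print(f"mean over all requests:     {true_mean} ms")
```

With this data the averaged p90 is more than double the real one, because the struggling machine served only two of the twelve requests but gets half the weight. The practical upshot is that a service-level quantile has to be computed from the raw samples or from mergeable bucket counts, not from per-machine quantiles.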

