Monitoring systems become critical as you scale. Effective monitoring can drastically ease the maintenance of services.
Having spoken to multiple experts in this field, I have collected the following advice on the subject:
- Choose your key statistics carefully. Users don't care if your machine is low on CPU but they do care if your API is slow.
- Use aggregators; think about services, not machines. If you have more than a handful of machines, you should treat them as an amorphous blob.
- Avoid the Wall of Graphs. It is slow, and it's information overload for a human. Each dashboard should have five graphs with no more than five lines per graph.
- Quantiles aren't aggregable, and they're hard to get meaningful information from. However, averages are ...
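
To illustrate the point about quantiles, here is a minimal sketch comparing two ways of rolling per-machine numbers up to the service level. The machine names and latency values are made up for illustration, and `percentile` is a hypothetical helper using the simple nearest-rank definition rather than any particular monitoring tool's method. It shows that averaging per-machine 90th percentiles does not give the service-wide 90th percentile, whereas per-machine sums and counts do combine into the exact service-wide mean.

```python
# Hypothetical per-machine request latencies in milliseconds (made-up data).
latencies_by_machine = {
    "web-1": [10, 10, 10, 10, 10, 10, 10, 10],  # busy, fast machine
    "web-2": [100, 200],                        # quiet, slow machine
}


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Service-wide view: pool every sample from every machine.
all_samples = [v for samples in latencies_by_machine.values() for v in samples]
true_p90 = percentile(all_samples, 90)
true_mean = sum(all_samples) / len(all_samples)

# Naive aggregation: average the per-machine p90 values.
per_machine_p90 = [percentile(s, 90) for s in latencies_by_machine.values()]
averaged_p90 = sum(per_machine_p90) / len(per_machine_p90)

# Mean aggregation: each machine only reports (sum, count), which merge exactly.
per_machine_totals = [(sum(s), len(s)) for s in latencies_by_machine.values()]
merged_mean = (
    sum(total for total, _ in per_machine_totals)
    / sum(count for _, count in per_machine_totals)
)

print("true p90:", true_p90, "averaged per-machine p90:", averaged_p90)
print("true mean:", true_mean, "merged mean:", merged_mean)
```

With this data the averaged p90 comes out at 105.0 ms while the pooled p90 is 100 ms, and the gap grows as traffic becomes more uneven across machines. The merged mean matches the pooled mean exactly because sums and counts are additive.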