Chapter 16. Monitoring and Preemptive Support

If you have read this far, you must really be out to impress everyone with a system that is not only scalable and reliable, but also highly available. With the right tools and approach, the five nines once reserved for telecom systems are now easily attainable in whatever other vertical for which you might be developing software. But implementing everything described in the previous chapters is not enough. Just as important as resilient software, redundant hardware, networks, power supplies, and multiple data centers, your secret sauce to high availability is achieving a high level of visibility into what is going on in your system and the ability to act on the information you collect.

Your DevOps team will use all this information for two purposes: preemptive support and postmortem debugging. Monitoring the system will allow them to pick up early warning signs and address problems before they get out of control, either manually or through automation. Is your disk filling up? Trigger a script that does some housekeeping by deleting old logs. Has your load been increasing steadily over the past months as a result of an increase in registered users and concurrent sessions? Deploy more nodes to help manage the load before running out of capacity.

No matter how much of an optimist you might be, you will not be able to catch all problems and bugs before they manifest themselves. Sometimes things go wrong, making you rely on higher layers of ...

Get Designing for Scalability with Erlang/OTP now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.