Chapter 6. The Work Environment

Humans follow incentives, are easily distracted, and forget things. Systems keep evolving. Keep both in mind whenever a human operator is expected to become an integral part of an operational process. Some fundamental problems in monitoring and alerting stem from false assumptions about human nature; others come from underestimating the importance of change. In general, the problem lies in seeing the system as it ought to be rather than as it actually is. The system is dynamic, has many moving parts, and is predictable only to a certain degree. The people who designed it are most often not the ones in charge of 24/7 operations. For that reason, the work environment should foster a flexible culture, one that supports adaptability and encourages growth.

Keeping an Audit Trail

Responding to alerts means dealing with uncertainty. Even in mature IT organizations, outages resulting from changes made by operators, such as new software rollouts, configuration updates, and infrastructure upgrades, account for more than 50% of all outages. Keeping an audit trail and consulting it at the first signs of an outage can therefore reduce the initial uncertainty in roughly every second case, giving the troubleshooter a massive advantage.
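As a rough illustration of consulting an audit trail when an alert fires, the sketch below filters recorded changes down to those made shortly before the alert. The ChangeEvent record, its fields, the recent_changes helper, the two-hour lookback window, and all sample data are assumptions made for this example; they are not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One audit trail entry (hypothetical schema for illustration)."""
    timestamp: datetime
    operator: str
    kind: str      # e.g. "rollout", "config", "infrastructure"
    target: str
    summary: str


def recent_changes(trail, alert_time, lookback=timedelta(hours=2)):
    """Return changes made within `lookback` before the alert, newest first."""
    start = alert_time - lookback
    hits = [e for e in trail if start <= e.timestamp <= alert_time]
    return sorted(hits, key=lambda e: e.timestamp, reverse=True)


# Made-up sample entries.
trail = [
    ChangeEvent(datetime(2024, 1, 10, 8, 0), "alice", "infrastructure",
                "db-cluster", "kernel upgrade"),
    ChangeEvent(datetime(2024, 1, 10, 13, 40), "bob", "config",
                "web-frontend", "raised connection pool size"),
    ChangeEvent(datetime(2024, 1, 10, 14, 5), "carol", "rollout",
                "checkout-service", "deployed new build"),
]

alert_time = datetime(2024, 1, 10, 14, 20)
suspects = recent_changes(trail, alert_time)

for event in suspects:
    print(f"{event.timestamp:%H:%M} {event.kind:<8} {event.target}: "
          f"{event.summary} ({event.operator})")
```

The most recent changes come first because they are usually the first candidates a troubleshooter would want to rule out; the width of the lookback window is a judgment call that depends on how quickly a bad change tends to surface.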

An accurate and complete audit trail does not necessarily have to come at the cost of high manual overhead. It can be largely automated with the help of a publish-subscribe style messaging ...
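One way to picture the publish-subscribe idea is a tiny in-process broker: tools that make changes publish an event, and an audit subscriber records every event it receives. This is only a sketch under stated assumptions. The Broker class, the topic names, and the message fields are hypothetical, and a real deployment would more likely rely on a dedicated messaging system than on an in-memory dictionary.

```python
from datetime import datetime, timezone


class Broker:
    """Minimal in-memory publish-subscribe broker (illustrative only)."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of handler callables

    def subscribe(self, topic, handler):
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        # Deliver the message to every handler registered for this topic.
        for handler in self._subscribers.get(topic, ()):
            handler(topic, message)


audit_log = []


def record(topic, message):
    """Audit subscriber: store each event with the time it was received."""
    audit_log.append({
        "recorded_at": datetime.now(timezone.utc),
        "topic": topic,
        **message,
    })


broker = Broker()
for topic in ("deploy", "config"):  # hypothetical topic names
    broker.subscribe(topic, record)

# Change tooling publishes events; operators do not write log entries by hand.
broker.publish("deploy", {"operator": "carol", "target": "checkout-service"})
broker.publish("config", {"operator": "bob", "target": "web-frontend"})
broker.publish("chat", {"operator": "dave", "target": "n/a"})  # no subscriber

for entry in audit_log:
    print(entry["topic"], entry["operator"], entry["target"])
```

Because publishers only need to know the topic name, new change sources can be added without touching the audit subscriber, which is what keeps the manual overhead low.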
