Chapter 11. Incident Investigation

Incident investigation can be one of the most stressful and difficult areas of software development. The stress is compounded by how critical the situation is, how difficult the problem is, and how quickly it needs to be resolved. Some of the most critical situations include production outages and issues that endanger deadlines and milestones. Incidents can occur for many different reasons. Some are caused by defective software (both internal and third-party) or hardware failures. Others are caused by external influences such as users, transactions, and operators. For example, a website may occasionally suffer performance degradation because the number of users or transactions exceeds tolerances at the time. Degradation can also occur because backups are running in the background, or for a number of other reasons. Incidents can range from simple misunderstandings about software usage, configuration, or functionality to complex functional and technical issues.

Incident investigation is concerned with understanding what the issue is, when it happened or happens, what caused it, and, finally, what corrective actions, if any, should be put in place to ensure that it doesn't happen again, or what should be done if it does. Not all incidents are resolved by modifying or reconfiguring software or hardware; some can be resolved through further user and/or operator training, education, and procedures. Incidents will happen, ...
