Chapter 48. Sometimes the Fix Is the Problem
Jake Pittis
If simpler systems fail less and are faster to restore, why do incident reviews focus so much on adding fixes rather than on removing components and code? When we should be reducing bug surface area and increasing operator understandability, we instead lean toward adding validation, sanity checks, traffic shifting, and synchronization—all things that add complexity. Even just fixing bugs can end up adding extra code and complexity.
Complexity is often justified on the grounds that its reliability benefits outweigh the risk that it causes future incidents. At the end of the day, some complexity is necessary for business functionality, just as some complexity is necessary for reliability. But how often do we focus on trying to remove excess complexity?
Incident reviews are a perfect opportunity to target and remove detrimental complexity. Sometimes it is code that increases bug surface area; sometimes it is something that makes systems harder to understand and slows incident response. In both cases, if we can show that it contributed to the incident and that it’s not necessary for reliability or business functionality, then it should be considered detrimental complexity and be removed.
Incidents give us the space to zoom out and notice detrimental complexity. If a bug leads to an incident, we can ask ourselves whether ...