Design – Build – Run: Applied Practices and Principles for Production-Ready Software Development
by Dave Ingram
Chapter 21. Designing for Incident Investigation
Incident investigation often follows an error or abnormal event being raised. Developers are constantly involved in incident investigation throughout the lifecycle, not just in live service. Defects raised during testing need to be resolved quickly and efficiently. It is often during testing that you find that not only are the events inadequate, but the logging, tracing, auditing, and tooling are, too. Developers often spend so much time trying to get the functionality right that they forget the instrumentation and diagnostics. I've mentioned this before, but there's nothing worse than being woken at 3 A.M. to investigate a problem, only to find that the event itself contains no real information and then, to add insult to injury, that the logs don't tell you anything conclusive, either.
Incident investigation is all about getting to the root cause of the problem, re-creating the issue, analyzing and defining a solution, and, ultimately, implementing that solution. The quicker you can achieve this goal, the sooner you can go (back) to bed. You've seen how good diagnostics provide a great starting point. This chapter looks a little bit further into the actual actions that follow and what you can do to really provide value to this process.
This chapter is organized into the following sections:
Tracing — Examines the tracing that should be included in the solution components. It also looks at some important practices that you should consider ...