Foreword
“I know we don’t have tests for that, but it’s a small change; it’s probably fine...”
“I ran the same commands I always do, but...something just doesn’t seem quite right.”
“That rm -rf sure is taking a long time!”
If you’ve worked in software operations, you’ve probably heard or uttered similar phrases. They mark the beginning of the best “Ops horror stories” the hallway tracks of Velocity and DevOps Days the world over have to offer. We hold onto and share these stories because, back at that moment in time, what happened next to us, our teams, and the companies we work for became an epic journey.
Incidents (and managing them, or...not, as the case may be) are far from a “new” field: indeed, as an industry, we’ve experienced incidents as long as we’ve had to operate software. But the last decade has seen a renewed interest in digging into how we react to, remediate, and reason after the fact about incidents.
This increased interest has been largely driven by two tectonic shifts playing out in our industry: the first began almost two decades ago and was a consequence of a change in the types of products we build. An era of shoveling bits onto metallic dust-coated plastic and laser-etched discs that we then shipped in cardboard boxes to users to install, manage, and “operate” themselves has given way to a cloud-connected, service-oriented world. Now we, not our users, are on the hook to keep that software running.
The second industry shift is more recent, but just as notable: ...