Spoon feeding in the long run teaches us nothing but the shape of the spoon.
The distributed datastore is down. Writes are failing on all replicas and reads are timing out. The SRE on call checks the monitoring; there are no clues as to the cause, yet it’s clear this key production service is in a bad state: the error and latency graphs are the only ones going up and to the right. Revenue is being lost. The on-caller declares a production incident.
The VP of engineering storms in, demanding to know what’s going on.
The other SREs in the room just laugh. Why? Because this is Incident Manager, a game designed to teach incident response skills and teamwork, and the current player just drew a bad card.
Incident management is a key SRE skill that can be learned, and it’s much better for your organization’s SLO budget (and the stress levels of your SRE team) to learn it through a fun and effective game than during an actual production incident.
Because SREs are both generalists and experts (one of the major reasons they are difficult to hire), they are constantly learning.
The skillset of an SRE can span operating system internals, networking, monitoring and alerting, troubleshooting, debugging, incident management, software engineering, software performance, hardware, distributed systems, systems administration, capacity planning, security, and many other areas. Not all SREs are expert in ...