by David Lutz
A standard classification for incidents gives all involved a common language to describe what’s going on.
Why bother? And why have so many levels?
I think it’s important to track the kinds of things engineers are being woken up for and to deliver a response that’s suited to the problem.
Severity levels defined
- Sev1 Complete outage
- Sev2 Major functionality broken and revenue affected
- Sev3 Minor problem, bug
- Sev4 Redundant component failure
- Sev5 False alarm or alert for something you can’t fix
Whenever the pager goes off, it’s an incident. All these kinds of incidents need different responses.
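One way to make the levels concrete is to write them down in code that your alerting tools can share. The sketch below is only an illustration: the `Severity` enum mirrors the list above, but the `should_wake_someone` helper and its Sev2 cut-off are assumptions of this example, not a rule from the list. Pick the cut-off that fits your own organization.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The five levels from the list above; a lower number is more severe."""
    SEV1 = 1  # Complete outage
    SEV2 = 2  # Major functionality broken and revenue affected
    SEV3 = 3  # Minor problem, bug
    SEV4 = 4  # Redundant component failure
    SEV5 = 5  # False alarm or alert for something you can't fix


# Assumption for this sketch: only Sev1 and Sev2 justify waking an engineer.
# Everything else waits for working hours.
WAKE_THRESHOLD = Severity.SEV2


def should_wake_someone(severity: Severity) -> bool:
    """Return True if this severity should page the on-call engineer right now."""
    return severity <= WAKE_THRESHOLD


if __name__ == "__main__":
    for sev in Severity:
        action = "page on-call now" if should_wake_someone(sev) else "leave till morning"
        print(f"{sev.name}: {action}")
```

Using an `IntEnum` keeps the levels comparable, so "Sev2 or worse" becomes a one-line check.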
Classifying them might appear difficult. But it isn’t really. Here’s an automotive example.
- Your car runs out of fuel. = Sev1
- Your clutch is busted. You can drive but only in first gear. = Sev2
- One headlight has blown. = Sev3
- You find your car has a flat tyre. You change the tyre and drive to your destination. = Sev4
- The low fuel warning light is stuck on even though you just filled the tank. = Sev5
Everyone in your organization should be trained to use this terminology, especially front-line support people. They should feel comfortable saying “Guys, we have a Sev1, call the on-call engineer immediately” if that’s the case.
Track the frequency of these every week. Put ’em in a spreadsheet. Make sure people know what’s going on.
If you’re getting paged for Sev4 and Sev5, you need to change something to stop it. Sleep is precious. We have !SPOF (no single point of failure) for a reason: when a redundant component fails, the service keeps running, so the fix is best left till morning. Perhaps the thresholds are set wrong?
Don’t alert on something you can’t fix. An alert that nobody can act on is a deeper problem that you need to address as an organization; it’s not the responsibility of whoever is on call.
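If a spreadsheet feels too manual, the weekly tally is a few lines of Python. This is a minimal sketch, assuming you can export incidents as (date, severity number) pairs; the function names `weekly_counts` and `noisy_pages` and the sample data are made up for illustration.

```python
from collections import Counter
from datetime import date


def weekly_counts(incidents):
    """Count incidents per (ISO year, ISO week) and severity number."""
    counts = Counter()
    for day, severity in incidents:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[((iso[0], iso[1]), severity)] += 1
    return counts


def noisy_pages(counts):
    """Per week, total the Sev4 and Sev5 pages: the ones that should not exist."""
    noise = Counter()
    for (week, severity), n in counts.items():
        if severity >= 4:
            noise[week] += n
    return dict(noise)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        (date(2024, 1, 1), 1),
        (date(2024, 1, 3), 4),
        (date(2024, 1, 4), 4),
        (date(2024, 1, 8), 5),
    ]
    counts = weekly_counts(sample)
    for week, n in sorted(noisy_pages(counts).items()):
        print(f"{week[0]}-W{week[1]:02d}: {n} Sev4/Sev5 pages to eliminate")
```

Any week with a non-zero noise count is a prompt to fix a threshold or drop an alert, not to ask more of the on-call engineer.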