Improving bad on-call with the Snowball Effect

Over the past 8 years at Amazon, I’ve worked on several different teams. Each one had an on-call rotation in which an engineer was responsible for keeping the system running 24/7 for a week. If something broke at 2am, they’d get paged to fix it.

Now, Amazon’s a big company, and on-call varied quite a bit. Some teams had a heavy ops load, others had barely any. I had my fair share of weeks with lots of tickets, but I usually sought out teams where on-call was more manageable. However, on-call engineers frequently struggled to get anywhere, playing a bit of on-call hot potato with the next rotation. Sadly, Amazon largely did not leverage SREs or dedicated support groups except for the most critical systems. I do wish it had.

Here’s the strategy I employed whenever I joined a new team.

The Situation

On-call was generally structured as a week-long duty that rotated through the engineers on the team. During their week on duty, the on-call engineer was responsible for keeping the system operational and responding to any emergent issues. Tickets were cut based on impact to the business and customers. Sev2s were cut any time there was a customer-impacting (or soon-to-be-impacting) issue, though sometimes they were cut too aggressively and the engineer would either downgrade the ticket or fix the issue after responding.

The ticket queues would often have a pile of tickets just waiting for a response, and nobody would respond because they didn’t have time. Things just stayed bad, and engineers would complain about the operational excellence (OE) load.

Then during the weekly OE hand-off meeting, the previous on-call would go through the list of problems, scroll through dashboards, and give out action items. It was generally a state of despair.

Let’s walk through the strategy I employed to improve my teams.

The Strategy

Accomplishments

To instill ownership in the team, I introduced the goal that every on-call would accomplish one thing to make the next on-call’s life easier. Each on-call was to present their accomplishment during the hand-off. It could be big, like building a new ops tool, or small, like fixing an alarm that was too noisy, but I drew a clear line between routine work and an actual improvement.

Examples of things that wouldn’t qualify, but are still important to do:

  • Resolving a ticket

  • Executing a manual deployment

  • Responding to a customer request

Examples of good accomplishments:

  • Fixing an alarm that was too noisy

  • Taking steps towards CI/CD by introducing better pipeline approval workflows

  • Spending time to root-cause a recurring issue and proposing a permanent fix

Sometimes engineers claimed they didn’t have time to do this because there were too many issues. That showed good ownership: they wanted to fix all the problems. But that approach just doesn’t scale. Closing tickets rarely reduces the number of tickets, and as your service scales, you just get more and more of them. I’ve been there, seen that, so I aligned with the team’s manager and the team that it was okay to fix one fewer issue as long as you improved one thing instead.

At first it was tough for on-calls to adopt this mindset, but week by week they’d slowly fix things, and bit by bit the common pain points were reduced, further freeing up time.

Reduce Aggressive Paging

I signed up to receive an email alert every time an engineer got paged, and I reviewed each one. I questioned: was there really a problem here? Did you really need to get paged at 2 o’clock in the morning? Did you do anything? Could it have waited until 9am? How do we prevent it from happening again?

Sometimes the team would say, “but this alarm could tell us something’s wrong.” How many times has it caught an actual issue? Are there more direct measurements of the problem? For example, an alarm on the number of 4xx response codes on an HTTP server sounds like a good thing to have, but there are all kinds of challenges. Do you use a percentage or a raw count? Obviously a percentage, but then what happens when a low-traffic API gets a few transactions per second and somebody keeps refreshing a 404 page? Boom, an engineer gets paged. Even if an alarm could surface a real problem, if it’s not directly measuring one and has mostly just been annoying in the past, it’s better to be pragmatic and change thresholds, or even disable the alarm, until you’re in a better state.
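To make that concrete, here’s a minimal sketch (in Python, not the actual alarm tooling we used) of the kind of rule I’d push teams toward: only page when the 4xx rate is high and there’s enough traffic for the percentage to mean anything. The names and thresholds are illustrative assumptions, not values from a real system.

```python
# Sketch of an alarm rule that pages only when the 4xx rate is high AND there
# is enough traffic for the percentage to be meaningful. Hypothetical names
# and thresholds for illustration only.

from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts observed over one evaluation window (e.g. 5 minutes)."""
    total_requests: int
    client_errors: int  # 4xx responses


def should_page(
    window: WindowStats,
    min_requests: int = 100,             # ignore windows with too little traffic
    error_rate_threshold: float = 0.25,  # page only if >25% of requests are 4xx
) -> bool:
    """Return True only when the error rate is both high and based on real volume."""
    if window.total_requests < min_requests:
        # A handful of refreshes on a missing page shouldn't wake anyone up.
        return False
    error_rate = window.client_errors / window.total_requests
    return error_rate > error_rate_threshold


if __name__ == "__main__":
    # Someone hammering refresh on a 404 page at 2am: no page.
    print(should_page(WindowStats(total_requests=12, client_errors=9)))       # False
    # Sustained traffic where over a quarter of requests are failing: page.
    print(should_page(WindowStats(total_requests=5000, client_errors=1800)))  # True
```

Most monitoring systems can express this same “rate AND minimum volume” condition directly in the alarm definition, which is usually preferable to running a custom evaluator; the sketch just shows the shape of the rule.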

Use every situation where an engineer got paged as an opportunity to aggressively change thresholds or remove alarms, until only high-confidence problems are waking engineers up at 2 in the morning.

No doom-scrolling in the OE hand-off

Poorly structured OE hand-off meetings often devolve into scrolling through dashboards, pointing at things and asking “oh, what’s that spike?”, or scrolling through the list of issues that need to be fixed. This quickly wastes the entire team’s time; especially on conference calls, people check out and start doing other things. Everybody already knows there are problems.

Instead, the hand-off meeting should focus on the key problems the on-call faced and the steps the next on-call should take. At the start of their rotation, on their own time, the incoming on-call should review the dashboards and the items from the previous rotation and identify one or two goals they’ll get done that week.

Conclusion

This is not a comprehensive list of everything that will immediately improve operations. However, by taking a few steps here and there, your team will eventually get better. It’s like a snowball: knock out a few small things, then bigger and bigger problems, and eventually you’ve got a snowman.

