On-callHardoc-g474

Subject Feature flagLevel Senior–Staff~40 minCommon in Code quality & review interviewsIndustries Technology, Software development

Question

A deploy upgrades a shared resilience LIBRARY (circuit breaker / retry) used by the `orders` service from 3.x to 4.x. The team's own feature flags are untouched. Forty minutes after deploy, during a brief downstream blip, `orders` melts down far worse than usual: a tiny 1% downstream error rate amplifies into a 40% orders error rate and a retry storm, where before the same blip was a non-event. Dashboards: outbound retry count to the downstream is ~5x historical; the circuit breaker never opens. Context: library 4.x changed the DEFAULT circuit-breaker behavior — the breaker is now OPT-IN (disabled unless explicitly enabled) where 3.x had it ON by default, and the upgrade didn't add the now-required enable flag. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.